pleroma.debian.social

Oh, look, a rabbit hole.

For a bit of context: I've been playing with go-away lately, trying to move the passive detection I built in Caddy to it. Progress is being made.

But then I came across a thing that feels like it would be even closer to what I had in mind. Different tradeoffs, but ones that feel less pricey, perhaps.

So now I'm down the rabbit hole of figuring out whether my gut feeling is right.

What if...

example.com {
  iocaine /run/iocaine.socket
  reverse_proxy whatever
}

In other words, what if iocaine (or a separate service, but in this case, iocaine itself feels more practical) could do the classification too?

And what if that classification was somewhat programmable?

I could then build a Caddy module that serializes the request headers, sends them over, and lets iocaine decide whether it wants to respond, or lets caddy pass through to the next handler.

"What do you mean by 'somewhat programmable'?"

filtermap main(request: Request) {
  if request.user_agent.contains("GPTBot") {
    accept garbage.generate(request)
  } else {
    reject
  }
}

So, first things first: let's build a small crate that lets me implement the classification rules I use now (but better), and see how convenient that is, and how fast.

Then figure out how to wrap that in a Caddy module.

Did I say iocaine /run/iocaine.socket?

How about this:

reverse_proxy iocaine:port {
  @fallback status 421
  handle_response @fallback {
    reverse_proxy whatever
  }
}

Looky. No Caddy module needed at all.

I like this rabbit hole. It is comfy. There are books here, ambient lighting, and a soothing deep voice is calling out to me to explore deeper.

It promises iocaine 3.0.

Or maybe 2.2, because there's nothing breaking here.

This feels like alien technology. My classification rules will be so much simpler, and more correct. And best of all:

EASILY SHAREABLE AND REUSABLE

Like, I can make a number of helper functions, collected in a package, and anyone can import that package, and use whichever functions they like.

Or they can just import my entire classification as-is.

Or whatever. The possibilities are endless.

Now I "just" need to verify this would work as I imagine it would.

Eh. Not going as smoothly as I had hoped. The idea is solid, though. But I might need to fiddle with some implementation details.

So, Big Reveal™, I guess: I've been looking at roto as a language to write classification rules for iocaine in.

There's some very neat things in there, but I'm having trouble working with strings.

So, I guess I'll go look for a language I can use for classification purposes, which I can embed in Rust. I don't have requirements as high as NLnet Labs did for Roto.

I'm fine with dynamically typed languages, and it is okay if it is not the fastest. Chances are, it will be faster than my Caddyfile contraption no matter what. So what I'm looking for is a language that feels fine, and one where the embedding part is reasonable too.

What I am not looking for is embedding a non-Rust language in Rust. PyO3, mlua and the like are not an option. I want a rust-y thing.

@algernon on the basis of the penultimate toot, I’d recommend looking at crm114. But it likely fails the criteria in the toot I’m replying to

Currently looking at: Rhai, Rune, and Dyon.

Unsure about Dyon; it feels a bit too complex for my tastes, and I found the documentation of Rhai and Rune more approachable. Down to two.

Rhai's docs say: "No first-class functions – Code your functions in Rust instead, and register them with Rhai". On the other hand: functions.

Rune feels more like the thing I'm looking for.

But these are just gut feelings. Based on how the embedded language looks and feels, all three would be acceptable. So I'll look at how to add them to iocaine, and see which one fits best.

I'll start with Rune.

...and I'm exploring Rhai instead, because the Rune book doesn't show a minimal "how to embed Rune" example. The first such example includes termcolor, and says it can be made simpler if that's not needed, but doesn't show how.

Rhai's documentation has an embedding guide. That helps a ton.

I think Rhai will work.

let user_agent = headers.get("user-agent");

user_agent.contains("GPTBot") ||
    (user_agent.starts_with("Mozilla/") && user_agent.contains("Chrome"))

If the script evaluates to true, iocaine will serve it garbage. If not, iocaine returns a 421, and the reverse proxy can serve the real stuff.
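For illustration, here's the true/421 split as a plain-Rust sketch (my approximation, not iocaine's actual code; the real handler also renders the garbage body):

```rust
/// Plain-Rust rendition of the example Rhai script above, for illustration.
fn classify(user_agent: &str) -> bool {
    user_agent.contains("GPTBot")
        || (user_agent.starts_with("Mozilla/") && user_agent.contains("Chrome"))
}

/// true → iocaine serves garbage; false → 421 Misdirected Request,
/// and the reverse proxy falls through to the real content.
fn status_for(serve_garbage: bool) -> u16 {
    if serve_garbage { 200 } else { 421 }
}

fn main() {
    assert_eq!(status_for(classify("GPTBot/1.0")), 200);
    assert_eq!(status_for(classify("curl/8.13.0")), 421);
}
```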

I now need a couple more helpers on the rust-side (regexp matching, for example), and then I can implement the "stdlib" on the Rhai side.

Why on the Rhai side? Because I want the stdlib to be separate, so that I can improve and release it independently, and a stdlib update shouldn't need an iocaine rebuild, just a restart (or later, once I implement it, a reload).

Ooof. Initial implementation works, but if the classifier script is in play, there's a massive slowdown. From ~93k req/sec down to ~38k req/sec.

The good news is, if no classifier is set, the speed is indistinguishable from the speed of version 2.1.0.

Nevertheless, I'll try to make it a little faster.

Up to 50k req/sec now, and I'm not done yet.

Hrm. Not sure I can make it much faster tonight... that would require more lifetime juggling than I have the capacity for.

Hmm!

A simple classifier that always returns true (or false) is barely slower than running without one (~86k vs ~93k req/sec). So it might be the helper functions that slow things down.

That would make sense, actually. And I have ways to make them faster.

Hrm. Or maybe I won't... user_agent == "fasthttp" drops it down to ~50k req/sec. The regexp matcher (which I suspected to be the culprit) doesn't make much of a difference it seems.

OTOH, if I return false, and do not generate the garbage, speed goes up to 73k req/sec.

That... makes sense, too.

Time to sleep. I pushed the feature/scripting branch meanwhile.

There are no docs or tests yet, but it's reasonably straightforward: put a Rhai script somewhere, set [server].classifier to its path, and you're done.

The script needs to return a bool (true to make iocaine generate garbage, false to make it signal the reverse proxy to serve the real stuff), and has access to the headers variable. It has a .get(name) method with which one can retrieve any header.

Strings also have a few extra methods on top of what Rhai provides out of the box: .matches(pattern), .index_of(pattern), and .capture(pattern, group_name).

They kinda do what you would expect.
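For a feel of what "what you would expect" means, here are plain-Rust sketches of two of those helpers (my approximations, not iocaine's implementations; `.capture` needs a regex engine and is omitted):

```rust
/// `.index_of(pattern)`: index of the first occurrence, or -1 if absent.
fn index_of(haystack: &str, needle: &str) -> i64 {
    haystack.find(needle).map(|i| i as i64).unwrap_or(-1)
}

/// `.matches(pattern)`: approximated here with a substring check;
/// the real helper matches a regexp.
fn matches(haystack: &str, pattern: &str) -> bool {
    haystack.contains(pattern)
}

fn main() {
    let ua = "Mozilla/5.0 AppleWebKit/537.36 Chrome/120.0";
    // The kind of ordering check the classifier cares about:
    assert!(index_of(ua, "Chrome/") > index_of(ua, "AppleWebKit/"));
    assert_eq!(index_of(ua, "GPTBot"), -1);
    assert!(matches(ua, "Chrome"));
}
```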

These should be enough to re-build my Caddyfile classifier in Rhai. Then, I'll have to benchmark which one is faster, I guess. That will be a bit rough, but I'll figure something out.

If iocaine's built-in classifier is faster, then I won't care much (for now) about speeding it up further.

I'm half-tempted to extend this further, and make it not just a classifier, but move some of the decision logic into this script, too.

Like, the templates are currently responsible for dispatching based on... whatever they wanna dispatch on. But what if it was the Rhai script that would do that?

If I want to make it possible to return custom HTTP statuses, perhaps even add custom headers and whatnot, then letting the Rhai script control more of iocaine would make sense.

BUT! That would likely slow things down considerably. I guess I'll stick to classification for now.

What I will have to do, though, is change the script response from bool to Verdict, where Verdict would be an enum-like thing.

Basically, I want to be able to send information to the reverse proxy, via response headers. So I need Verdict::Bad(REASON) and Verdict::Good(REASON). What REASON should be, I don't know yet. Maybe just a string.

But this is a problem for after-sleep. Or during-sleep. One of those two.

...and I have a few ideas for the stdlib, too.

Great fun will be had tomorrow! Can't wait!

Small optimizations were made, speed is up to 51k req/sec. Still considerably slower than the ~93k without the classifier script. Will have to compare it against the Caddy-based classifier soonish.

I also implemented a verdict module, so the script can return true / false or verdict::GOOD / verdict::BAD, or even verdict::good("reason") / verdict::bad("reason"). If a reason is given, iocaine will send it upstream in the x-iocaine-reason header.
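A minimal sketch of what such a verdict type could look like on the Rust side (my guess at the shape, not the actual iocaine code):

```rust
/// Hypothetical verdict type: Bad gets garbage, Good gets a 421 so the
/// proxy serves the real content; either may carry a reason.
#[derive(Debug, PartialEq)]
enum Verdict {
    Good(Option<String>),
    Bad(Option<String>),
}

impl Verdict {
    /// Should iocaine generate garbage for this request?
    fn is_bad(&self) -> bool {
        matches!(self, Verdict::Bad(_))
    }

    /// Value for the x-iocaine-reason header, if a reason was given.
    fn reason(&self) -> Option<&str> {
        match self {
            Verdict::Good(Some(r)) | Verdict::Bad(Some(r)) => Some(r),
            _ => None,
        }
    }
}

fn main() {
    let v = Verdict::Bad(Some("ai-robots-txt".to_string()));
    assert!(v.is_bad());
    assert_eq!(v.reason(), Some("ai-robots-txt"));
    assert_eq!(Verdict::Good(None).reason(), None);
}
```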

Next steps:

  1. Tests
  2. Benchmarks
  3. Benchmarks against Caddy
  4. Commit history cleanup
  5. Merge into main

running 7 tests
test means_of_production::test::test_classifier_regex_error ... ok
test means_of_production::test::test_classifier_bad_return_type ... ok
test means_of_production::test::test_classifier_static_verdict_const ... ok
test means_of_production::test::test_classifier_static_bool ... ok
test means_of_production::test::test_classifier_static_verdict_with_reason ... ok
test means_of_production::test::test_classifier_header_get ... ok
test means_of_production::test::test_classifier_regexps ... ok

test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 12 filtered out; finished in 0.01s

flan_brick

     Running benches/classifier.rs (target/release/deps/classifier-fe88b2cd73d9304d)
Timer precision: 30 ns
classifier             fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ classifier_branchy  164.1 µs      │ 554.2 µs      │ 169.2 µs      │ 173 µs        │ 28795   │ 28795
╰─ classifier_static   177.5 ns      │ 12.71 µs      │ 183.8 ns      │ 186.7 ns      │ 1591640 │ 25466240

flan_brick

Although, now I wonder what makes classifier_branchy so slow here. Might look into that later.

Ok, now comes the hard part: benchmarking against Caddy.

Initial results are unfortunately not very promising, though the classification script I'm testing right now is rather primitive, a single regexp, rather than the complicated mapping stuff I do on Eru.

regexp matching via Rhai is slow, though, that's for sure.

I guess I'll have to benchmark with my entire classification engine, to have fair numbers.

I has a sad. My elaborate Caddyfile-based classifier is still almost twice as fast as even the most trivial iocaine-based classifier.

I suspect that serializing & deserializing all headers all the time adds considerable overhead.

And I discovered another downside: if iocaine is doing the classification, then implementing rate limits for the maze alone becomes a whole lot more complicated.

On the flip side, the iocaine-classifier is far more expressive. It's just a tad slow.

You know what... I'll try to rewrite my classifier in Rhai anyway, to see how much slower that is.

That'll give me more datapoints about where to optimize, too.

Yikes. We're entering 4k req/sec territory. That is prohibitively slow, unfortunately.

There's a non-negligible Caddy overhead here (about 1k req/sec, slightly less, maybe), but bombardiering the classifier-enabled iocaine is still slow as a snail.

Maybe I should profile it... but I'll go with a gut feeling first.

An annoying part of this benchmarking stuff is that compiling iocaine in release mode takes an eternity and a half.

sigh

Unfortunately, gut feeling was wrong. I figured that passing all of the headers into the Rhai scope would be costly, so I tried passing the user agent only - but that didn't make a difference as far as speed goes.

At this point, I can't switch, even if the scripting interface is more expressive, because this kind of slowing down is not acceptable.

So on to the next idea: what if I used something else than Rhai? I can give Roto another try.

I have a slightly better understanding how it all fits into iocaine, so maybe I can make that work.

@algernon I admire your ability to go down a wrong path, spend 2 days of effort on it, realize it won't work, and try again without rage quitting

@wolf480pl There's no reason to rage: I verified that the idea works, and that I can let iocaine do the classification without having to write a Caddy module. This is a huge leap forward, because I've been trying to do this for the past month.

Yes, the language I chose to embed ended up being prohibitively slow. But that's a tiny implementation detail. The bigger takeaway is that the idea works, and I don't need a Caddy module. That concludes a month of pondering!

I do need to find something faster, but... two days compared to the past month is a drop in the bucket. I made huge progress! I'm gonna drink a celebratory coffee. flan_coffee

@algernon hmm okay

But if you know the idea works, what motivates you to keep going?

@algernon well ok I guess there's the suspense of "will I be able to get >50k req/s out of it"

@wolf480pl @algernon why would you stop at this point?

@wolf480pl I want to replace my Caddyfile-based classification, because while it works, there's multiple problems with it:

  • It is almost completely unshareable.
  • There are things I can't do within the constraints of a Caddyfile, which I could if I used a more suitable programming language.

The motivation is to move the classification out of Caddyfile, and make it at least as performant as that was, because that lets me do things I can't do now, and also lets me share my setup more easily, in a way that people can mix and match various pieces together.

@algernon while I see how scripting can be a fun engineering challenge, I'd be better served as a user by compatibility with anubis' rulesets syntax:

  • if you released any scripting capabilities, I'd just use it to translate these rules
  • pure-rust evaluation of a limited set of conditions should perform better
  • I cannot think of a lot of use-cases that cannot be matched by a config file

@xavier Various importers are next on the roadmap =)

The reason I'm aiming at an embedded language is because I'd like to write rules like this:

if user_agent.index_of("Chrome/") > user_agent.index_of("AppleWebKit/") {
  return Verdict::BAD
}

...and subtleties like this. Yes, this can be expressed in a yaml config file too, but that quickly ends up in a yaml soup, a huge primordial mess, much like my current Caddyfile.

My gut's telling me that I can achieve acceptable speeds with an embedded language too, and can save myself from yaml. If that's doable, I can still bolt various translators on top to make the user experience nicer too.

But if I fail to achieve good speeds with a language, I will fall back to Anubis-style rulesets.

@baturkey @algernon because the mystery has been solved

@wolf480pl @baturkey That's only part of it, though. While the mystery is solved, the goal has not been achieved yet.

Also, someone stole my brakes. And I got high from inhaling too much Rust. So I'm just standing here, flailing my arms while iocaine compiles in release mode, and every time minor progress is made I'll do a little dance.

(If you're questioning my sanity, please tell it to come home, I haven't seen it in a few decades.)

Initial hackery with Roto: incredibly naive implementation that rejects everything, and rebuilds the entire runtime on each request: ~25k req/sec (in release mode).

Same thing but building the runtime once, and then just calling the function for each request: 49k req/sec in debug mode. 277k req/sec in release mode.
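The "build once, call per request" pattern is the usual shape for embedding a script engine; a generic stdlib-only sketch, with a plain closure standing in for the compiled Roto function:

```rust
use std::sync::OnceLock;

// A closure stands in for the compiled script function here; the point is
// only that the expensive setup runs exactly once, and each request is
// just a function call.
static CLASSIFIER: OnceLock<Box<dyn Fn(&str) -> bool + Send + Sync>> = OnceLock::new();

fn classifier() -> &'static (dyn Fn(&str) -> bool + Send + Sync) {
    CLASSIFIER
        .get_or_init(|| {
            // Imagine runtime construction + script compilation here.
            Box::new(|ua: &str| ua.contains("GPTBot"))
        })
        .as_ref()
}

fn main() {
    // Per-request path: no rebuild, just a call into the cached function.
    assert!(classifier()("GPTBot/1.0"));
    assert!(!classifier()("curl/8.13.0"));
}
```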

This is promising. But the question remains: can I build it in a way that lets me implement the rules I want? That's where I bled out last night when I first tried Roto.

Strings are still a bit of a bitch with Roto.

PROGRESS!

TEHEHEHEHE.

filtermap main(request: Request) {
  let user_agent = request.header("user-agent");

  if user_agent.equals("curl/8.13.0") {
    reject "curl"
  } else {
    reject "not curl"
  }
}

74k req/sec in debug mode. The language is a bit more verbose, less nice than Rhai, but god damn, it is fast.

Huh, interesting. ~99k req/sec against the same script when bombardiering iocaine directly. ~23k req/sec when bombardiering through Caddy.

Just made sure, classification on that path is disabled. That's a significant overhead.

Disabled all other snippets: ~38k. Similar thing with the Caddyfile classifier: ~49k.

This is interesting, because I know that iocaine can be much faster: bombardiering it directly is ~99k. So this overhead is on the Caddy side.

http://localhost:38081 {
	reverse_proxy 127.0.0.1:42069 {
		@fallback status 421
		handle_response @fallback {
			header content-type "text/plain"
			respond "ok" 200
		}
	}
}

This should not incur such overhead.

I'll deal with Caddy later. There's still some work left to do on the iocaine side.

Right. Implemented regexp matching: a simple thing that brought Rhai down to ~9k does ~92k req/sec with Roto, when attacking iocaine directly.

The iocaine side of this will be good enough, based on these observations. The next part (after I implemented all the things I need for the classifier) will be to figure out if I can make Caddy faster.

Because if my classifier is faster, but the Caddy overhead is slower than the speed gain, then I'm still fucked.

The good news is: I have a few ideas.

But time for bedtime stories now.

     Running benches/classifier.rs (target/release/deps/classifier-d294cf21366ac898)
Timer precision: 20 ns
classifier             fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ classifier_branchy  76.3 µs       │ 419.7 µs      │ 79.67 µs      │ 82.09 µs      │ 60590   │ 60590
╰─ classifier_static   31.21 ns      │ 3.773 µs      │ 56.74 ns      │ 49.3 ns       │ 1482685 │ 94891840

Nice. Branchy is up from about 28.7k samples in 5s to 60k.

This is just the decision making, no garbage generation involved.

First idea worked, somewhat, we're up to 41k req/sec. Still too slow, though.

I mean, iocaine itself is pretty fast. But the reverse_proxy + handle_response part seems to be slowing things down considerably.

D'oh.

I screwed up the rules, and the tests weren't fair! The Caddyfile test was serving good content to bombardier, the iocaine classifier was serving garbage. So I had the garbage generation tax on top of it.

Fixed the rules, and things feel a bit better now, up to 52k req/sec, ~2-3k faster than the Caddyfile classifier.

Mind you, this isn't my entire classification system ported yet, so I expect it will go down a little. But this is finally, finally, finally looking viable.

I think I managed to port over my classifier. Let's see how it works...

called `Option::unwrap()` on a `None` value

Err. Oops. But it gets worse!

thread 'tokio-runtime-worker' panicked at core/src/panicking.rs:221:5:
unsafe precondition(s) violated: ptr::copy_nonoverlapping requires that both pointer arguments are aligned and non-null and the specified memory ranges do not overlap
stack backtrace:
thread 'main' panicked at cargo-auditable/src/cargo_auditable.rs:40:39:
called `Option::unwrap()` on a `None` value

I might have fubared something up a little badly.

Seems to be related to the unix socket stuff. Can't repro on TCP sockets.

Hrm, repro'd on TCP too, but only if proxied through Caddy. Huh.

...nah. bombardier blew it up too.

Ok. Let's walk back a bit.

Interesting. It's not the unix listener stuff. Ok. This will be some fun debugging I guess.

OOOF.

Massive L. Managed to make things work, somewhat, and the big ai.robots.txt regexp slows things down to ~10k req/sec.

Not fine. But I have an idea.

Rewrote the ai.robots.txt matching to use .contains(), no regexps. 60k req/sec in debug mode.

So the regexps are... a Problem.
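The `.contains()` rewrite boils down to a substring scan over a needle list; a sketch (needle list abbreviated and illustrative, not the full ai.robots.txt set):

```rust
/// Illustrative only: a few bot-name fragments; the real list comes from
/// ai.robots.txt and is much longer.
const BOT_NEEDLES: &[&str] = &["GPTBot", "CCBot", "ClaudeBot", "Bytespider"];

/// Substring-based matching instead of one big alternation regexp.
fn is_known_bot(user_agent: &str) -> bool {
    BOT_NEEDLES.iter().any(|n| user_agent.contains(n))
}

fn main() {
    assert!(is_known_bot("Mozilla/5.0 (compatible; GPTBot/1.2)"));
    assert!(!is_known_bot("curl/8.13.0"));
}
```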

I tried pre-compiling the regexp in Rust, but that leads to crashes all over the place.

thread 'tokio-runtime-worker' panicked at /home/algernon/.cargo/registry/src/index.crates.io-6f17d22bba15001f/regex-syntax-0.8.5/src/hir/mod.rs:2001:9:
misaligned pointer dereference: address must be a multiple of 0x8 but is 0x3a3a736e69746c69

Hm.

While a single Regex can be freely used from multiple threads simultaneously

(Source)

New idea: I'll keep regexp support, but will rewrite my rules to not rely on regexps. Most of them don't need to be regexps anyway.

And, I can maybe leverage aho_corasick... now that'd be a nice big win.

Please tell me my gut feeling is wrong.

Phew. My gut was wrong. Not completely wrong, but wrong nevertheless.

I tried reproducing the regexp crash without jemalloc - the gut feeling was that the regex crate and jemalloc don't play well. But no, that's not it!

❯ cargo run -q --no-default-features -- -c tmp/config/classifier.toml
free(): double free detected in tcache 2
thread 'main' panicked at cargo-auditable/src/cargo_auditable.rs:40:39:
called `Option::unwrap()` on a `None` value

This is without trying to share the Regex instance, too, and that feels a bit iffy.

Right. Sleepy time. New thread tomorrow.

A few walls were hit today, but I remain hopeful.