That would work, but I'd really prefer not to force users to run JavaScript, break RSS readers and slow down page loads (round trips are expensive). Adding a link maze to a random corner of the site doesn't impact users at all.
Not to me, but I've known people who have had their sites DDoSed out of existence by the scrapers. On the internet, it's often the smallest sites with the smallest budgets that have the best content, and those are hit the worst.
> They do provide sources for material if users ask for it
Not for material they trained on. Those sources are just google results for the question you asked. By nature, they cannot cite the information gathered by their crawlers.
> You still need to pay for the traffic
It's so little traffic my hosting provider doesn't bother billing me for it.
> and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.
Sure, but it's the principle of the thing: I don't like when billion dollar companies steal my work, and then use it to make the internet a worse place by filling it with AI slop/spam. If I can make their lives harder and their product worse for virtually no cost, I will.
The problem is that believable content doesn't compress well. You aren't going to get anywhere close to that 1:1000 compression ratio unless it's just a single word/character repeated thousands of times.
It's a choice between sending them some big files that will be filtered out long before they can do any real damage, or sending them nonsense text that might actually make its way into their training data.
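For a rough sense of the gap, here's a quick zlib comparison (purely illustrative, nothing to do with the site's actual code; build with -lz):

    /* Quick zlib demo: a single repeated character gets close to deflate's
     * ~1000:1 ceiling, while varied text does not come close. Random letters
     * are used here as a crude stand-in for "believable" text; real English
     * does somewhat better than random noise, roughly 3:1 with deflate. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    static void ratio(const char *label, const unsigned char *src, uLong len)
    {
        uLongf dst_len = compressBound(len);
        unsigned char *dst = malloc(dst_len);
        compress(dst, &dst_len, src, len);
        printf("%-15s %lu -> %lu bytes (%.0f:1)\n", label,
               (unsigned long)len, (unsigned long)dst_len,
               (double)len / (double)dst_len);
        free(dst);
    }

    int main(void)
    {
        uLong len = 1 << 20;                      /* 1 MiB of input each */
        unsigned char *rep = malloc(len), *txt = malloc(len);
        memset(rep, 'a', len);                    /* one repeated character */
        for (uLong i = 0; i < len; i++)           /* varied, word-ish noise */
            txt[i] = "abcdefghijklmnopqrstuvwxyz  "[rand() % 28];
        ratio("repeated char:", rep, len);
        ratio("varied text:", txt, len);
        free(rep); free(txt);
        return 0;
    }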
It's often one IP (v4!) per one request. It's insane how many resources are being burned on this stupidity.
Part of the reason I did this is to get good numbers on how bad the problem is: A link maze is a great way to make otherwise very stealthy bots expose themselves.
Even if this is true, how long can that be sustained before they start to be recycled? I bet the scrapers make a whole lot more requests than they have IPs.
Sorry about that, stupid mistake on my side. I've fixed the version on the server, and you can just edit the line to "pthread_detach(thread);". The snprintf() is only part of a status page, so you can remove it if you want.
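For anyone following along, here's roughly what the corrected spawn path looks like. This is just a sketch assuming the usual thread-per-connection pattern; the names are mine, not copied from babble:

    #include <pthread.h>
    #include <unistd.h>

    static void *handle_client(void *arg)
    {
        int fd = (int)(long)arg;
        /* ... read the request and reply with babble ... */
        close(fd);
        return NULL;
    }

    static void spawn_handler(int client_fd)
    {
        pthread_t thread;
        if (pthread_create(&thread, NULL, handle_client,
                           (void *)(long)client_fd) == 0) {
            /* Detach so the thread's stack and bookkeeping are reclaimed
             * automatically when it exits; nothing ever joins it. */
            pthread_detach(thread);
        }
    }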
As for the threads, that could be an issue if directly exposed to the internet: all it would take to OOM the process is for an attacker to open a whole bunch of connections and never send anything on them. However, this isn't possible if it's behind a reverse proxy, because the proxy has to receive everything the server needs before routing the request. That should also filter out any malformed requests; while I'm fairly sure the parser has sane error handling, it doesn't hurt to be safe.
Not sure if I agree with you on the thread exhaustion issue. The client can still send a flood of correctly-formed requests; the reverse proxy will pass them all through. As I said above, yes, the fact that babble processes requests so quickly would make this harder, but you could still end up with (tens of?) thousands of concurrent requests if someone is really determined to mess with you.
A solution could be to limit concurrent requests in the reverse proxy, but personally I prefer to write software that doesn't require another piece of software, configured correctly, to keep it safe.
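If you want that cap in the server itself rather than in the proxy, one way is a counting semaphore around the accept loop. A sketch under that assumption (MAX_CLIENTS and the function names are made up, not from babble's source):

    #include <pthread.h>
    #include <semaphore.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_CLIENTS 256

    static sem_t slots;                 /* counts free handler slots */

    static void *handle_client(void *arg)
    {
        int fd = (int)(long)arg;
        /* ... read the request and send the reply ... */
        close(fd);
        sem_post(&slots);               /* free the slot on the way out */
        return NULL;
    }

    static void accept_loop(int listen_fd)
    {
        sem_init(&slots, 0, MAX_CLIENTS);
        for (;;) {
            sem_wait(&slots);           /* stop accepting once all slots are busy */
            int fd = accept(listen_fd, NULL, NULL);
            if (fd < 0) { sem_post(&slots); continue; }

            pthread_t t;
            if (pthread_create(&t, NULL, handle_client, (void *)(long)fd) != 0) {
                close(fd);
                sem_post(&slots);
                continue;
            }
            pthread_detach(t);
        }
    }

With the cap hit, the loop simply stops accepting, so a flood degrades into connections queuing in the listen backlog instead of unbounded thread creation.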
And regardless, even with ~25 years of C experience under my belt, I don't think I'd ever be wholly comfortable exposing my C code to the internet, even behind a reverse proxy. Not coming at you directly with this, but I'm frankly skeptical of anyone who is comfortable with that, especially for a one-off service that won't see a lot of use and won't get a lot of eyeballs on it. (And I'm especially uncomfortable with the idea of posting something like this on a website and encouraging others to use it, when readers may not understand the issues involved.)
And yes, there is inherent risk with exposing any service to the internet. That goes for any program, written in any language (remember Log4Shell?) doing any task.
I continuously encourage others to do exactly this. It is a great learning opportunity. If they are not aware that they will get DoS'd, now they will know. It's not like they will get PTSD from having to wait for the OOM killer or losing their VPS. You learned it that way, I learned it that way, why not others? At least this way they will have real experience under their belt, not some online diatribe.
2. Wait for request <--- Attack causes us to get stuck here (sketched just after this list)
3. Serve request
4. Close connection and thread / return to threadpool
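Concretely, step 2 is the thread sitting in a blocking read, something like this (illustrative only, not any particular server's code):

    /* With a thread per connection, the thread parks in recv() until the
     * client sends something or disconnects. A client that connects and
     * then stays silent pins this thread -- and its stack -- indefinitely
     * unless a receive timeout is set. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *handle_connection(void *arg)
    {
        int fd = (int)(long)arg;
        char buf[4096];

        /* Step 2: blocks here for as long as the attacker stays quiet. */
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n > 0) {
            /* Steps 3-4: parse, respond, then tear down. */
            static const char resp[] =
                "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";
            send(fd, resp, sizeof resp - 1, 0);
        }
        close(fd);
        return NULL;
    }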
Solution: Use a reverse proxy to handle the incoming connections. Typical reverse proxies such as nginx use event-based polling rather than a per-connection thread, so they are immune to this issue.
The way you deal with this is that you write the server to be async I/O based with NPROC threads, not a thread-per-client design, and then you can use CPS for the business logic, but in this case it's so trivial... You can probably get by with just a handful of bytes of memory pressure per client in the app + whatever the per-client TCB is for the TCP connection for a total of less than 200 bytes per client.
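A minimal sketch of that shape, assuming Linux epoll (you'd run one such loop per core; the struct and names are illustrative, and the actual parser/response logic is elided):

    /* Event-driven server skeleton: non-blocking sockets, one epoll loop,
     * and a tiny heap-allocated state struct per client instead of a
     * dedicated thread and stack. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct client {
        int fd;
        unsigned short nread;     /* bytes of the request seen so far */
    };

    static void set_nonblock(int fd)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    void event_loop(int listen_fd)
    {
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.ptr = NULL };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event events[64];
            int n = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < n; i++) {
                struct client *c = events[i].data.ptr;
                if (c == NULL) {                      /* listening socket */
                    int fd = accept(listen_fd, NULL, NULL);
                    if (fd < 0) continue;
                    set_nonblock(fd);
                    c = calloc(1, sizeof *c);
                    c->fd = fd;
                    struct epoll_event cev = { .events = EPOLLIN, .data.ptr = c };
                    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &cev);
                } else {                              /* client readable */
                    char buf[512];
                    ssize_t r = read(c->fd, buf, sizeof buf);
                    if (r <= 0) {                     /* closed or error */
                        close(c->fd);
                        free(c);
                        continue;
                    }
                    c->nread += r;
                    /* ... feed bytes to the request parser; write the
                     * response once a full request line has arrived ... */
                }
            }
        }
    }

A stalled client here costs one small heap allocation plus the kernel's socket state, which is how you end up in the couple-hundred-bytes-per-client range instead of a thread stack each.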
I don’t know if people are noticing this but apple.com doesn’t have a cookie banner. It’s perfectly possible to operate a website – even a shop – without having a cookie banner. Even one of the biggest here in Europe in terms of revenue. As OP said, cookie banners are just malicious compliance. You don’t need one unless you’re doing shady things. Unfortunately it looks like the advertisers and trackers are winning as the EU is planning to relax the rules. I believe there would have been another way, something like banning unnecessary tracking altogether.
And cookie banners are what exactly? That's right, you ask users to consent to use cookies to track them. Thus, cookie banners are REQUIRED because you can't run a business without any insight into how your users use your product.
Except it's not tracking. Remembering user preferences is the original goal of cookies, and doesn't come with any legal requirements.
The law is (paraphrasing) "You must not use cookies or similar to be evil without permission". Advertising companies decided that instead of not being evil, they'd annoy users into giving permission.