Hacker News | maurycyz's comments

That would work, but I'd really prefer not to force users to run JavaScript, break RSS readers and slow down page loads (round trips are expensive). Adding a link maze to a random corner of the site doesn't impact users at all.


> Do they do any harm

Not to me, but I've known people who have had their sites DDoSed out of existence by the scrapers. On the internet, it's often the smallest sites with the smallest budgets that have the best content, and those are hit the worst.

> They do provide source for material if users asks for it

Not for material they trained on. Those sources are just Google results for the question you asked. By nature, they cannot cite the information gathered by their crawlers.

> You still need to pay for the traffic

It's so little traffic my hosting provider doesn't bother billing me for it.

> and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.

Sure, but it's the principle of the thing: I don't like it when billion-dollar companies steal my work and then use it to make the internet a worse place by filling it with AI slop/spam. If I can make their lives harder and their product worse for virtually no cost, I will.


The problem is that believable content doesn't compress well. You aren't going to get anywhere close to that 1:1000 compression ratio unless it's just a single word/character repeated thousands of times.

It's a choice between sending them some big files that will be filtered out long before they can do any real damage, or sending them nonsense text that might actually make its way into their training data.
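The compression claim is easy to check. A minimal Python sketch, assuming zlib (deflate) as the compressor; the vocabulary and the random-word "babble" are hypothetical stand-ins for believable text, not the output of any real generator:

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return uncompressed size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


# Best case for a compression bomb: one character repeated a million times.
repeated = b"a" * 1_000_000

# Hypothetical stand-in for "believable" text: words drawn at random
# from a small vocabulary, with a fixed seed so the result is repeatable.
rng = random.Random(0)
vocabulary = [
    b"the", b"crawler", b"server", b"request", b"page", b"link", b"maze",
    b"thread", b"proxy", b"cookie", b"banner", b"traffic", b"content",
    b"scraper", b"training", b"data",
]
babble = b" ".join(rng.choice(vocabulary) for _ in range(200_000))

repeated_ratio = compression_ratio(repeated)
babble_ratio = compression_ratio(babble)

print(f"repeated character: {repeated_ratio:.0f}:1")
print(f"random-word babble: {babble_ratio:.1f}:1")
```

Even this toy babble, which is far more repetitive than real prose, should land well short of the ratio the repeated character gets.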


It's often one IP (v4!) per one request. It's insane how many resources are being burned on this stupidity.

Part of the reason I did this is to get good numbers on how bad the problem is: A link maze is a great way to make otherwise very stealthy bots expose themselves.


Even if this is true, how long can that be sustained before they start to be recycled? I bet the scrapers make a whole lot more requests than they have IPs.


Sorry about that, stupid mistake on my side. I've fixed the version on the server, and you can just edit the line to "pthread_detach(thread);". The snprintf() is only part of a status page, so you can remove it if you want.

As for the threads, that could be an issue if directly exposed to the internet: all it would take is for an attacker to open a whole bunch of connections and never send anything to OOM the process. However, this isn't possible if it's behind a reverse proxy, because the proxy has to receive all the information the server needs before routing the request. That should also filter out any malformed requests; I'm fairly sure the parser has sane error handling, but it doesn't hurt to be safe.


> Sorry about that, stupid mistake on my side. I've fixed the version on the server, and you can just edit the line

Chant with me:

    -Werror=all -Werror=extra -pedantic
Chant with me.

Also, stop using C. Use C++. You can use it just like C, but you can also learn some of the guardrails that C++ provides.


Not sure if I agree with you on the thread exhaustion issue. The client can still send a flood of correctly-formed requests; the reverse proxy will pass them all through. As I said above, yes, the fact that babble processes requests so quickly would make this harder, but you could still end up with (tens of?) thousands of concurrent requests if someone is really determined to mess with you.

A solution could be to limit concurrent requests in the reverse proxy, but personally I prefer to write software that doesn't require another piece of software, configured correctly, to keep it safe.
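For illustration, the same cap can live inside the server rather than in the proxy. A minimal Python sketch, assuming a thread-per-connection design; `ConnectionLimiter` and its method names are hypothetical and not taken from babble:

```python
import threading


class ConnectionLimiter:
    """Cap the number of connections being handled at once (hypothetical helper)."""

    def __init__(self, max_concurrent: int) -> None:
        # BoundedSemaphore raises ValueError if released more often than acquired,
        # which catches bookkeeping bugs in the caller.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self) -> bool:
        """Claim a slot without blocking; False means the server is full."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Give the slot back once the connection has been closed."""
        self._slots.release()


limiter = ConnectionLimiter(max_concurrent=2)
first = limiter.try_acquire()   # slot 1 taken
second = limiter.try_acquire()  # slot 2 taken
third = limiter.try_acquire()   # no slots left: reject or close this connection
limiter.release()               # one handler finished
fourth = limiter.try_acquire()  # the freed slot can be claimed again
```

An accept loop would call `try_acquire()` before spawning a handler thread and close the socket immediately when it returns False, so a flood costs a rejected connection rather than a thread.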

And regardless, even with ~25 years of C experience under my belt, I don't think I'd ever be wholly comfortable exposing my C code to the internet, even behind a reverse proxy. Not coming at you directly with this, but I'm frankly skeptical of anyone who is comfortable with that, especially for a one-off service that won't see a lot of use and won't get a lot of eyeballs on it. (And I'm especially uncomfortable with the idea of posting something like this on a website and encouraging others to use it, when readers may not understand the issues involved.)


> The client can still send a flood of correctly-formed requests

This is possible with any server. It's a known exploit and very difficult to fully mitigate: https://en.wikipedia.org/wiki/Denial-of-service_attack Whatever you do, they can always overwhelm your network connection.

And yes, there is inherent risk with exposing any service to the internet. That goes for any program, written in any language (remember Log4Shell?) doing any task.


I continuously encourage others to do exactly this. It is a great learning opportunity. If they are not aware that they will get DoS'd, now they will know. It's not like they will get PTSD from having to wait for the OOM killer or losing their VPS. You learned it that way, I learned it that way, why not others? At least this way they will have real experience under their belt, not some online diatribe.


Thread exhaustion attack

1. Start <thread_count> connections to a server

2. Hold connections open

3. Do nothing else

Server

1. Incoming connection: assign a thread.

2. Wait for request <--- Attack causes us to get stuck here

3. Serve request

4. Close connection and thread / return to threadpool

Solution: Use a reverse proxy to handle the incoming connections. Typical reverse proxies such as nginx use event-based polling, not a per-connection thread, so they are immune to this issue.
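Step 2 of the server loop is where a thread-per-connection design gets stuck. Separately from the reverse proxy, a read timeout bounds how long a silent client can hold a thread. The sketch below is an assumption for illustration only: `read_request` and its 5-second default are hypothetical and not taken from any server discussed here.

```python
import socket
from typing import Optional


def read_request(conn: socket.socket, timeout_s: float = 5.0) -> Optional[bytes]:
    """Wait for a request on conn, giving up after timeout_s seconds.

    Returns the received bytes, or None if the client sent nothing in time
    or closed the connection.
    """
    conn.settimeout(timeout_s)
    try:
        data = conn.recv(4096)
    except socket.timeout:
        return None  # client held the connection open without sending anything
    return data or None  # empty bytes means the peer closed the connection


# Demonstration with a connected socket pair standing in for client and server.
server_side, client_side = socket.socketpair()
silent = read_request(server_side, timeout_s=0.05)  # client sends nothing
client_side.sendall(b"GET / HTTP/1.0\r\n\r\n")
served = read_request(server_side, timeout_s=0.5)   # client sends a request
server_side.close()
client_side.close()
```

A timeout only limits how long each idle connection ties up a thread. It does not stop a client that keeps reconnecting or trickles bytes slowly, so it complements the proxy rather than replacing it.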


The way you deal with this is that you write the server to be async I/O based with NPROC threads, not a thread-per-client design, and then you can use CPS (continuation-passing style) for the business logic, but in this case it's so trivial... You can probably get by with just a handful of bytes of memory pressure per client in the app + whatever the per-client TCB is for the TCP connection for a total of less than 200 bytes per client.
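A rough Python sketch of the event-driven shape, using asyncio on a single event loop rather than NPROC threads. That is a simplification, and the per-client memory cost in Python is far higher than the figure above. `handle_client`, `serve_with_idle_clients`, and the canned response are hypothetical names and values for illustration:

```python
import asyncio

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 3\r\n\r\nok\n"


async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    """Serve one connection; an idle client costs a suspended coroutine, not a thread."""
    try:
        line = await reader.readline()  # yields to the event loop while waiting
        if line:
            writer.write(RESPONSE)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def serve_with_idle_clients(idle_clients: int) -> bytes:
    """Park idle_clients silent connections, then make one real request."""
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # The "attack": connections that are opened and never send anything.
    idle = [await asyncio.open_connection("127.0.0.1", port)
            for _ in range(idle_clients)]

    # A legitimate client still gets served while the idle ones sit there.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"GET / HTTP/1.0\r\n")
    await writer.drain()
    reply = await reader.read()  # read until the server closes the connection

    writer.close()
    for _, idle_writer in idle:
        idle_writer.close()
    server.close()
    await server.wait_closed()
    return reply


reply = asyncio.run(serve_with_idle_clients(50))
```

The idle connections still consume file descriptors and kernel socket state, so a real deployment would also want a cap or a timeout on them.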


You didn't actually address the concerns I laid out. And I acknowledged that a reverse proxy, appropriately configured, could mitigate the issue.


My bad. It's fixed now. (and yes, the gcc suggested fix is the right one.)


The law does not mandate cookie banners. Cookie popups are malicious compliance by advertising and analytics companies to continue spying.

The real solution is to tighten what counts as consent.


I don’t know if people are noticing this, but apple.com doesn’t have a cookie banner. It’s perfectly possible to operate a website – even a shop – without having a cookie banner. Even one of the biggest here in Europe in terms of revenue. As OP said, cookie banners are just malicious compliance. You don’t need one unless you’re doing shady things. Unfortunately it looks like the advertisers and trackers are winning, as the EU is planning to relax the rules. I believe there would have been another way, something like banning unnecessary tracking altogether.


It does mandate cookie banners. The client I worked for on the last project got fined because they were missing a cookie banner.

The solution is much simpler - ban targeted ads. The entire purpose of collecting user data is to deliver targeted ads.


Nah, they were fined for tracking users without consent.


And cookie banners are what exactly? That's right, you ask users to consent to use cookies to track them. Thus, cookie banners are REQUIRED because you can't run a business without any insight into how your users use your product.


Except it's not tracking. Remembering user preferences is the original goal of cookies, and doesn't come with any legal requirements.

The law is (paraphrasing) "You must not use cookies or similar to be evil without permission". Advertising companies decided that instead of not being evil, they'd annoy users into giving permission.


By setting billions of VC money on fire: https://en.wikipedia.org/wiki/OpenAI

No, really. They just have entire datacenters filled with high end GPUs.

