Now take that, and add someone at our Polish supermarket chain (Biedronka) having the dumb "insight" to disable the "scan multiple" option. Until about a month ago, whenever I was buying something in larger quantity, I could just press "Scan multiple", tap in the amount, scan the barcode once, and move all the items of the same type to the "already scanned" zone. Now I have to do it one by one, each time waiting for the scales to settle. Infuriating when you're buying some spice bag or candy and have to scan 12 of them one by one.
I'm guessing it's the HDD that's failing. I had similarly mysterious failures with my NVR (the Cloud Key thingie) from UniFi. Turns out HDDs don't like operating at 60+ degrees Celsius all the time - but SSDs don't mind, so fortunately the fix was just to swap the drive for a solid-state one.
Yup. The problem was never with the technology replacing work; it was always with the social aspect of deploying it, which ends up pulling the rug out from under people whose livelihoods depend on exchanging labor for money.
The Luddites didn't destroy automated looms because they hated technology; they did it because losing their jobs and seeing their whole occupation disappear ruined their lives and the lives of their families.
The thing to fix isn't automation itself; it's automation destroying people's lives at scale.
Influencers are, by definition, advertisers - and a particularly insidious, ugly bunch at that.
If we go by the vibe of this thread, it's yet another reason to avoid social media. You wouldn't want to reward people like this.
As for the broader topic, this segues into the worryingly popular fallacy of the excluded middle. Just because you're not against something doesn't mean you support it. Being neutral, ambivalent, or plain old not giving a fuck about a whole class of issues is a perfectly legitimate place to be. In fact, that's everyone's default position on most things, because humans have limited mental capacity - we can't hold calculated views on every single thing in the world all the time.
There isn't. There never was one, because the vast majority of websites are actually selfish with respect to data, even when that's entirely pointless. You can see this even here, in how some people complain that LLMs made them stop writing their blogs: it turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.
Anyway, all that means there was never a critical mass of sites large enough for a default bulk-data-dump discovery mechanism to become established. This means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape per-page anyway.
> turns out plenty of people say they write for others to read
LLMs are not people. Bloggers don't write so that a company can profit from their writing by training LLMs on it; they write for others to read their ideas.
LLMs aren't making their owners money by just idling on datacenters' worth of GPUs. They're making money by being useful to users who pay for access. The knowledge and insights from the writings that go into training data all end up being read by people directly, as well as informing even more useful output and work that benefits even more people.
Except the output coming from an LLM is the LLM's take on it, not the original source material. It's not the same thing. Not all writing is simply a collection of facts.
Which is irrelevant if you're truly trying to "pay it forward".
That is the core of my observation: people claim to publish to benefit society, but when push comes to shove, they care more about getting credit and having oversight over who is benefiting, to the point of refusing to publish further (and sometimes unpublishing things) if that credit/control isn't given.
The problem isn't in wanting these things - it's in not being up-front about it.
> turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.
I couldn't care less about "tracking and controlling the audience," but I have no interest in others using my words and photos to profit from slop generators. I make that clear in robots.txt and licenses, but they ignore both.
Users are not being trained. Despite the seemingly dominant HN belief to the contrary, people use LLMs for interacting with information (on the web or otherwise) because they work. SOTA LLM services are just that good.
I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?
There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and it cultivates goodwill instead of burning it.
This also means that right now, it could be easier to push through such a standard than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick. A rough sketch of what discovery could look like follows the footnotes.
--
[0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's standard practice for online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.
[1] - Not to be confused with having an LLM scrape a specific page because a user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).
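To make the discovery idea concrete, here's a minimal sketch of what a polite scraper could do if such a convention existed. The /.well-known/bulk-data.json path and the manifest fields are entirely made up for illustration - this is not any real standard:

    import requests  # any HTTP client would do

    # Hypothetical discovery path - not a real standard, purely illustrative.
    MANIFEST_PATH = "/.well-known/bulk-data.json"

    def fetch_bulk_or_crawl(site, crawl_fallback):
        """Prefer a site-provided bulk dump; otherwise scrape per-page."""
        try:
            resp = requests.get(site + MANIFEST_PATH, timeout=10)
            if resp.ok:
                # Imagined manifest shape: {"dump_url": ..., "updated": ...}
                dump_url = resp.json()["dump_url"]
                return requests.get(dump_url, timeout=60).content
        except (requests.RequestException, ValueError, KeyError):
            pass  # no manifest, or a broken one: behave like today's scrapers
        return crawl_fallback(site)

The incentive structure is the whole point: one cheap GET against a well-known path either replaces an entire crawl or costs next to nothing.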
You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or who just don't care? If they did know or care, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they act as if they and everyone else have infinite resources, so they can just hammer websites forever.
I realise you are making assertions for which you have no evidence. Until a standard exists, we can't just assume nobody will use it, particularly when it makes the very task they are scraping for simpler and more efficient.
A lot of the internet is built on trust. Mix in this article describing yet another tragedy of the commons, and you can see where this logically ends up.
Unless we have some government regulating the standard, another trust-based contract won't do much.
> I realise you are making assertions for which you have no evidence.
We do have evidence: their current behavior. If they are happy ignoring robots.txt (and also ignoring copyright law), what makes you believe they won't just ignore this new standard too? Sure, in theory it might save them money, but if there's one thing that seems blatantly obvious, it's that money isn't what these companies care about, because people just keep turning on the money generator for them. If they did care about it, they wouldn't be spending far more than they earn, and they wouldn't be creating circular economies to try to justify their existence. If my assertion has no evidence, I don't see how yours does either, especially since we have seen that these companies will do anything to get what they want.
Simpler and more efficient for whom? I imagine some random guy vibe-coding "hi chatgpt I want to scrape this and this website", getting something running, then going to LinkedIn to brag about AI. Yes, I have no hard evidence for this, but I see things on LinkedIn.
That's not the problem being discussed here, though. That's normal usage, and you can hardly blame AI companies for shitty scrapers random users create on demand, because it's merely a symptom of coding getting cheap. Or, more broadly, the flip side of the computer becoming an actual "bicycle for the mind" and empowering end-users for a change.
Reality doesn't have a distinction between "code" and "data"; those are categories of convenience, and they don't even have a proper definition (what is code and what is data depends on who's asking and why). Any such distinction requires mechanically enforcing it; AI won't have it, because it's not natural, and adding it destroys the generality of the model.
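A toy Python illustration of the point - whether the string below is "code" or "data" depends entirely on what you do with it:

    expr = "40 + 2"      # to len(), this is data: six characters
    print(len(expr))     # -> 6,  treated as data
    print(eval(expr))    # -> 42, the same bytes treated as code

Nothing intrinsic to the bytes changed; only the machinery interpreting them did.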
Haha. But DNA is a very good example of what I'm talking about. It's both "code" and "data" at the same time - or rather, a perfect demonstration that these concepts don't exist in nature.
I get the joke, but it's also an incredibly interesting topic to ponder. Remember "Reflections on Trusting Trust"? Now consider that DNA itself needs a complex biomolecular machine to "compile" it into cells and organisms, and that this also embeds in them copies of the "compiler" itself. This raises the question of how much of the information needed to build the organism is not explicitly encoded anywhere in the DNA itself, but instead accumulates in the replication machinery and gets carried over implicitly.
So for you to successfully use my DNA as code, without also borrowing the compiler from my body, would be a major scientific result, shedding light on the questions outlined above.
So in short: I'm happy to contribute my DNA if you cite me as co-author on the resulting paper :P.
> The problem is, so much of what people want from these things involves having all three.
Pretty much. Also there's no way of "securing" LLMs without destroying the quality that makes them interesting and useful in the first place.
I'm putting "securing" in scare quotes because IMO it's a fool's errand to even try - LLMs are fundamentally not securable like regular, narrow-purpose software, and should not be treated as such.