
> I don't understand how AI scrapers make up such a large percentage of traffic to websites, as people claim they do.

I think a lot of people confuse scraping for training with on-demand scraping for "agentic use" / "deep research", etc. Today I was testing the new GLM-experimental model on their demo site. It had "web search", so I enabled that and asked it about something I had recently researched myself for work. It gave me a good overall list of agentic frameworks, after some Google searching and "crawling" the ~6 sites it found.

As a second message I asked for a list of repo links, how many stars each repo has, and general repo activity. It went on and "crawled" each of the 10 repos on GitHub, couldn't read the stars, but then searched and found a site that reports them, and it "crawled" that site 10 times, once for each framework.

All in all, my 2-message chat session performed ~5-6 searches and 20-30 page "crawls". Imagine what they do when traffic increases. Now multiply that by every "deep research" provider (perplexity, goog, oai, anthropic, etc.). Now think how many "vibe-coded" projects like this exist. And how many are poorly coded and re-crawl each link every time...



Yeah it seems the implementation of these web-aware GPT queries lacks a(n adequate) caching layer.

Could also be framed as an API issue, as there is no technical limitation preventing search providers from returning relevant snapshots of the body of each search result. Then again, there might be legal issues behind not providing that information.


Caching on the client side is an obvious improvement, but probably not trivial to implement at the provider level: what do you cache, are you allowed to, how do you deal with auth tokens (if supported), how do you handle searches where a small difference in the query invalidates the cache, and so on.
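To make a couple of those questions concrete, here is a minimal sketch of a fetch cache, not a description of what any provider actually runs. Everything in it is an assumption for illustration: the names `normalize_url`, `PageCache` and `TRACKING_PARAMS`, the 5-minute TTL, and the choice to simply bypass the cache for authenticated requests. It normalizes URLs so that small differences (parameter order, tracking parameters, fragments) map to the same entry, and it says nothing about the "are you allowed to" part.

```python
import time
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters assumed not to change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def normalize_url(url):
    """Map trivially different URLs to a single cache key."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    path = parts.path or "/"
    # Lowercase scheme and host, and drop the fragment (it never reaches the server).
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class PageCache:
    """In-memory TTL cache in front of a caller-supplied fetch function."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch      # callable: url -> page body
        self._ttl = ttl_seconds  # assumed freshness window
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # cache key -> (stored_at, body)

    def get(self, url, authenticated=False):
        # Assumption: responses tied to an auth token are never shared,
        # so authenticated requests skip the cache entirely.
        if authenticated:
            return self._fetch(url)
        key = normalize_url(url)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        body = self._fetch(url)
        self._entries[key] = (now, body)
        return body
```

With something like this in front of the crawler, the ten repeated "crawls" of the same stats site within one chat session would hit the origin once per distinct URL instead of once per request. A real deployment would more likely honor the origin's own `Cache-Control` headers than use a fixed TTL.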

Another avenue for content creators might be to move to two-tier content serving, where you serve plain HTML as the public interface and only allow "advanced" features that take many CPU cycles for authenticated / paying users. It suddenly doesn't make sense to use a huge, heavy and resource-intensive framework for things that might be crawled a lot by bots / users doing queries w/ LLMs.

Another idea was recently discussed here, and covers "micropayments" for access to content. Probably not trivial to implement either, even though it sounds easy in theory. We've had an entire web3.0 hype cycle on this, and yet no clear easy solutions for micropayments... Oh well. Web4.0 it is :)


A caching layer sounds wonderful. It improves reliability while reducing load on the original servers.

I worry that such caching layers might run afoul of copyright, though :(

Though an internal caching layer would work, surely?



