
Considering the bulk of data is now behind some form of paywall (i.e., you need a Netflix subscription to access Netflix data), I would be very surprised to see even the Bing and Google crawlers combined consuming anywhere near 50% of traffic. They just don't have the access necessary to pull numbers like that.


I don’t quite understand how or why, but I was tangentially involved in a small-to-mid-size e-commerce website that serves only B2B customers in the US (and, to a small extent, Canada), and it got so many hits from what I assume are Chinese search engines that we decided we absolutely needed Cloudflare rate limiting.

Now, the code is a hot mess, sure, and I am partially to blame for that, but that is kind of beside the point. We never need to serve more than a few thousand concurrent users, which the website can handle, but we have to basically ban Chinese traffic to stay online.
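In case it helps anyone in the same boat, the fix was nothing clever. Here is a minimal sketch of the kind of Cloudflare setup that works for this, assuming the dashboard's custom-rule expression syntax; the country, path, and thresholds below are illustrative, not our exact config:

    # Custom WAF rule, action set to Managed Challenge.
    # cf.client.bot is true for Cloudflare's verified "good" bots, so
    # Googlebot/Bingbot pass through while unverified crawlers get challenged.
    (ip.geoip.country eq "CN" and not cf.client.bot)

    # Paired with a rate-limiting rule on the hammered pages, e.g.:
    #   match: http.request.uri.path contains "/product/"
    #   limit: 100 requests per 10 seconds per client IP, then block

The exact numbers matter less than doing the filtering at the edge, before the requests ever reach the app servers.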

Maybe people at bigger companies already know this, but it was a revelation to me how much it takes just to stay alive in production.

IANAL, and I definitely don't know what they get out of crawling every single product detail page on our website multiple times a day. Nothing here changes that often. Maybe they have some bad or overzealous crawler code? Are they looking to take over our servers and then use our machines to attack others? If it is an attack, why use Chinese IP addresses? Why not use their bot farms? If it is a legitimate search engine, why not respect robots.txt?
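For context, a well-behaved crawler would honor something like the robots.txt below (Bytespider is just an example of a notoriously aggressive bot, not necessarily our culprit, and Crawl-delay is a non-standard directive that several major engines ignore):

    # Ask all crawlers to slow down (non-standard; Google ignores Crawl-delay)
    User-agent: *
    Crawl-delay: 10

    # Tell one specific aggressive crawler to stay out entirely
    User-agent: Bytespider
    Disallow: /

But that only constrains crawlers that bother to fetch and obey it; the ones causing the trouble mostly don't, which is why the edge rate limiting ends up doing the real work.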


"If it is an attack, why use Chinese IP addresses?"

Because there is nothing you can do about it anyway?

Even if you could prove it is an attack, would you really consider suing some Chinese IP addresses?

But I rather suspect it is just bad crawler code.



