1. <domain>/robots.txt can sometimes have useful info for scraping a website. It often includes links to sitemaps that let you enumerate every page on a site. ultimate-sitemap-parser is a handy library for fetching and parsing them (https://github.com/mediacloud/ultimate-sitemap-parser); see the first sketch below.
2. Instead of parsing HTML tags, you can sometimes extract the data you need from structured metadata (JSON-LD, OpenGraph, microdata). extruct is a useful library for pulling it out as JSON (https://github.com/scrapinghub/extruct); see the second sketch below.
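To make point 1 concrete, here's a minimal sketch with ultimate-sitemap-parser, roughly following its README. The domain is a placeholder and attribute names may differ a bit between versions:

```python
# Enumerate a site's pages via its sitemaps
# (pip install ultimate-sitemap-parser).
from usp.tree import sitemap_tree_for_homepage

# Fetches robots.txt, follows any Sitemap: entries (plus common sitemap
# locations), and recursively parses sitemap indexes.
tree = sitemap_tree_for_homepage("https://example.com/")  # placeholder domain

# all_pages() walks the whole tree and yields every page entry it found.
for page in tree.all_pages():
    print(page.url)
```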
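And a similar sketch for point 2 with extruct. The URL is a placeholder, and the exact syntaxes/keys you want will depend on the site:

```python
# Pull structured metadata (JSON-LD, OpenGraph, microdata) out of a page
# (pip install extruct requests).
import json

import extruct
import requests

url = "https://example.com/some-article"  # placeholder URL
html = requests.get(url, timeout=30).text

# Returns a dict keyed by syntax, e.g. {"json-ld": [...], "opengraph": [...]}
data = extruct.extract(
    html,
    base_url=url,
    syntaxes=["json-ld", "opengraph", "microdata"],
)

print(json.dumps(data["json-ld"], indent=2))
```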
This. A lot of modern sites can be really easy to scrape. Lots of machine-readable data.
APIs (for SPAs), OpenGraph/LD+JSON data in <head>, and data- attributes carrying the real values (e.g. an actual timestamp rather than the "just now" shown to the human). A quick sketch of the last two is below.
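Something like this, with a made-up HTML snippet standing in for a real page (the class and attribute names are just for illustration):

```python
# Grab LD+JSON from <head> and read a data- attribute instead of the
# human-facing text (pip install beautifulsoup4).
import json

from bs4 import BeautifulSoup

html = """
<html><head>
  <script type="application/ld+json">{"@type": "Article", "datePublished": "2023-05-01T12:00:00Z"}</script>
</head><body>
  <span class="age" data-timestamp="1682942400">just now</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Structured metadata: parse every LD+JSON block in the page.
for tag in soup.find_all("script", type="application/ld+json"):
    print(json.loads(tag.string))

# data- attribute: take the machine-readable value, not the display text.
span = soup.find("span", class_="age")
print(span["data-timestamp"])  # "1682942400" rather than "just now"
```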
Adding on to this, if an app uses client-side hydration (e.g. Next.js apps), you can often find a big JSON object embedded in the HTML with all of the page data. In those cases you can usually write a bit of custom code to pull out and parse that JSON. Sometimes it's embedded inside JavaScript rather than a standalone <script> tag, so you need a little regex to extract it; see the sketch below.
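A rough sketch of both cases. __NEXT_DATA__ is the script id Next.js actually uses; the regex fallback and the __INITIAL_STATE__ variable name are assumptions you'd adjust per site:

```python
# Dig hydration state out of a Next.js-style page.
import json
import re

from bs4 import BeautifulSoup


def extract_page_state(html: str) -> dict | None:
    soup = BeautifulSoup(html, "html.parser")

    # Easy case: Next.js puts the full page props in a JSON <script> tag.
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag and tag.string:
        return json.loads(tag.string)

    # Fallback: JSON assigned inside inline JavaScript, e.g.
    #   window.__INITIAL_STATE__ = {...};
    # Variable name and the non-greedy match are assumptions; tweak per site.
    m = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    if m:
        return json.loads(m.group(1))

    return None
```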