1. <domain>/robots.txt can sometimes have useful info for scraping a website. It often includes links to sitemaps that let you enumerate every page on a site. ultimate-sitemap-parser is a handy library for fetching and parsing them (https://github.com/mediacloud/ultimate-sitemap-parser); see the first sketch below.
2. Instead of parsing HTML tags, you can sometimes extract the data you need from structured metadata (JSON-LD, OpenGraph, microdata). extruct is a useful library for pulling it out as JSON (https://github.com/scrapinghub/extruct); see the second sketch below.
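To make point 1 concrete, here's a minimal sketch with ultimate-sitemap-parser, roughly following its README. The domain is a placeholder and attribute names may differ a bit between versions:

```python
# Enumerate a site's pages via its sitemaps
# (pip install ultimate-sitemap-parser).
from usp.tree import sitemap_tree_for_homepage

# Fetches robots.txt, follows any Sitemap: entries (plus common sitemap
# locations), and recursively parses sitemap indexes.
tree = sitemap_tree_for_homepage("https://example.com/")  # placeholder domain

# all_pages() walks the whole tree and yields every page entry it found.
for page in tree.all_pages():
    print(page.url)
```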
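And a similar sketch for point 2 with extruct. The URL is a placeholder, and the exact syntaxes/keys you want will depend on the site:

```python
# Pull structured metadata (JSON-LD, OpenGraph, microdata) out of a page
# (pip install extruct requests).
import json

import extruct
import requests

url = "https://example.com/some-article"  # placeholder URL
html = requests.get(url, timeout=30).text

# Returns a dict keyed by syntax, e.g. {"json-ld": [...], "opengraph": [...]}
data = extruct.extract(
    html,
    base_url=url,
    syntaxes=["json-ld", "opengraph", "microdata"],
)

print(json.dumps(data["json-ld"], indent=2))
```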
This. A lot of modern sites can be really easy to scrape. Lots of machine-readable data.
APIs (for SPAs), OpenGraph/LD+JSON data in <head>, and data- attributes carrying the real values (e.g. an actual timestamp rather than the "just now" shown to the human). A quick sketch of the last two is below.
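Something like this, with a made-up HTML snippet standing in for a real page (the class and attribute names are just for illustration):

```python
# Grab LD+JSON from <head> and read a data- attribute instead of the
# human-facing text (pip install beautifulsoup4).
import json

from bs4 import BeautifulSoup

html = """
<html><head>
  <script type="application/ld+json">{"@type": "Article", "datePublished": "2023-05-01T12:00:00Z"}</script>
</head><body>
  <span class="age" data-timestamp="1682942400">just now</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Structured metadata: parse every LD+JSON block in the page.
for tag in soup.find_all("script", type="application/ld+json"):
    print(json.loads(tag.string))

# data- attribute: take the machine-readable value, not the display text.
span = soup.find("span", class_="age")
print(span["data-timestamp"])  # "1682942400" rather than "just now"
```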
Adding on to this, if an app uses client-side hydration (e.g. Next.js apps), you can often find a big JSON object embedded in the HTML with all of the page data. In those cases you can usually write a bit of custom code to pull out and parse that JSON. Sometimes it's embedded inside JavaScript rather than a standalone <script> tag, so you need a little regex to extract it; see the sketch below.
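A rough sketch of both cases. __NEXT_DATA__ is the script id Next.js actually uses; the regex fallback and the __INITIAL_STATE__ variable name are assumptions you'd adjust per site:

```python
# Dig hydration state out of a Next.js-style page.
import json
import re

from bs4 import BeautifulSoup


def extract_page_state(html: str) -> dict | None:
    soup = BeautifulSoup(html, "html.parser")

    # Easy case: Next.js puts the full page props in a JSON <script> tag.
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag and tag.string:
        return json.loads(tag.string)

    # Fallback: JSON assigned inside inline JavaScript, e.g.
    #   window.__INITIAL_STATE__ = {...};
    # Variable name and the non-greedy match are assumptions; tweak per site.
    m = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    if m:
        return json.loads(m.group(1))

    return None
```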