Is there an "HTML reducer" out there? I've been considering writing one. If you take a page's source it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc.
I feel like if you used a DOM parser to walk the tree and only kept nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?) you'd get significant savings. Perhaps the XPath approach might work better too. You could even drop the markup symbols entirely and represent it as an indented text file.
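Something like this BeautifulSoup sketch is roughly what I mean -- keep only elements that actually carry text, keep class/id, and print the tree as indented lines (the example.com URL and the drop-list of tags are just placeholders, not a real tool):

    # Rough sketch of an "HTML reducer": walk the DOM, drop script/style,
    # keep only elements with text somewhere underneath, and emit
    # tag + class/id + direct text as an indented outline.
    import requests
    from bs4 import BeautifulSoup, Tag

    DROP = {"script", "style", "noscript", "svg", "iframe"}

    def reduce_node(node, depth=0):
        lines = []
        for child in node.children:
            if not isinstance(child, Tag) or child.name in DROP:
                continue
            if not child.get_text(strip=True):
                continue  # no text anywhere in this subtree: skip it entirely
            attrs = []
            if child.get("id"):
                attrs.append("#" + child["id"])
            if child.get("class"):
                attrs.append("." + ".".join(child["class"]))
            # only the text that belongs to this node directly, not its descendants
            own_text = " ".join(s.strip() for s in child.find_all(string=True, recursive=False) if s.strip())
            lines.append("  " * depth + child.name + "".join(attrs) + ((" " + own_text) if own_text else ""))
            lines.extend(reduce_node(child, depth + 1))
        return lines

    html = requests.get("https://example.com").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    print("\n".join(reduce_node(soup.body or soup)))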
We use Readability for things like this, but you lose the DOM structure, and the quality drops on JS-heavy websites and pages with actions like "continue reading" which expand the text.
Jina.ai offers a really neat (currently free) API for this - you add https://r.jina.ai/ to the beginning of any URL and it gives you back a Markdown version of the main content of that page, suitable for piping into an LLM.
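e.g., in Python (the article URL is just a placeholder):

    # Prepend https://r.jina.ai/ to any URL to get the page back as Markdown.
    import requests

    url = "https://example.com/some-article"  # placeholder: any page you want reduced
    markdown = requests.get("https://r.jina.ai/" + url).text
    print(markdown)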
I've been using the Readability bit (minus the Markdown) myself to extract the title and main content from a page - I have a recipe for running it via Playwright using my shot-scraper tool here: https://shot-scraper.datasette.io/en/stable/javascript.html#...
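The rough shape of it, translated into Playwright's Python API rather than the exact shot-scraper command (the Skypack CDN import is my assumption here, and it can be blocked by a page's CSP):

    # Load the page, pull Readability.js in from a CDN, run it in the page,
    # and get back the parsed title + main content.
    from playwright.sync_api import sync_playwright

    READABILITY_JS = """
    async () => {
        const mod = await import('https://cdn.skypack.dev/@mozilla/readability');
        return new mod.Readability(document.cloneNode(true)).parse();
    }
    """

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/article")  # placeholder URL
        article = page.evaluate(READABILITY_JS)   # may be None if Readability gives up
        print(article["title"])
        print(article["textContent"])
        browser.close()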
I snuck in an edit about Readability before I saw your reply. The quality of that one in particular is very meh, especially for most news sites, and then you lose all of the DOM structure in case you want to do more with the page. Though now I'm curious how it works on the weather.com page the author tried. A puppeteer -> screenshot -> OCR (or even multi-modal, though many of those do OCR first anyway) -> LLM pipeline might work better there.
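Something like this, roughly (sketching with Playwright instead of puppeteer, pytesseract for the OCR step, and the LLM call left as a stub):

    # Screenshot -> OCR leg of the pipeline; the LLM call is out of scope here.
    from playwright.sync_api import sync_playwright
    from PIL import Image
    import pytesseract

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://weather.com", wait_until="networkidle")
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    text = pytesseract.image_to_string(Image.open("page.png"))
    # ...then send `text` (or the PNG itself, for a multi-modal model) to the LLM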
Only works insofar as sites are being nice. A lot of sites do things like: render all text via JS, render article text via API, paywall content by showing a preview snippet of static text before swapping it for the full text (which lives in a different element), lazyload images, lazyload text, etc etc.
DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.
Indeed, Safari's reader already upgrades to using the rendered page, but even it fails on more esoteric pages using e.g. lazy-loaded content (i.e. you haven't scrolled to it yet for it to load), or (god forbid) virtualized scrolling pages, which offload content that's out of view.
It's a big web out there, there's even more heinous stuff. Even identifying what the main content is can be a challenge.
And reader mode has the benefit of being run by the user. Identifying when to run a page-simplifying action on some headlessly loaded URL can be tricky. I imagine it would need to be something like: load the URL, await the load event, scroll to the bottom of the page, wait for the network to be idle (and possibly for long tasks/animations to finish, too).
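Roughly, in Playwright (Python) terms -- the URL, step size, and scroll cap are placeholders, and this still won't handle virtualized scrolling:

    # Sketch: load the page, wait for the load event, scroll down in steps so
    # lazy-loaded content triggers, then wait for the network to go quiet.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com", wait_until="load")  # placeholder URL

        height = page.evaluate("document.body.scrollHeight")
        pos = 0
        while pos < height and pos < 50_000:  # cap so infinite-scroll pages don't loop forever
            page.mouse.wheel(0, 800)
            page.wait_for_timeout(250)
            pos += 800
            height = page.evaluate("document.body.scrollHeight")  # may grow as content loads

        page.wait_for_load_state("networkidle")  # can time out on chatty pages
        html = page.content()  # now run your reducer/Readability step on this
        browser.close()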
That’s easy to do with BeautifulSoup in Python. Look up tutorials on that. Use it to strip non-essential tags. That will at least work when the content is in the HTML rather than procedurally generated (e.g. via JavaScript).
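A minimal sketch (the tag list and URL are just examples):

    # Strip the obviously non-essential tags before doing anything else with the page.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript", "iframe", "svg", "link", "meta"]):
        tag.decompose()
    print(soup.prettify())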
What's the gold standard for something like this?