Is there an "HTML reducer" out there? I've been considering writing one. If you take a page's source it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc.
I feel like if you used a DOM parser to walk the tree and only kept nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?) you'd get significant savings. Perhaps the XPath approach might work better too. You could even drop the markup symbols entirely and represent it as an indented text file.
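Something like this BeautifulSoup sketch is roughly what I mean -- keep only elements that actually carry text, keep class/id, and print the tree as indented lines (the example.com URL and the drop-list of tags are just placeholders, not a real tool):

    # Rough sketch of an "HTML reducer": walk the DOM, drop script/style,
    # keep only elements with text somewhere underneath, and emit
    # tag + class/id + direct text as an indented outline.
    import requests
    from bs4 import BeautifulSoup, Tag

    DROP = {"script", "style", "noscript", "svg", "iframe"}

    def reduce_node(node, depth=0):
        lines = []
        for child in node.children:
            if not isinstance(child, Tag) or child.name in DROP:
                continue
            if not child.get_text(strip=True):
                continue  # no text anywhere in this subtree: skip it entirely
            attrs = []
            if child.get("id"):
                attrs.append("#" + child["id"])
            if child.get("class"):
                attrs.append("." + ".".join(child["class"]))
            # only the text that belongs to this node directly, not its descendants
            own_text = " ".join(s.strip() for s in child.find_all(string=True, recursive=False) if s.strip())
            lines.append("  " * depth + child.name + "".join(attrs) + ((" " + own_text) if own_text else ""))
            lines.extend(reduce_node(child, depth + 1))
        return lines

    html = requests.get("https://example.com").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    print("\n".join(reduce_node(soup.body or soup)))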
We use Readability for things like this, but you lose the DOM structure, and the quality drops on JS-heavy websites and pages with actions like "continue reading" which expand the text.
Jina.ai offers a really neat (currently free) API for this - you add https://r.jina.ai/ to the beginning of any URL and it gives you back a Markdown version of the main content of that page, suitable for piping into an LLM.
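e.g., in Python (the article URL is just a placeholder):

    # Prepend https://r.jina.ai/ to any URL to get the page back as Markdown.
    import requests

    url = "https://example.com/some-article"  # placeholder: any page you want reduced
    markdown = requests.get("https://r.jina.ai/" + url).text
    print(markdown)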
I've been using the Readability bit (minus the Markdown) myself to extract the title and main content from a page - I have a recipe for running it via Playwright using my shot-scraper tool here: https://shot-scraper.datasette.io/en/stable/javascript.html#...
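The rough shape of it, translated into Playwright's Python API rather than the exact shot-scraper command (the Skypack CDN import is my assumption here, and it can be blocked by a page's CSP):

    # Load the page, pull Readability.js in from a CDN, run it in the page,
    # and get back the parsed title + main content.
    from playwright.sync_api import sync_playwright

    READABILITY_JS = """
    async () => {
        const mod = await import('https://cdn.skypack.dev/@mozilla/readability');
        return new mod.Readability(document.cloneNode(true)).parse();
    }
    """

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/article")  # placeholder URL
        article = page.evaluate(READABILITY_JS)   # may be None if Readability gives up
        print(article["title"])
        print(article["textContent"])
        browser.close()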
I snuck in an edit about Readability before I saw your reply. The quality of that one in particular is very meh, especially for most news sites, and then you lose all of the DOM structure in case you want to do more with the page. Though now I'm curious how it works on the weather.com page the author tried. A puppeteer -> screenshot -> OCR (or even multi-modal, though many of those do OCR first anyway) -> LLM pipeline might work better there.
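Something like this, roughly (sketching with Playwright instead of puppeteer, pytesseract for the OCR step, and the LLM call left as a stub):

    # Screenshot -> OCR leg of the pipeline; the LLM call is out of scope here.
    from playwright.sync_api import sync_playwright
    from PIL import Image
    import pytesseract

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://weather.com", wait_until="networkidle")
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    text = pytesseract.image_to_string(Image.open("page.png"))
    # ...then send `text` (or the PNG itself, for a multi-modal model) to the LLM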
Only works insofar as sites are being nice. A lot of sites do things like: render all text via JS, render article text via API, paywall content by showing a preview snippet of static text before swapping it for the full text (which lives in a different element), lazyload images, lazyload text, etc etc.
DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.
Indeed, Safari's reader already upgrades to using the rendered page, but even it fails on more esoteric pages using e.g. lazy-loaded content (i.e. you haven't scrolled to it yet for it to load), or (god forbid) virtualized scrolling pages, which offload content that's out of view.
It's a big web out there, there's even more heinous stuff. Even identifying what the main content is can be a challenge.
And reader mode has the benefit of being run by the user. Identifying when to run a page-simplifying action on some headlessly loaded URL can be tricky. I imagine it would need to be something like: load the URL, await the load event, scroll to the bottom of the page, wait for the network to be idle (and possibly for long tasks/animations to finish, too).
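Roughly, in Playwright (Python) terms -- the URL, step size, and scroll cap are placeholders, and this still won't handle virtualized scrolling:

    # Sketch: load the page, wait for the load event, scroll down in steps so
    # lazy-loaded content triggers, then wait for the network to go quiet.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com", wait_until="load")  # placeholder URL

        height = page.evaluate("document.body.scrollHeight")
        pos = 0
        while pos < height and pos < 50_000:  # cap so infinite-scroll pages don't loop forever
            page.mouse.wheel(0, 800)
            page.wait_for_timeout(250)
            pos += 800
            height = page.evaluate("document.body.scrollHeight")  # may grow as content loads

        page.wait_for_load_state("networkidle")  # can time out on chatty pages
        html = page.content()  # now run your reducer/Readability step on this
        browser.close()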
That’s easy to do with BeautifulSoup in Python. Look up tutorials on that. Use it to strip non-essential tags. That will at least work when the content is in the HTML rather than procedurally generated (e.g. via JavaScript).
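A minimal sketch (the tag list and URL are just examples):

    # Strip the obviously non-essential tags before doing anything else with the page.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript", "iframe", "svg", "link", "meta"]):
        tag.decompose()
    print(soup.prettify())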
What's the gold standard for something like this?