
Readability is great, and I use it, but it's odd how half-assed the maintenance for it has been. I haven't seen any noticeable improvements to it in quite some time, and when I've looked for alternatives, it usually turns out they're using it under the hood in some capacity.

Perhaps it's already being made obsolete by LLM technologies? I'd be curious to hear from anyone who's used a locally running LLM to extract written content, especially if it's been built specifically for that task.



> Readability is great, and I use it, but it's odd how half-assed the maintenance for it has been. I haven't seen any noticeable improvements to it in quite some time

What improvements are you looking for? For me, it works over 95% of the time, so I'm happy. Occasionally it excises a section (e.g. because of its "too short" heuristic), and I wish it were smarter about that. But like you, I haven't found better alternatives. I also need something I can run in a script.

> Perhaps it's already being made obsolete by LLM technologies? I'd be curious to hear from anyone who's used a locally running LLM to extract written content, especially if it's been built specifically for that task.

It would be good to benchmark both approaches across, say, 50 sites and see which one performs better. At the moment, I don't know if I'd trust an LLM more than Readability, especially for longer content. Also, I wouldn't use an LLM to scrape 90K sites. Both slow and expensive!
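
A rough sketch of what such a benchmark could look like, assuming you have hand-cleaned reference text for each page. The scoring (token-level F1) and every name here (`token_f1`, `benchmark`, the `extractors` mapping) are my own assumptions for illustration, not anything Readability or an LLM tool provides:

```python
import re
from collections import Counter


def tokens(text):
    # Lowercased word tokens; crude, but enough to compare extractions.
    return re.findall(r"\w+", text.lower())


def token_f1(extracted, reference):
    # Bag-of-words overlap between extracted text and the reference text.
    ext, ref = Counter(tokens(extracted)), Counter(tokens(reference))
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())  # penalizes extra junk (nav, footer)
    recall = overlap / sum(ref.values())     # penalizes missing body text
    return 2 * precision * recall / (precision + recall)


def benchmark(extractors, pages):
    # extractors: name -> callable(html) -> text (hypothetical wrappers
    # around Readability, an LLM, etc.)
    # pages: list of (html, reference_text) pairs
    if not pages:
        raise ValueError("need at least one page")
    scores = {}
    for name, extract in extractors.items():
        per_page = [token_f1(extract(html), ref) for html, ref in pages]
        scores[name] = sum(per_page) / len(per_page)
    return scores
```

Low precision would flag the "picked up junk" failures and low recall the "excised a section" ones, so it is probably worth reporting both alongside the average.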


Although it works most of the time, I've found it's common for it either to pick up things that shouldn't be included or to pick up only something like the footer and miss the actual body. This can be true even when, upon inspection, there's no clear reason why the body couldn't be identified. It's particularly problematic on many academic articles that are in HTML (sort of ironic). I'd also like a bit more normalization built in, even if it's turned off by default.
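
To make the normalization point concrete, here is a minimal sketch of the kind of opt-in pass I mean, applied to the extracted text afterwards. What counts as "normalization" is my assumption (Unicode compatibility forms, line endings, whitespace), and `normalize_text` is a hypothetical helper, not part of Readability:

```python
import re
import unicodedata


def normalize_text(text, enabled=False):
    # Off by default: return the extracted text untouched.
    if not enabled:
        return text
    # Fold compatibility characters (ligatures, non-breaking spaces, etc.).
    text = unicodedata.normalize("NFKC", text)
    # Unify line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, then trim spaces around newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```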



