
Context size limits are usually the reason. Most websites I want to scrape end up being over 200K tokens. HTML also tokenizes poorly: symbols like '<', '>', and '/' become separate tokens, whereas whole words in plain text are often a single token.

Possible approaches include converting the HTML to Markdown or minimizing the HTML (e.g., removing script tags, comments, etc.) before passing it to the model.
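
As a rough sketch of the minimization route (BeautifulSoup is just one option; the tags and attribute whitelist below are illustrative, not a fixed recipe):

    # Rough sketch: shrink an HTML page before feeding it to an LLM.
    # Assumes BeautifulSoup (bs4) is installed; tag/attribute choices are illustrative.
    from bs4 import BeautifulSoup, Comment

    def minimize_html(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")

        # Drop elements that carry no visible content.
        for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
            tag.decompose()

        # Drop HTML comments.
        for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
            comment.extract()

        # Strip attributes that add tokens but little meaning (class, style, data-*),
        # keeping only a small whitelist.
        for tag in soup.find_all(True):
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in ("href", "src", "alt")}

        return str(soup)

Depending on the site, this alone can cut the token count substantially; converting the cleaned HTML to Markdown (or just extracting the visible text) shrinks it further at the cost of losing the document structure.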



