I'll definitely move to the LaTeX source code instead of a PDF backend, since that allows better support for non-textual data that gets poorly scraped by GROBID. That is a really cool development I didn't know about; there's also https://ar5iv.labs.arxiv.org/ which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but also organizes the text into logical sections for me (abstract, intro, references), which I didn't want to do manually with heuristics I'd have to devise.
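To sketch what that section-splitting might look like against ar5iv's HTML instead of GROBID: the pages are produced by LaTeXML, which (as far as I can tell) wraps sections in `<section class="ltx_section">` with `ltx_title` headings. A rough, hypothetical extractor using only the stdlib, assuming that markup:

```python
# Hypothetical sketch: pulling logical sections out of ar5iv-style HTML
# instead of GROBID's PDF parsing. Assumes LaTeXML's markup
# (<section class="ltx_section"> containing an <h2 class="ltx_title">);
# real papers may vary, so treat the class names as assumptions.
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sections = []      # list of (title, body_text) pairs
        self.in_section = False
        self.in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "section" and "ltx_section" in cls:
            self.in_section = True
            self.title, self.text = "", []
        elif self.in_section and "ltx_title" in cls:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "section" and self.in_section:
            self.sections.append((self.title.strip(), " ".join(self.text)))
            self.in_section = False
        elif self.in_title and tag in ("h2", "h3"):
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.in_section and data.strip():
            self.text.append(data.strip())

# Inline sample standing in for a fetched ar5iv page.
sample = """
<section class="ltx_section"><h2 class="ltx_title">1 Introduction</h2>
<p>Transformers are great.</p></section>
<section class="ltx_section"><h2 class="ltx_title">2 Method</h2>
<p>We do things.</p></section>
"""
p = SectionExtractor()
p.feed(sample)
for title, body in p.sections:
    print(title, "->", body)
```

The nice part is that the section boundaries come from the markup itself rather than from layout heuristics over a PDF.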
"Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages."
When these outputs are fed as inputs to the next levels, isn't it even less surprising that the hallucinations/repetitions compound further?