
Definitely I'll move to the LaTeX source code instead of a PDF backend, since that allows better support for non-textual data that gets poorly scraped by GROBID. That is a really cool development I didn't know about; there's also https://ar5iv.labs.arxiv.org/, which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but also organizes the text into logical sections for me (intro, abstract, references), which I didn't want to do manually with heuristics I'd have to devise.
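
For what it's worth, grabbing the source is just a request to arXiv's e-print endpoint. A rough Python sketch (the arXiv ID is only an example, error handling and multi-file \input projects are skipped, and the \section-based splitting is a deliberately crude stand-in for the segmentation GROBID does):

    import io
    import re
    import tarfile
    import urllib.request

    arxiv_id = "1706.03762"  # example ID only
    url = f"https://arxiv.org/e-print/{arxiv_id}"  # arXiv serves the (usually gzipped) source tarball here

    with urllib.request.urlopen(url) as resp:
        data = resp.read()

    tex = ""
    # mode "r:*" lets tarfile handle the gzip layer; some submissions are a
    # single gzipped .tex rather than a tarball and would need a separate path.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.name.endswith(".tex"):
                tex += tar.extractfile(member).read().decode("utf-8", errors="ignore")

    # Crude sectioning on \section{...} headings instead of PDF layout heuristics.
    parts = re.split(r"\\section\*?\{([^}]*)\}", tex)
    sections = {parts[i]: parts[i + 1] for i in range(1, len(parts) - 1, 2)}
    print(list(sections))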


Maybe you could use: https://github.com/facebookresearch/nougat/tree/main or https://github.com/VikParuchuri/marker

Both are tools to convert PDFs into LaTeX or Markdown with LaTeX formulas. Maybe that helps.
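
FWIW nougat ships a CLI (nougat <pdf> -o <dir>, per its README at the time of writing; the flags may have changed), so a batch run is a short script. Sketch only, the folder names here are made up:

    import subprocess
    from pathlib import Path

    pdf_dir = Path("papers")   # hypothetical folder of input PDFs
    out_dir = Path("mmd_out")  # nougat writes .mmd (Mathpix Markdown) files here
    out_dir.mkdir(exist_ok=True)

    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # each run emits <name>.mmd with formulas kept as LaTeX
        subprocess.run(["nougat", str(pdf), "-o", str(out_dir)], check=True)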


Reading the motivation for the second:

"Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages."

When these outputs are then fed into the next stages as inputs, isn't it unsurprising that the hallucinations/repetitions compound even further?


Yes, using the LaTeX source code (or HTML, once that becomes reliable and widely used) should be much more robust than PDF parsing.


That is a very good resource to save.



