Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

In that case I'd use:

1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too) 2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images 3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text splitting over pages can be tricky to reassemble properly. PDFs are a pain the butt frankly.

[0] https://hub.jina.ai/executor/78yp7etm

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: