1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)
2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images
3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text that splits across pages can be tricky to reassemble properly. PDFs are a pain in the butt, frankly.
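
To give a rough idea of what step 3 involves, here's a minimal sketch in plain Python. It isn't part of Jina or either Executor above; `splice_pages` is a hypothetical helper, and it assumes you already have one OCR'd string per page, in reading order. The heuristics (rejoin hyphenated words, treat end-of-sentence punctuation as a paragraph break) are just my guesses at a sane starting point and will misfire on headers, footers, multi-column layouts, and genuinely hyphenated terms.

```python
import re


def splice_pages(pages):
    """Join per-page text into one string, repairing splits at page breaks.

    Hypothetical helper for illustration only; `pages` is a list of strings,
    one per page, in reading order.
    """
    out = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip blank pages
        if not out:
            out = text
        elif out.endswith("-") and text[0].islower():
            # Word hyphenated across the page break: drop the hyphen and join.
            out = out[:-1] + text
        elif re.search(r"[.!?:][\"')\]]*$", out):
            # Previous page ended on sentence punctuation: start a new paragraph.
            out += "\n\n" + text
        else:
            # Sentence continues across the page break: join with a space.
            out += " " + text
    return out


if __name__ == "__main__":
    print(splice_pages(["The quick brown fox jum-", "ps over the lazy dog.", "Next paragraph."]))
```

In practice you'd want to strip running headers and page numbers before splicing, otherwise they end up in the middle of sentences.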
[0] https://hub.jina.ai/executor/78yp7etm
[1] https://hub.jina.ai/executor/w4p7905v
[2] https://hub.jina.ai