
Under the hood, it uses https://github.com/pdfminer/pdfminer.six, which expects the text to be stored as text (i.e. an embedded text layer, not a scanned page image).
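A rough way to check whether a given PDF has a text layer at all, as a sketch only: `extract_text` is pdfminer.six's high-level helper, while `looks_scanned`, `check_pdf`, and the 20-character threshold are names and values I made up for illustration.

```python
def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not part of pdfminer.six): if almost no
    characters come back, the PDF probably holds page images only."""
    # strip() also drops the form feeds pdfminer.six emits between pages
    return len(extracted_text.strip()) < min_chars


def check_pdf(path: str) -> bool:
    # Imported lazily so the heuristic above can be used without pdfminer.six installed
    from pdfminer.high_level import extract_text

    return looks_scanned(extract_text(path))
```

If `check_pdf("some.pdf")` returns True, plain text extraction likely won't help and the file would need OCR instead.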


You mean the PDFSegmenter Executor in the notebook?


Yes


PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline.
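Roughly, the two-step idea looks like the sketch below. This is not PDFSegmenter's actual API: the `(kind, payload)` segment shape, `segments_to_text`, and the `ocr` callable are all hypothetical stand-ins, and you would plug in whatever OCR engine you use.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical segment shape: ("text", str) or ("image", bytes)
Segment = Tuple[str, object]


def segments_to_text(
    segments: Iterable[Segment], ocr: Callable[[bytes], str]
) -> List[str]:
    """Pass text segments through unchanged and run OCR on image segments."""
    out: List[str] = []
    for kind, payload in segments:
        if kind == "text":
            out.append(str(payload))
        elif kind == "image":
            # The OCR step is supplied by the caller; none is assumed here
            out.append(ocr(payload))
    return out
```

For example, `segments_to_text(segs, my_ocr)` would return one string per segment in the original order, where `segs` and `my_ocr` are supplied by you.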



