souvik3333's comments

souvik3333 · 2025-10-15T12:23:14 1760530994

We have evaluated against Gemini-2.5-flash. You can check the benchmarks here https://nanonets.com/research/nanonets-ocr-2/#markdown-evalu...

souvik3333 · 2025-10-15T12:13:18 1760530398

We have developed DocStrange to create LLM-ready data from images and PDFs. We have open-sourced a 3B finetuned model also. You can try both the open-sourced and private models from the demo.

HF: https://huggingface.co/nanonets/Nanonets-OCR2-3B Demo: https://docstrange.nanonets.com/ Blog: https://nanonets.com/research/nanonets-ocr-2/

This model is an improvement over our last open-source model. We have fixed some of the issues that the community faced and some of the features that were requested (handwritten, multi-lingual).

The models are trained on 3 million documents, including handwritten documents, financial reports, complex tables, documents with watermarks, and stamps. Feel free to try it and share feedback.

AdityaNahata · 2025-10-15T12:14:47 1760530487

Do you guys provide api support also? I am processing documents for a project

souvik3333 · 2025-10-15T12:15:59 1760530559

Yeah, we do have api support. Currently, you can process 10k documents per month free. Let me know if you face any issues.

souvik3333 · 2025-06-16T14:27:44 1750084064

We have trained the model on tables with hierarchical column headers and with rowspan and colspan >1. So it should work fine. This is the reason we predict the table in HTML instead of markdown.

nehalem · 2025-06-16T15:15:36 1750086936

Thank you. I was rather thinking of magazine like layouts with columns of text and headers and footers on every page holding article title and page number.

souvik3333 · 2025-06-16T16:06:40 1750090000

It should work there also. We have trained on research papers with two columns of text. Generally, papers have references as a footer and contains page number.

souvik3333 · 2025-06-16T13:30:33 1750080633

Actually, we have trained the model to convert to markdown and do semantic tagging at the same time. Eg, the equations will be extracted as LaTeX equations, and images (plots, figures, and so on) will be described within the `<img>` tags. Same with `<signature>`, `<watermark>`, <page_number>.

Also, we extract the tables as HTML tables instead of markdown for complex tables.

mgr86 · 2025-06-16T14:22:51 1750083771

Have you considered XML. TEI, for example, is very robust and mature for marking up documents.

esafak · 2025-06-16T14:57:53 1750085873

First I heard of it. https://en.wikipedia.org/wiki/Text_Encoding_Initiative

mgr86 · 2025-06-16T15:03:46 1750086226

Understandable. I work in academic publishing, and while the XML is everywhere crowd is graying, retiring, or even dying :( it still remains an excellent option for document markup. Additionally, a lot of government data produced in the US and EU make heavy use of XML technologies. I imagine they could be an interested consumer of Nanonets-OCR. TEI could be a good choice as well tested and developed conversions exist to other popular, less structured, formats.

agoose77 · 2025-06-16T17:26:33 1750094793

Do check out MyST Markdown (https://mystmd.org)! Academic publishing is a space that MyST is being used, such as https://www.elementalmicroscopy.com/ via Curvenote.

(I'm a MyST contributor)

viraptor · 2025-06-16T20:59:21 1750107561

Do you know why myst got traction, instead of RST which seems to have all the custom tagging and extensibility build in from the beginning?

agoose77 · 2025-06-17T10:44:13 1750157053

MyST Markdown (the MD flavour, not the same-named Document Engine) was inspired by ReST. It was created to address the main pain-point of ReST for incoming users (it's not Markdown!).

As a project, the tooling to parse MyST Markdown was built on top of Sphinx, which primarily expects ReST as input. Now, I would not be surprised if most _new_ Sphinx users are using MyST Markdown (but I have no data there!)

Subsequently, the Jupyter Book project that built those tools has pivoted to building a new document engine that's better focused on the use-cases of our audience and leaning into modern tooling.

jxramos · 2025-06-16T17:17:22 1750094242

maybe even epub, which is xhtml

lukev · 2025-06-16T21:15:30 1750108530

Yeah this really hurts. If your goal is to precisely mark up a document with some structural elements, XML is strictly superior to Markdown.

The fact that someone would go to all the work to build a model to extract the structure of documents, then choose an output format strictly less expressive than XML, speaks poorly of the state of cross-generational knowledge sharing within the industry.

prats226 · 2025-06-17T00:00:11 1750118411

I think the choice mainly stems from how you want to use the output. If the output is going to get fed to another LLM, then you want to select markup language where 1) the grammer would not cause too many issues with tokenization 2) which LLM has seen a lot in past 3) generates minimal number of tokens. I think markdown fits it much better compared to other markup languages.

If goal is to parse this output programmatically, then I agree a more structured markup language is better choice.

jtbayly · 2025-06-16T13:33:16 1750080796

What happens to footnotes?

souvik3333 · 2025-06-16T14:28:34 1750084114

They will be extracted in a new line as normal text. It will be the last line.

jtbayly · 2025-06-16T22:00:57 1750111257

So I’m left to manually link them up?

Have you considered using something like Pandoc’s method of marking them up? Footnotes are a fairly common part of scanned pages, and markdown that doesn’t indicate that a footnote is a footnote can be fairly incomprehensible.

agoose77 · 2025-06-17T10:45:06 1750157106

I am lazily posting this all over the thread, but do check out MyST Markdown too! https://mystmd.org. We handle footnotes as a structured object.

souvik3333 · 2025-06-16T09:19:54 1750065594

We have not trained explicitly on handwriting datasets (completely handwritten documents). But, there are lots of forms data with handwriting present in training. So, do try on your files, there is a huggingface demo, you can quickly test there: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s

We are currently working on creating completely handwritten document datasets for our next model release.

Eisenstein · 2025-06-16T10:31:45 1750069905

Document:

* https://imgur.com/cAtM8Qn

Result:

* https://imgur.com/ElUlZys

Perhaps it needed more than 1K tokens? But it took about an hour (number 28 in queue) to generate that and I didn't feel like trying again.

How many tokens does it usually take to represent a page of text with 554 characters?

souvik3333 · 2025-06-16T10:53:48 1750071228

Hey, the reason for the long processing time is that lots of people are using it, and with probably larger documents. I tested your file locally seems to be working correctly. https://ibb.co/C36RRjYs

Regarding the token limit, it depends on the text. We are using the qwen-2.5-vl tokenizer in case you are interested in reading about it.

You can run it very easily in a Colab notebook. This should be faster than the demo https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...

There are incorrect words in the extraction, so I would suggest you to wait for the handwritten text model's release.

mdaniel · 2025-06-16T14:17:50 1750083470

> I tested your file locally seems to be working correctly

Apologies if there's some unspoken nuance in this exchange, but by "working correctly" did you just mean that it ran to completion? I don't even recognize some of the unicode characters that it emitted (or maybe you're using some kind of strange font, I guess?)

Don't misunderstand me, a ginormous number of floating point numbers attempting to read that handwriting is already doing better than I can, but I was just trying to understand if you thought that outcome is what was expected

Eisenstein · 2025-06-16T14:26:28 1750083988

It actually did a decent. Perhaps the font is weird? For reference here is the 'ground truth' content, not in markdown:

Page# 8

Log: MA 6100 2.03.15

34 cement emitter resistors - 0.33R 5W 5% measure 0.29R 0.26R

35 replaced R436, R430 emitter resistors on R-chn P.O. brd w/new WW 5W .33R 5% w/ ceramic lead insulators

36 applied de-oxit d100 to speaker outs, card terminals, terminal blocks, output trans jacks

37 replace R-chn drivers and class A BJTs w/ BD139/146, & TIP31AG

38 placed boards back in

39 desoldered grnd lug from volume control

40 contact cleaner, Deoxit D5, faderlube on pots & switches teflon lube on rotor joint

41 cleaned ground lug & resoldered, reattached panel

souvik3333 · 2025-06-16T14:50:25 1750085425

This is the result. ``` Page 1 of 1 Page # <page_number>8</page_number>

Log: MA 6100 Z. O 3. 15

<table> <tr> <td>34</td> <td>cement emitter resistors -</td> </tr> <tr> <td></td> <td>0.33 R SW 5% measure</td> </tr> <tr> <td></td> <td>0.29 R, 0.26 R</td> </tr> <tr> <td>35</td> <td>replaced R'4 36, R4 30</td> </tr> <tr> <td></td> <td>emitter resistor on R-44</td> </tr> <tr> <td></td> <td>0.0. 3rd w/ new WW 5W .33R</td> </tr> <tr> <td>36</td> <td>% w/ ceramic lead insulators</td> </tr> <tr> <td></td> <td>applied de-oat d100 to Speak</td> </tr> <tr> <td></td> <td>outs, card terminals, terminal</td> </tr> <tr> <td></td> <td>blocks, output tran jacks</td> </tr> <tr> <td>37</td> <td>replace &-clun diviers</td> </tr> <tr> <td></td> <td>and class A BJTs w/ BD139/140</td> </tr> <tr> <td></td> <td>& TIP37A2</td> </tr> <tr> <td>38</td> <td>placed boards back in</td> </tr> <tr> <td>39</td> <td>desoldered ground lus from volume</td> </tr> <tr> <td></td> <td>(con 48)</td> </tr> <tr> <td>40</td> <td>contact cleaner, Deox. t DS, facel/42</td> </tr> <tr> <td></td> <td>on pots & switches</td> </tr> <tr> <td></td> <td>· teflon lube on rotor joint</td> </tr> <tr> <td>41</td> <td>reably cleaned ground lus &</td> </tr> <tr> <td></td> <td>resoldered, reattatched panel</td> </tr> </table> ```

You can paste it in https://markdownlivepreview.com/ and see the extraction. This is using the Colab notebook I have shared before.

Which Unicode characters are you mentioning here?

souvik3333 · 2025-06-16T07:05:34 1750057534

The model was primarily trained on English documents, which is why English is listed as the main language. However, the training data did include a smaller proportion of Chinese and various European languages. Additionally, the base model (Qwen-2.5-VL-3B) is multilingual. Someone on Reddit mentioned it worked on Chinese: https://www.reddit.com/r/LocalLLaMA/comments/1l9p54x/comment...

souvik3333 · 2025-06-16T06:58:14 1750057094

Hi, author of the model here..

We have a benchmark for evaluating VLM on document understanding tasks: https://idp-leaderboard.org/ . But unfortunately, it does not include image to markdown as a task. The problem with evaluating an image to markdown is that even if the order of two blocks are different, it can still be correct. Eg: if you have both seller info and buyer info side by side in the image one model can extract the seller info first, and another model can extract the buyer info first. Both model will be correct but depending on the ground truth if you do fuzzy matching one model will have higher accuracy than the other one.

Normally, a company will train and test on a dataset that is trained on the same type of annotation (either left block first or right block first), and all other models can get a low score on their benchmark because they are trained on the opposite order of annotations.

tensor · 2025-06-16T15:41:33 1750088493

The more important thing to me with any VLM is base OCR performance and hallucinations. It's not too hard to get improved average accuracy on very low quality scans using language models. Unfortunately these also typically produce large numbers of hallucinations, which are a deal breaker if you are trying to get out values for financial or legal purposes.

OCR that has lower accuracy, but where the inaccurate parts are left blank or flagged are far superior. Mistral OCR also suffers from this problem.

If your OCR produced bounding boxes for every text line, and ran a traditional OCR on the text, this could alleviate it. Or at the very least bounding boxes let users cross-correlate with output from traditional OCR.

Also a small note, it's probably best not to say your product beats Mistral when it's not even tested against it. Having more features doesn't make a product better if the accuracy is not better on those features.

I don't mean to be discouraging, this is an important space and it looks like you have a very feature rich model. I'd like to see a good solution be developed!

krapht · 2025-06-16T15:01:15 1750086075

If this is the only issue, can't this be addressed by normalizing the post-processed data before scoring? (that is, if it really is just a matter of block ordering)

souvik3333 · 2025-06-16T06:56:34 1750056994

Hi, author of the model here. It is an open-weight model, you can download it from here: https://huggingface.co/nanonets/Nanonets-OCR-s

gardnr · 2025-06-16T08:09:49 1750061389

Interestingly, another OCR model based on Qwen2.5-VL-3B just dropped which also publishes as Apache 2. It's right next to Nanonets-OCR-s on the HF "Trending" list.

https://huggingface.co/echo840/MonkeyOCR/blob/main/Recogniti...

CaptainFever · 2025-06-16T15:47:22 1750088842

IMO weights being downloadable doesn't mean it's open weight.

My understanding:

    - Weight available: You can download the weights.
    - Open weight: You can download the weights, and it is licensed freely (e.g. public domain, CC BY-SA, MIT).
    - Open source: (Debated) You can download the weights, it is licensed freely, and the training dataset is also available and licensed freely.

For context:

> You're right. The Apache-2.0 license was mistakenly listed, and I apologize for the confusion. Since it's a derivative of Qwen-2.5-VL-3B, it will have the same license as the base model (Qwen RESEARCH LICENSE AGREEMENT). Thanks for pointing this out.