
As a point of comparison, I recently ran an OCR evaluation as part of some work for a professor. For context, all documents were 1960s-era typed or handwritten documents in English, specifically from this archive: http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, getting the results below.

                      Overall         Typed           Handwritten
  OCR Engine          Leven   Cosine  Leven   Cosine  Leven   Cosine
  Amazon Textract     91.63%  98.14%  92.07%  98.76%  87.99%  92.10%
  Google Vision       93.05%  97.97%  93.84%  98.99%  85.86%  88.11%
  Microsoft Azure     80.32%  95.61%  80.65%  96.20%  79.14%  90.21%
  TrOCR               78.66%  93.97%  80.64%  96.65%  59.96%  67.89%
  PaddleOCR           84.82%  90.73%  88.60%  96.28%  49.64%  37.58%
  Tesseract           86.67%  89.53%  91.14%  95.63%  44.54%  31.39%
  Easy OCR            81.79%  85.07%  85.50%  91.89%  46.87%  19.23%
  Keras OCR           58.03%  83.57%  59.32%  89.98%  46.08%  21.20%

Leven is Levenshtein distance, expressed as a percentage similarity. Overall is a weighted average of typed vs handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.
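For anyone wanting to reproduce scores like these, here is a minimal sketch of how the two metrics and the weighted average could be computed. The exact normalization and tokenization behind the table are not stated, so dividing the edit distance by the longer string's length and using lowercase word counts for the cosine score are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(truth: str, ocr: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(truth), len(ocr))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(truth, ocr) / longest


def cosine_similarity(truth: str, ocr: str) -> float:
    """Cosine of the angle between word-count vectors (assumed tokenization)."""
    a = Counter(truth.lower().split())
    b = Counter(ocr.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def weighted_overall(typed: float, handwritten: float,
                     typed_weight: float = 0.9) -> float:
    """Overall score as a typed/handwritten weighted average (90/10 assumed)."""
    return typed_weight * typed + (1.0 - typed_weight) * handwritten


if __name__ == "__main__":
    truth = "The quick brown fox jumps over the lazy dog"
    ocr = "The quick brovvn fox jumps ovr the lazy dog"
    print(f"Leven:  {levenshtein_similarity(truth, ocr):.2%}")
    print(f"Cosine: {cosine_similarity(truth, ocr):.2%}")
    # Amazon Textract, Leven column: typed 92.07, handwritten 87.99
    print(f"Overall: {weighted_overall(92.07, 87.99):.2f}")
```

With a 90/10 split, the Textract Leven figures give about 91.66, close to the 91.63% in the table, which is consistent with the weighting being approximately but perhaps not exactly 90/10.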

From my analysis, Amazon Textract was excellent, the best of all the paid ones. TrOCR and PaddleOCR were the best FOSS ones, but the issue with them is that they require a GPU, while Tesseract I could run on CPU alone. For instance, here is the time to OCR all 50 documents:

  Tesseract       1:19
  TrOCR (GPU)    27:33
  TrOCR (CPU)  3:04:22
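To put those totals in per-document terms, here is a small sketch that converts the timings and compares them against Tesseract. It assumes the values are m:ss and h:mm:ss wall-clock times for the full 50-document batch; the helper name is hypothetical.

```python
def to_seconds(clock: str) -> int:
    """Convert 'm:ss' or 'h:mm:ss' into total seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


DOCS = 50  # documents in the hand-transcribed test set

# Totals copied from the timing table above.
timings = {
    "Tesseract": "1:19",
    "TrOCR (GPU)": "27:33",
    "TrOCR (CPU)": "3:04:22",
}

if __name__ == "__main__":
    baseline = to_seconds(timings["Tesseract"])
    for engine, clock in timings.items():
        total = to_seconds(clock)
        print(f"{engine:<12} {total / DOCS:7.2f} s/doc "
              f"{total / baseline:6.1f}x Tesseract")
```

Under those assumptions, TrOCR takes roughly 21 times as long as Tesseract on the GPU and roughly 140 times as long on the CPU.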

TrOCR is great if you need to do a few documents or have GPUs to burn, but Tesseract is far better if you need good-enough results for a large volume of documents. For my project, the intent was to make a software plugin that could be sent to libraries and universities, so CPU is king.