I’m working on a project where I need to extract text from images of handwritten content. So far, I’ve been using the Google Vision API, which has performed well for some text, including handwriting. However, I’m curious if there’s a more direct solution specifically for handling handwriting.
Would it be beneficial to use an LLM that can directly process and read handwriting, or should I continue using traditional OCR methods like Google Vision? I’m aware that LLMs like GPT-4o and Gemini have these capabilities, but I’m uncertain about how effectively they handle image-based input or handwriting.
Has anyone tried using LLMs for OCR? What would you recommend, and are there particular models that excel in this area?
Additionally, I plan to use an LLM to summarize the handwritten text, so I’ll need an LLM at some point in the process anyway.
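Since you'll need an LLM for the summary anyway, one option is to do transcription and summarization in a single vision call. Below is a minimal sketch assuming OpenAI's chat format with base64-embedded images; the model name, prompt, and file name are placeholders, so adapt them to whatever client and model you settle on:

```python
import base64

MODEL = "gpt-4o"  # assumption: any vision-capable chat model


def build_ocr_messages(image_bytes: bytes, mime: str = "image/png") -> list:
    """Build a chat `messages` list that embeds the image as a data URL
    and asks for a verbatim transcription plus a short summary."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Transcribe the handwriting in this image "
                            "verbatim, then give a two-sentence summary.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime};base64,{b64}"},
                },
            ],
        }
    ]

# Usage with the official client (not run here):
#   from openai import OpenAI
#   with open("note.png", "rb") as f:          # hypothetical input file
#       messages = build_ocr_messages(f.read())
#   resp = OpenAI().chat.completions.create(model=MODEL, messages=messages)
#   print(resp.choices[0].message.content)
```

Combining both steps in one request keeps latency and cost down versus a separate OCR pass followed by a summarization call, at the price of less control over the intermediate transcription.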
Avoid using LLMs for OCR, as it’s unnecessary and tends to be more error-prone. Keep in mind that OCR engines already utilize NLP technology for language correction; for instance, Tesseract has incorporated biLSTM for quite some time.
If cost isn’t a concern, then go ahead and use it. How frequently do you plan to use this? Are you prioritizing factors like cost or speed?
CNNs, transformers, and vision models specifically trained for handwriting will be much faster and more cost-effective. However, this depends on the difficulty of your test cases, and you might need to do some image cleaning.
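For scanned handwriting, "image cleaning" can often be as simple as grayscale conversion plus binarization before the image reaches the OCR engine. A minimal sketch using Pillow and NumPy, with Otsu's threshold implemented by hand for clarity (production code would typically use OpenCV's built-in version):

```python
import numpy as np
from PIL import Image


def clean_for_ocr(img: Image.Image) -> Image.Image:
    """Grayscale + Otsu binarization: a common minimal cleanup pass."""
    gray = np.asarray(img.convert("L"), dtype=np.uint8)
    # Otsu's method: choose the threshold that maximizes the
    # between-class variance of background vs. foreground pixels.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = gray.size
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    # Pixels above the threshold become white, the rest black.
    binary = (gray > best_t).astype(np.uint8) * 255
    return Image.fromarray(binary, mode="L")
```

Deskewing, noise removal, and contrast stretching are the usual next steps if simple binarization isn't enough for your worst pages.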
I’ve had good results with Qwen2-VL (both the 72B and 7B variants) in my tests, outperforming Claude 3.5 and seeming comparable to GPT-4o.
Yes, that sounds like a solid plan, and it matches what I’ve tested. Since the goal is also to summarize the handwritten text with an LLM, I’ll need one at some stage of the process anyway.
Cost shouldn’t be a significant concern since this is part of a product that will be sold to consumers. Users will help cover some of that cost through their subscriptions. Speed is less critical, as a processing time of 30 seconds is acceptable for this type of project.