I’m searching for a free pre-trained LLM that can accurately identify and extract all parts of an invoice (e.g., customer name, address, date, etc.) from German PDFs. I’ve already tried Tesseract, spaCy in Python, and our own trained models, but the results weren’t very good.
Can anyone recommend better pre-trained models for this task?
Invoice extraction is an excellent project, and German PDFs add to the complexity. For a task this specific, free pre-trained LLMs might not be sufficient. Consider these:
LayoutLM: Does a good job with document layout.
BERT-based models fine-tuned on German invoices.
RoBERTa with additional fine-tuning tailored to invoices.
I highly recommend using a commercial partner if you’re automating a real-world business process. Mistakes can disrupt your entire accounting system and expose you to serious liabilities.
It’s costly, though, so it only makes sense if you’re handling more than 100 invoices per month. Consider startups like Instabase, Paperbox, and Docdigitizer.
If you’re looking for pretrained large language models (LLMs) to extract invoice data from PDFs, here are two effective approaches based on the challenges of unstructured-to-structured data processing:
LangChain with Pydantic: Use LangChain, a Python-based LLM framework, alongside the Pydantic library. This lets you define a well-structured schema in Python (e.g., customer name, address, payment details) that the LLM is instructed to follow. By specifying a strict output format like JSON, you can extract and structure data efficiently. LangChain offers output parsers that turn the model’s response into validated Python objects, giving you structured output you can check.
Unstract: For a more specialized approach, Unstract is an open-source platform designed for document data extraction using LLMs. It includes a prompt engineering environment, Prompt Studio, which allows for generic prompts to handle different document types (like invoices, contracts, or resumes) and ensures the extracted data is in the proper format. Unstract helps handle variations in document structures and ensures high-quality structured output in JSON.
Both methods involve extracting text from PDFs (via tools like LLMWhisperer) and leveraging LLMs to produce well-defined structured data.
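To make the schema-first idea concrete, here is a minimal sketch of the "define a schema, ask for JSON, validate the reply" pattern. To keep it dependency-free it uses standard-library dataclasses as a stand-in for a Pydantic model, and it does not call LangChain, Unstract, or any LLM. The field names, the prompt wording, and the sample reply are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Invoice:
    # Hypothetical field set; extend it to match your own invoices.
    customer_name: str
    customer_address: str
    invoice_date: date


def build_prompt(invoice_text: str) -> str:
    """Ask the model for a JSON object with exactly the schema's keys."""
    keys = ", ".join(f.name for f in fields(Invoice))
    return (
        "Extract the following fields from the German invoice below and "
        f"reply with a single JSON object with the keys: {keys}. "
        "Use ISO format (YYYY-MM-DD) for dates.\n\n" + invoice_text
    )


def parse_invoice(llm_reply: str) -> Invoice:
    """Validate the model's reply against the schema; raise ValueError if it does not fit."""
    data = json.loads(llm_reply)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name for f in fields(Invoice)}
    if set(data) != expected:
        raise ValueError(f"missing or unexpected keys: {sorted(set(data) ^ expected)}")
    for key in expected:
        if not isinstance(data[key], str) or not data[key].strip():
            raise ValueError(f"{key} must be a non-empty string")
    return Invoice(
        customer_name=data["customer_name"].strip(),
        customer_address=data["customer_address"].strip(),
        # fromisoformat raises ValueError on anything that is not a valid ISO date
        invoice_date=date.fromisoformat(data["invoice_date"].strip()),
    )


# Stand-in for a real model reply; no LLM is called in this sketch.
sample_reply = (
    '{"customer_name": "Muster GmbH", '
    '"customer_address": "Musterstraße 1, 10115 Berlin", '
    '"invoice_date": "2024-03-15"}'
)
invoice = parse_invoice(sample_reply)
print(invoice)
```

In a real pipeline you would send build_prompt(text_from_pdf) to whichever model you use and pass its reply to parse_invoice. The point of validating is that a reply that doesn't match the schema fails loudly, so you can retry or flag the invoice for manual review instead of letting bad data into accounting. A Pydantic model would play the same role as the dataclass here, with the type checks handled for you.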