We are experts in developing information extraction solutions based on machine learning models, participate in information extraction research, and regularly give presentations on the topic.
Talk with Product Lead Axel Besinger about your Information Extraction projects.
From OCR to LLMs: The journey to reliable data extraction from complex retail documents
Axel Besinger and Augusto Stoffel (PhD)
May 23rd, 2025
AI-powered data extraction works - until it doesn’t. When handling structured tables in invoices, orders, or financial documents, we expect OCR, LLMs, and Vision AI to extract data reliably. However, complex documents - with nested tables, irregular structures, and edge cases - pose real challenges for AI models for document data extraction. With our solution smartextract, we tackled a real-world customer challenge: automating order entry from complex order documents and tables for a German shoe retailer. OCR and text-based LLMs struggled, and Vision LLMs were inconsistent. Only extensive customization could solve the problems that arose - including segmentation, few-shot prompting, fine-tuning, and even the possibility of training a custom computer vision model. In this talk, we will show why standard AI models struggle with complex tables and demonstrate in which cases segmentation helps. We will also show benchmarks of commercial vs. open-source models and discuss the trade-offs between OCR, LLMs, and computer vision models.
dida talks
Data extraction in the age of LLMs
Axel Besinger and Augusto Stoffel (PhD)
May 31st, 2024
In recent years, the advent of Large Language Models (LLMs) has changed the landscape of data extraction. LLMs offer strong text processing capabilities and come pre-trained on vast amounts of data, which makes them effective for information retrieval tasks. Traditional methods such as graph neural networks and extractive models, however, have historically been favored for their efficient use of resources. This raises the question: how do LLMs compare with those models in practical data extraction applications? This presentation examines the advantages and disadvantages of LLMs compared to extractive models. Drawing on our project experience and internal research, we discuss the practical implications of using LLMs for data extraction, including their efficacy, resource requirements, and overall performance in real-world scenarios. Attendees will gain a deeper understanding of the role of LLMs in modern data extraction workflows and of the considerations involved in implementing them. Link to the information extraction software: smartextract (https://smartextract.ai)
Understanding Customer Needs with NLP
Angela Maennel
January 19th, 2023
This talk covers the benefits of Natural Language Processing (NLP) over traditional, restrictive online input methods. NLP gives customers more flexibility by allowing free-form text instead of a limited set of phrases or checkboxes.
Information extraction with BERT from free-form text
Jona Welsch
April 28th, 2023
Jona Welsch's talk centers on using Deep Learning methods like BERT to extract information from unstructured text. A project with idealo serves as a case study, showcasing how rule-based algorithms and Deep Learning can be combined to turn product descriptions into structured data. The talk also touches on creating weakly labeled training data to ease the labeling process.
Graph neural networks for information extraction with PyTorch
Augusto Stoffel (PhD)
July 30th, 2021
In this talk, Augusto Stoffel introduces graph neural networks (GNNs) by comparing them to convolutional neural networks (CNNs). He describes how an image can be represented as a graph, which leads naturally into the basics of GNN architecture. The talk then covers Python implementations, particularly in the PyTorch framework, and focuses on GNN applications in information extraction from tabular documents in the field of NLP.
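The idea of treating an image as a graph can be illustrated with a minimal sketch. This is not code from the talk: the function names image_to_graph and message_passing_step are hypothetical, the 4-neighbour connectivity and the unweighted mean aggregation are assumptions made for illustration, and plain Python is used instead of PyTorch to keep the example self-contained.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (row, column) of a pixel


def image_to_graph(height: int, width: int) -> Dict[Node, List[Node]]:
    """Build an adjacency list with one node per pixel (4-neighbour grid)."""
    graph: Dict[Node, List[Node]] = {}
    for r in range(height):
        for c in range(width):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only neighbours that lie inside the image.
            graph[(r, c)] = [
                (i, j) for i, j in candidates if 0 <= i < height and 0 <= j < width
            ]
    return graph


def message_passing_step(
    image: List[List[float]], graph: Dict[Node, List[Node]]
) -> List[List[float]]:
    """Replace each pixel value by the mean of itself and its neighbours.

    This is an unweighted stand-in for a GNN layer: a real layer would
    apply learned weights before or after the aggregation.
    """
    out = [[0.0] * len(image[0]) for _ in image]
    for (r, c), neighbours in graph.items():
        values = [image[r][c]] + [image[i][j] for i, j in neighbours]
        out[r][c] = sum(values) / len(values)
    return out


# A 3x3 image with a single bright pixel in the centre.
image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
graph = image_to_graph(3, 3)
smoothed = message_passing_step(image, graph)
```

On a regular pixel grid, this aggregation behaves like a small fixed convolution kernel, which is the kind of correspondence the CNN comparison builds on. On an irregular graph, such as words on a document page, the same step works unchanged because it only relies on the neighbour lists.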
Labeling Tools - The second step on the way to the successful implementation of an NLP project
Ewelina Fiebig and Fabian Gringel
May 26th, 2021
The success of an NLP project depends on a series of steps, from data preparation to modeling and deployment. Since the input data are often scanned documents, the data preparation step first involves text recognition tools (OCR for short) and later so-called labeling tools. In this webinar, we discuss how to select a suitable labeling tool.
Text recognition (OCR) - The first step on the way to a successful implementation of an NLP project
Ewelina Fiebig and Fabian Gringel
May 26th, 2021
In this talk, we cover text recognition and introduce you to the following: what OCR means, an example of its use, why OCR is needed, which OCR tools are available, how these tools are used, and which tool fits which problem.