Information Extraction


We are experts in developing information extraction solutions based on machine learning models, participate in information extraction research and regularly give presentation about the topic.

Talk with Product Lead Axel Besinger about your Information Extraction projects.

  • 30min
  • |
  • online

Learn more ->

Extracting information from technical drawings


Frank Weilandt (PhD)


Did you ever need to combine data about an object from two different sources, say, images and text? We are often facing such challenges during our work at dida. Here we present an example from the realm of technical drawings. Such drawings are used in many fields for specialists to share information. They consist of drawings that follow very specific guidelines so that every specialist can understand what is depicted on them. Normally, technical drawings are given in formats that allow indexing, such as svg, html, dwg, dwf, etc. but many, especially older ones, only exist in image format (jpeg, png, bmp, etc.), for example from book scans. This kind of drawings is hard to access automatically which makes its use hard and time consuming. In this regard, automatic detection tools could be used to facilitate the search. In this blogpost, we will demonstrate how both traditional and deep-learning based computer vision techniques can be applied for information extraction from exploded-view drawings. We assume that such a drawing is given together with some textual information for each object on the drawing. The objects can be identified by numbers connected to them. Here is a rather simple example of such a drawing: An electric drill machine. There are three key components on each drawing: The numbers, the objects and the auxiliary lines. The auxiliary lines are used to connect the objects to the numbers. The task at hand will be to find all objects of a certain kind / class over a large number of drawings , e.g. the socket with number 653 in the image above appears in several drawings and even in drawings from other manufacturers. This is a typical classification task, but with a caveat: Since there is additional information for each object accessible through the numbers, we need to assign each number on the image to the corresponding object first. Next we describe this auxiliary task can be solved by using traditional computer vision techniques.

How to extract text from PDF files


Lovis Schmidt


In NLP projects the input documents often come as PDFs. Sometimes the PDFs already contain underlying text information, which makes it possible to extract text without the use of OCR tools. In the following I want to present some open-source PDF tools available in Python that can be used to extract text. I will compare their features and point out some drawbacks. Those tools are PyPDF2 , pdfminer and PyMuPDF . There are other Python PDF libraries which are either not able to extract text or focused on other tasks. Furthermore, there are tools that are able to extract text from PDF documents, but which are not available in Python. Both will not be discussed here. You might also want to read about past dida projects where we developed an information extraction with AI for product descriptions, an information extraction from customer requests or an information extraxction from PDF invoices .

Comparison of OCR tools: how to choose the best tool for your project


Fabian Gringel


Optical character recognition (short: OCR) is the task of automatically extracting text from images (coming as typical image formats such as PNG or JPG, but possibly also as a PDF file). Nowadays, there are a variety of OCR software tools and services for text recognition which are easy to use and make this task a no-brainer. In this blog post, I will compare four of the most popular tools: Tesseract OCR ABBYY FineReader Google Cloud Vision Amazon Textract I will show how to use them and assess their strengths and weaknesses based on their performance on a number of tasks. After reading this article you will be able to choose and apply an OCR tool suiting the needs of your project. Note that we restrict our focus on OCR for document images only, as opposed to any images containing text incidentally. Now let’s have a look at the document images we will use to assess the OCR engines.

Extracting information from documents


Frank Weilandt (PhD)


There is a growing demand for automatically processing letters and other documents. Powered by machine learning, modern OCR (optical character recognition) methods can digitize the text. But the next step consists of interpreting it. This requires approaches from fields such as information extraction and NLP (natural language processing) . Here we go through some heuristics how to read the date of a letter automatically using the Python OCR tool pytesseract . Hopefully, you can adapt some ideas to your own project.