Blog - Natural Language Processing


LLM strategies part 1: Possibilities of implementing Large Language Models in your organization


David Berscheid


Large Language Models (LLMs) are a highly discussed topic in current strategy meetings of organizations across all industries. This article is the first of two parts, providing some guidelines for organizations to determine their LLM strategy. It will help you identify the strategy with the most benefits while finding ways of solving the associated complexities. For more content on LLMs, see our LLM hub.

Extend the knowledge of your Large Language Model with RAG


Thanh Long Phan, Fabian Dechent


Large Language Models (LLMs) have rapidly gained popularity in Natural Language tasks due to their remarkable human-like ability to understand and generate text. Amidst great advances, there are still challenges to be solved on the way to building perfectly reliable assistants. LLMs are known to make up answers, often producing text that adheres to the expected style but lacks accuracy or factual grounding. Generated words and phrases are chosen because they are likely to follow previous text, where the likelihood is adjusted to fit the training corpus as closely as possible. This means a piece of information may be outdated if the corpus is not updated and the model retrained, or simply factually incorrect, while the generated words still sound correct and match the required genre. The core problem is that the LLM does not know what it does not know. In addition, even if a piece of information is correct, it is hard to track its source in order to enable fact-checking. In this article, we introduce RAG (Retrieval-Augmented Generation) as a method that addresses both problems and thus aims to enhance the reliability and accuracy of information generated by LLMs.
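The retrieve-then-generate loop at the heart of RAG can be sketched in a few lines of plain Python. The word-overlap scoring and prompt format below are simplified stand-ins (real systems use dense embeddings and a vector store), not the method from the article:

```python
# Minimal sketch of the RAG loop: score documents against the query,
# retrieve the best ones, and prepend them to the prompt as context.

def tokens(text: str) -> set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split()}

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the LLM's answer in the retrieved context."""
    context = "\n".join(docs)
    return f"Answer based only on this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The capital of France is Paris.",
    "Transformers use self-attention to process sequences.",
]
query = "What is the capital of France?"
prompt = build_prompt(query, retrieve(query, corpus))
```

Because the answer is grounded in retrieved text, the source of each claim is known, which addresses the fact-checking problem mentioned above.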

Latest developments in the world of Natural Language Processing: A comparison of different language models


Justus Tschötsch


Natural language processing (NLP) is a rapidly evolving sub-field of artificial intelligence. With ever new developments and breakthroughs, language models are already able to understand and generate human-like language with impressive accuracy. To keep track and catch up, we will compare different language models and have a look at the latest advancements, opportunities, and challenges of natural language processing.

How ChatGPT is fine-tuned using Reinforcement Learning


Thanh Long Phan


At the end of 2022, OpenAI released ChatGPT (a Transformer-based language model) to the public. Although based on the already widely discussed GPT-3, it launched an unprecedented boom in generative AI. It is capable of generating human-like text and has a wide range of applications, including language translation, language modeling, and generating text for applications such as chatbots. Feel free to also read our introduction to LLMs. ChatGPT seems to be so powerful that many people consider it a substantial step towards artificial general intelligence. The main reason for the recent successes of language models such as ChatGPT lies in their size (in terms of trainable parameters). But making language models bigger does not inherently make them better at following a user's intent. A bigger model can also become more toxic and more likely to "hallucinate". To mitigate these issues and to align models more generally with user intentions, one option is to apply Reinforcement Learning. In this blog post, we will present an overview of the training process of ChatGPT and have a closer look at the use of Reinforcement Learning in language modeling. Also interesting: our aggregated collection of LLM content.

Recommendation systems - an overview


Konrad Mundinger


Recommendation systems are everywhere. We use them to buy clothes, find restaurants and choose which TV show to watch. In this blog post, I will give an overview of the underlying basic concepts and common use cases, and discuss some limitations. This is the first of a series of articles about recommendation engines. Stay tuned for the follow-ups, where we will explore some of the mentioned concepts in much more detail! Already in 2010, 60% of watch time on YouTube came from recommendations [1], and personalized recommendations are said to increase conversion rates on e-commerce sites by up to 5 times [2]. It is safe to say that if customers are presented with a nice pre-selection of products, they will be less overwhelmed, more likely to consume something and have an overall better experience on the website. But how do recommendation engines work? Let's dive right in.
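One of the basic concepts the series will cover, user-based collaborative filtering, fits in a short sketch. The toy ratings and helper names below are invented for illustration:

```python
# Toy user-based collaborative filtering: recommend the items rated by
# the most similar user that the target user has not seen yet.
from math import sqrt

ratings = {
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob":   {"matrix": 5, "inception": 5, "dune": 4},
    "carol": {"titanic": 5, "notebook": 4},
}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user: str) -> list[str]:
    """Items rated by the most similar other user, unseen by `user`."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    return [i for i in ratings[nearest] if i not in ratings[user]]
```

Alice and Bob agree on the movies they both rated, so Alice gets Bob's unseen pick; real engines replace the dict with a sparse user-item matrix and add weighting, but the mechanism is the same.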

Image Captioning with Attention


Madina Kasymova


One sees an image and easily tells what is happening in it, since grasping and describing details of an image at a glance is a basic human ability. Can machines recognize different objects and their relationships in an image and describe them in natural language, just like humans do? This is the problem image captioning tries to solve. Image captioning is all about describing images in natural language (such as English), combining two core topics of artificial intelligence: computer vision and natural language processing. Image captioning is an incredible application of deep learning that has evolved considerably in recent years. This article will provide a high-level overview of image captioning architecture and explore the attention mechanism, the most common approach proposed to solve this problem. The most recent image captioning works have shown benefits in using a transformer-based approach, which is based solely on attention and learns the relationships between elements of a sequence without using recurrent or convolutional layers. We will not be considering transformer-based architectures here; instead, we will focus only on the attention-based approach.

Ethics in Natural Language Processing


Marty Oelschläger (PhD)


AI and machine learning have entered our day-to-day lives significantly. For example, we use search queries and are startled or even angered if the algorithm did not understand what we were actually looking for. Just imagine what an effort it would be to process all those queries by human individuals. In case you can't imagine it, CollegeHumor already prepared a vision of that: Fortunately, we taught machines, at least to some degree, to "understand" human language. This branch of machine learning is called natural language processing (NLP). We already gave an introduction, if you want to review the basics. However, since search engines, chatbots, and other NLP algorithms are not humans, we can employ them at large, i.e. global, scale. Since they are ubiquitous and used by very different people in various contexts, we want them to be objective and neutral (and not to be an annoyed and skeptical man as in the video above). But what if they are not neutral number crunchers? What if they are subjective and even carry harmful stereotypes against specific groups?

GPT-3 and beyond - Part 2: Shortcomings and remedies


Fabian Gringel


In the first part of this article I have described the basic idea behind GPT-3 and given some examples of what it is good at. This second and final part is dedicated to the “beyond” in the title. Here you will learn in which situations GPT-3 fails and why it is far from having proper natural language understanding, which approaches can help to mitigate the issues and might lead to the next breakthrough, what alternatives to GPT-3 there are already, and, in case you are wondering, what's the connection between GPT-3 and an octopus. Update February 14th '22: I have also included a section about OpenAI's new InstructGPT.

Data-centric Machine Learning: Making customized ML solutions production-ready


David Berscheid


As of 2021, there is little doubt that Machine Learning (ML) brings great potential to today’s world. In a study by Bitkom, 30% of companies in Germany state that they have planned or at least discussed attempts to leverage the value of ML. But while the companies’ willingness to invest in ML is rising, Accenture estimates that 80% to 85% of these projects remain a proof of concept and are not brought into production. Therefore at dida, we made it our core mission to bridge that gap between proof of concept and production software, which we achieve by applying data-centric techniques, among other things. In this article, we will see why many ML projects do not make it into production, introduce the concepts of model- and data-centric ML, and give examples of how we at dida improve projects by applying data-centric techniques.

GPT-3 and beyond - Part 1: The basic recipe


Fabian Gringel


GPT-3 is a neural network capable of solving a wide range of natural language processing (NLP) tasks, which was presented by OpenAI in the summer of 2020 (upscaling the previous models GPT and GPT-2). For various tasks it has set new state-of-the-art performances and is considered by many a substantial step in the direction of artificial general intelligence. “General intelligence” refers to the capability of not only behaving intelligently with respect to one set task, but also being able to adapt to and accomplish new, unforeseen tasks. This blog article is the first of a two-article series on GPT-3. In this first article I will explain how GPT-3 works, what it is good at, why some people think it’s dangerous, and how you can try out a GPT-3-like model for free. The second part will deal with GPT-3’s weaknesses and where to expect the next breakthrough in the future.

CLIP: Mining the treasure trove of unlabeled image data


Fabian Gringel


Digitization and the internet in particular have not only provided us with a seemingly inexhaustible source of textual data, but also of images. In the case of texts, this treasure has been lifted in the form of task-agnostic pretraining by language models such as BERT or GPT-3. Contrastive Language-Image Pretraining (short: CLIP) now does a similar thing with images, or rather: the combination of images and texts. In this blog article I will give a rough non-technical outline of how CLIP works, and I will also show how you can try CLIP out yourself! If you are more technically minded and care about the details, then I recommend reading the original publication, which I think is well written and comprehensible.

21 questions we ask our clients: Starting a successful ML project


Emilius Richter


Automating processes using machine learning (ML) algorithms can increase the efficiency of a system beyond human capacity, and is thus becoming more and more popular in many industries. But between an idea and a well-defined project there are several points that need to be considered in order to properly assess the economic potential and technical complexity of the project. Especially for companies like dida that offer custom workflow automation software, a well-prepared project helps to quickly assess the feasibility and the overall technical complexity of the project goals, which in turn makes it possible to deliver software that fulfills the client's requirements. In this article, we discuss which topics should be considered in advance and why the questions we ask are important for starting a successful ML software project.

Enhancing Search with Question Answering


Angela Maennel


What is called open-domain question answering in machine learning papers is nothing other than answering a question based on a large collection of texts, such as answering the question of a visitor to a large website using the website's content. Due to recent progress in machine reading comprehension, open-domain question answering systems have drastically improved. They used to rely on redundancy of information, but now they are able to “read” more carefully. Modern systems are able to quote a section of text that answers the question, or even reformulate it. What is still an aspiration is to generate longer, paragraph-length answers or to use multiple sources to puzzle together an answer. Google recently implemented such a feature into their search engine: if they find a passage that answers the question typed into the search field, the first result shows the corresponding website with the passage highlighted. There are many different systems that tackle open-domain question answering; here I will go into detail on one system in particular, DrQA (by Chen et al., 2017). This particular system splits the task into two parts, for each of which it is easier to get data than for the combined task. I will also explain how this idea can be used to create a question answering system for a website from an already existing search function.
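The two-part split can be illustrated with deliberately naive stand-ins for both stages: a retriever that narrows the corpus down to one passage, and a reader that extracts the answer from it. In DrQA the retriever is TF-IDF-based and the reader is a neural network; everything below is invented for illustration only:

```python
# Two-stage open-domain QA in the spirit of DrQA: retrieve, then read.

def words(text: str) -> set[str]:
    """Lowercased word set, trailing '?' stripped from the question."""
    return set(text.lower().strip("?").split())

def retrieve(question: str, passages: list[str]) -> str:
    """Stage 1: pick the passage sharing the most words with the question."""
    return max(passages, key=lambda p: len(words(question) & words(p)))

def read(question: str, passage: str) -> str:
    """Stage 2: a real reader predicts an answer span with a neural model;
    this stand-in just returns the best-matching sentence."""
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(words(question) & words(s)))

passages = [
    "Python was created by Guido van Rossum. It was first released in 1991.",
    "The Eiffel Tower is located in Paris. It was completed in 1889.",
]
answer = read("Who created Python?", retrieve("Who created Python?", passages))
```

The split is what makes training data easy to find: retrieval pairs come cheap (e.g. from search logs), and reading-comprehension datasets supply question/passage/span triples.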

Understanding graph neural networks by way of convolutional nets


Augusto Stoffel (PhD)


In this article, we will introduce the basic ideas behind graph neural networks (GNNs) through an analogy with convolutional neural networks (CNNs), which are very well known due to their prevalence in the field of computer vision. In fact, we'll see that convolutional nets are an example of GNNs, albeit one where the underlying graph is very simple, perhaps even boring. Once we see how to think of a convolutional net through this lens, it won't be hard to replace that boring graph with more interesting ones, and we'll arrive naturally at the general concept of GNN. After that, we will survey some applications of GNNs, including our use here at dida. But let's start with the basics.

How to identify duplicate files with Python


Ewelina Fiebig


Suppose you are working on an NLP project. Your input data are probably files like PDF, JPG, XML, TXT or similar and there are a lot of them. It is not unusual that in large data sets some documents with different names have exactly the same content, i.e. they are duplicates. There can be various reasons for this. Probably the most common one is improper storage and archiving of the documents. Regardless of the cause, it is important to find the duplicates and remove them from the data set before you start labeling the documents. In this blog post I will briefly demonstrate how the contents of different files can be compared using the Python module filecmp . After the duplicates have been identified, I will show how they can be deleted automatically.
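The comparison step described here can be sketched with the standard library alone; the helper names below are mine, not the article's:

```python
# Find duplicate files by comparing contents pairwise with filecmp,
# then delete all but the first copy of each duplicate pair.
import filecmp
import os
from itertools import combinations

def find_duplicates(paths: list[str]) -> list[tuple[str, str]]:
    """Return pairs of files with byte-identical content.
    shallow=False forces a real content comparison instead of
    relying on os.stat() metadata alone."""
    return [(a, b) for a, b in combinations(paths, 2)
            if filecmp.cmp(a, b, shallow=False)]

def remove_duplicates(paths: list[str]) -> None:
    """Keep the first file of each duplicate pair, delete the second."""
    for _, duplicate in find_duplicates(paths):
        if os.path.exists(duplicate):  # may already be gone via an earlier pair
            os.remove(duplicate)
```

Pairwise comparison is quadratic in the number of files; for very large data sets, hashing each file once (e.g. with hashlib) and grouping by hash scales better.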

How to extract text from PDF files


Lovis Schmidt


In NLP projects the input documents often come as PDFs. Sometimes the PDFs already contain underlying text information, which makes it possible to extract text without the use of OCR tools. In the following I want to present some open-source PDF tools available in Python that can be used to extract text. I will compare their features and point out some drawbacks. Those tools are PyPDF2, pdfminer and PyMuPDF. There are other Python PDF libraries which are either not able to extract text or focused on other tasks. Furthermore, there are tools that are able to extract text from PDF documents but are not available in Python. Neither will be discussed here. You might also want to read about past dida projects where we developed information extraction with AI for product descriptions, information extraction from customer requests and information extraction from PDF invoices.
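As a first taste of what the comparison covers, here is how extraction looks with one of the three tools, PyMuPDF (imported as fitz). The file name is a placeholder, and the snippet assumes the PDF has an underlying text layer:

```python
# Extract the text of each page of a PDF with PyMuPDF (module name: fitz).
# Requires: pip install pymupdf
import fitz  # PyMuPDF

def pdf_to_text(path: str) -> str:
    """Concatenate the text layers of all pages. Returns an empty string
    for purely scanned PDFs without a text layer (OCR is needed then)."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

# text = pdf_to_text("invoice.pdf")  # placeholder path
```

PyPDF2 and pdfminer expose the same capability through different APIs; the article compares how faithfully each preserves layout, ordering and special characters.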

BERT for question answering (Part 1)


Mattes Mollenhauer (PhD)


In this article, we are going to have a closer look at BERT, a state-of-the-art model for a range of problems in natural language processing. BERT was developed by Google, published in 2018, and is for example used as a part of Google's search engine. The term BERT is an acronym for Bidirectional Encoder Representations from Transformers, which may seem quite cryptic at first. The article is split up into two parts: in the first part we are going to see how BERT works, and in the second part we will have a look at some of its practical applications; in particular, we are going to examine the problem of automated question answering.

The best free labeling tools for text annotation in NLP


Fabian Gringel


In this blog post I'm going to present the three best free text annotation tools for manually labeling documents in NLP (Natural Language Processing) projects. You will learn how to install, configure and use them, and find out which one of them suits your purposes best. The tools I'm going to present are brat, doccano and INCEpTION. The selection is based on this comprehensive scientific review article and our hands-on experience from dida's NLP projects. I will discuss the tools one by one. For each of them, I will first give a general overview of what the tool is suited for, and then provide details (or links) regarding installation, configuration and usage. You might also find it interesting to check out our NLP content collection.

Digital public administration: intuitive online access through AI


Jona Welsch


The following article describes how AI can help to establish digital public administration services. We begin with a fundamental problem that AI can solve in this context: authorities often speak a language that is very different from colloquial language. Using the example of business registrations and the AI model "BERT", a possible solution is explained and ideas for further areas of application are outlined.

What is Bayesian Linear Regression? (Part 1)


Matthias Werner


Bayesian regression methods are very powerful, as they not only provide us with point estimates of regression parameters, but rather deliver an entire distribution over these parameters. This can be understood as not only learning one model, but an entire family of models and giving them different weights according to their likelihood of being correct. As this weight distribution depends on the observed data, Bayesian methods can give us an uncertainty quantification of our predictions representing what the model was able to learn from the data. The uncertainty measure could be e.g. the standard deviation of the predictions of all the models, something that point estimators will not provide by default. Knowing what the model doesn't know helps to make AI more explainable. To clarify the basic idea of Bayesian regression, we will stick to discussing Bayesian Linear Regression (BLR). BLR is the Bayesian approach to linear regression analysis. We will start with an example to motivate the method. To make things clearer, we will then introduce a couple of non-Bayesian methods that the reader might already be familiar with and discuss how they relate to Bayesian regression. In the following I assume that you have elementary knowledge of linear algebra and stochastics. Let's get started!
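For reference, the standard conjugate setup behind BLR can be written down compactly. The notation below is the common textbook one (design matrix X, weight vector w, noise variance σ², prior variance τ²) and may differ from the article's own:

```latex
\begin{align}
  y &= Xw + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
      && \text{(likelihood)} \\
  w &\sim \mathcal{N}(0, \tau^2 I)
      && \text{(prior)} \\
  w \mid X, y &\sim \mathcal{N}(\mu, \Sigma), \qquad
      \Sigma = \bigl(\sigma^{-2} X^\top X + \tau^{-2} I\bigr)^{-1}, \quad
      \mu = \sigma^{-2}\, \Sigma X^\top y
      && \text{(posterior)} \\
  y_* \mid x_*, X, y &\sim \mathcal{N}\!\bigl(x_*^\top \mu,\;
      x_*^\top \Sigma x_* + \sigma^2\bigr)
      && \text{(predictive)}
\end{align}
```

The predictive variance term x_*^⊤ Σ x_* is exactly the "uncertainty quantification" mentioned above: it grows where the data constrained the weights poorly, which a point estimator cannot report.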

Beat Tracking with Deep Neural Networks


Julius Richter


This is the last post in the three-part series covering machine learning approaches for time series and sequence modeling. In the first post, the basic principles and techniques for serial sequences in artificial neural networks were shown. The second post introduced a recent convolutional approach for time series called the temporal convolutional network (TCN), which shows great performance on sequence-to-sequence tasks (Bai, 2018). In this post, however, I will talk about a real-world application which employs a machine learning model for time series analysis. To this end, I will present a beat tracking algorithm, a computational method for extracting the beat positions from audio signals. The presented beat tracking system (Davies, 2019) is based on the TCN architecture, which captures the sequential structure of the audio input.

Comparison of OCR tools: how to choose the best tool for your project


Fabian Gringel


Optical character recognition (short: OCR) is the task of automatically extracting text from images (coming in typical image formats such as PNG or JPG, but possibly also as a PDF file). Nowadays, there is a variety of OCR software tools and services for text recognition which are easy to use and make this task a no-brainer. In this blog post, I will compare four of the most popular tools: Tesseract OCR, ABBYY FineReader, Google Cloud Vision and Amazon Textract. I will show how to use them and assess their strengths and weaknesses based on their performance on a number of tasks. After reading this article you will be able to choose and apply an OCR tool suiting the needs of your project. Note that we restrict our focus to OCR for document images only, as opposed to any images containing text incidentally. Now let’s have a look at the document images we will use to assess the OCR engines.

Temporal convolutional networks for sequence modeling


Julius Richter


This blog post is the second in a three-part series covering machine learning approaches for time series. In the first post, I talked about how to deal with serial sequences in artificial neural networks. In particular, recurrent models such as the LSTM were presented as an approach to process temporal data in order to analyze or predict future events. In this post, however, I will present a simple but powerful convolutional approach for sequences called the Temporal Convolutional Network (TCN). The network architecture was proposed in (Bai, 2018) and shows great performance on sequence-to-sequence tasks like machine translation or speech synthesis in text-to-speech (TTS) systems. Before I describe the architectural elements in detail, I will give a short introduction to sequence-to-sequence learning and the background of TCNs.

Machine Learning Approaches for Time Series


Julius Richter


This post is the first part of a series of posts that are linked together, as they all deal with the topic of time series and sequence modeling. In order to give a comprehensive piece of content that is easy to grasp, the series is segmented into three parts:

1. How to deal with time series and serial sequences? A recurrent approach.
2. Temporal Convolutional Networks (TCNs) for sequence modeling.
3. Beat tracking in audio files as an application of sequence modeling.

How to distribute a Tensorflow model as a JavaScript web app


Johan Dettmar


Anyone wanting to train a Machine Learning (ML) model these days has a plethora of Python frameworks to choose from. However, when it comes to distributing your trained model to something other than a Python environment, the number of options quickly drops. Luckily there is Tensorflow.js , a JavaScript (JS) subset of the popular Python framework with the same name. By converting a model such that it can be loaded by the JS framework, the inference can be done effectively in a web browser or a mobile app. The goal of this article is to show how to train a model in Python and then deploy it as a JS app which can be distributed online.
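The conversion step described above can be sketched with the tensorflowjs Python package; the stand-in model and output directory below are placeholders, not from the article:

```python
# Convert a trained Keras model into the format Tensorflow.js can load.
# Requires: pip install tensorflowjs
import tensorflowjs as tfjs
from tensorflow import keras

# Stand-in model; in practice you would load or train your real model here.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])

# Writes model.json plus binary weight shards into web_model/.
tfjs.converters.save_keras_model(model, "web_model")
```

On the JavaScript side, the browser then loads the converted model with `tf.loadLayersModel('web_model/model.json')` and runs inference entirely client-side.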

How Google Cloud facilitates Machine Learning projects


Johan Dettmar


Since not only the complexity of Machine Learning (ML) models but also the size of data sets continue to grow, so does the need for computer power. While most laptops today can handle a significant workload, the performance is often simply not enough for our purposes at dida. In the following article, we walk you through some of the most common bottlenecks and show how cloud services can help to speed things up.

What is Natural Language Processing (NLP)?


Fabian Gringel


Natural Language Processing (short: NLP , sometimes also called Computational Linguistics ) is one of the fields which has undergone a revolution since methods from Machine Learning (ML) have been applied to it. In this blog post I will explain what NLP is about and show how Machine Learning comes into play. In the end you will have learned which problems NLP deals with, what kinds of methods it uses and how Machine Learning models can be adapted to the specific structure of natural language data.

Extracting information from documents


Frank Weilandt (PhD)


There is a growing demand for automatically processing letters and other documents. Powered by machine learning, modern OCR (optical character recognition) methods can digitize the text. But the next step consists of interpreting it. This requires approaches from fields such as information extraction and NLP (natural language processing). Here we go through some heuristics for reading the date of a letter automatically using the Python OCR tool pytesseract. Hopefully, you can adapt some of these ideas to your own project.
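One heuristic of the kind discussed, sketched under the assumption that pytesseract has already produced the raw text (e.g. via `pytesseract.image_to_string(image)`): scan the output for date patterns and parse the first candidate that validates. The patterns below are illustrative, not the article's exact rules:

```python
# Scan raw OCR output for date patterns; patterns are tried in order
# and the first candidate that parses as a real date wins.
import re
from datetime import datetime
from typing import Optional

DATE_PATTERNS = [
    (r"\b(\d{1,2}\.\d{1,2}\.\d{4})\b", "%d.%m.%Y"),      # 24.03.2021 (German letters)
    (r"\b(\d{4}-\d{2}-\d{2})\b", "%Y-%m-%d"),            # 2021-03-24 (ISO)
    (r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b", "%B %d, %Y"),  # March 24, 2021
]

def extract_date(text: str) -> Optional[datetime]:
    """Return the first parsable date found in the OCR text, or None."""
    for pattern, fmt in DATE_PATTERNS:
        for candidate in re.findall(pattern, text):
            try:
                return datetime.strptime(candidate, fmt)
            except ValueError:
                # e.g. "99.99.2021" matches the regex but is not a real date
                continue
    return None
```

The try/except matters in practice: OCR noise regularly produces digit strings that look like dates but fail validation, and such candidates should be skipped rather than crash the pipeline.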