Blog - Tools


Managing layered requirements with pip-tools


Augusto Stoffel (PhD)


When building Python applications for production, it's good practice to pin all dependency versions, a process also known as “freezing the requirements”. This makes deployments reproducible and predictable. (For libraries and user applications, the needs are quite different: there one should support a wide range of versions for each dependency, in order to reduce the potential for conflicts.) In this post, we explain how to manage a layered requirements setup without forgoing the improved conflict resolution algorithm recently introduced in pip. We provide a Makefile that you can use right away in any of your projects!
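To make the layering concrete, here is a minimal sketch (file names and packages are illustrative, not taken from the post's Makefile): the development layer references the pinned production layer as a constraint, so the two stay consistent.

```
# requirements.in: direct production dependencies, loosely specified
flask

# dev-requirements.in: development layer, constrained by the pinned production layer
-c requirements.txt
pytest
```

Each layer is then compiled in order, production first:

```
pip-compile requirements.in --output-file requirements.txt
pip-compile dev-requirements.in --output-file dev-requirements.txt
```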

The best (Python) tools for remote sensing


Emilius Richter


An estimated 906 Earth observation satellites are currently in orbit, providing science and industry with many terabytes of data every day. The satellites operate with both radar and optical sensors and cover different spectral ranges with varying spectral, spatial, and temporal resolutions. Due to this broad spectrum of geospatial data, it is possible to find new applications for remote sensing methods in many industrial and governmental institutions. On our website, you can find some projects in which we have successfully used satellite data, as well as possible use cases of remote sensing methods for various industries. Well-known satellite systems and programs include Sentinel-1 (radar) and Sentinel-2 (optical) from ESA, Landsat (optical) from NASA, TerraSAR-X and TanDEM-X (both radar) from DLR, and PlanetScope (optical) from Planet.

There are basically two types of geospatial data: raster data and vector data.

Raster data

Raster data are a grid of regularly spaced pixels, where each pixel is associated with a geographic location, and are represented as a matrix. The pixel values depend on the type of information that is stored, e.g. brightness values for digital images or temperature values for thermal images. The size of the pixels also determines the spatial resolution of the raster. Geospatial raster data are thus used to represent satellite imagery. Raster images usually contain several bands or channels, e.g. a red, a green, and a blue channel. In satellite data, there are often also infrared and/or ultraviolet bands.

Vector data

Vector data represent geographic features on the earth's surface, such as cities, country borders, roads, bodies of water, property rights, etc. Such features are represented by one or more connected vertices, where a vertex defines a position in space by x-, y- and z-values. A single vertex is a point, multiple connected vertices form a line, and multiple (>3) connected and closed vertices are called a polygon. The x-, y- and z-values always refer to a coordinate reference system (CRS), which is stored in vector files as meta information. The most common file formats for vector data are GeoJSON, KML, and Shapefile.

In order to process and analyze these data, various tools are required. In the following, I will present the tools we at dida have had the best experience with and which are regularly used in our remote sensing projects. I present the tools one by one, grouped into the following sections:

Requesting satellite data: EOBrowser, Sentinelsat, Sentinelhub
Processing raster data: Rasterio, Pyproj, SNAP, pyroSAR, Rioxarray
Processing vector data: Shapely, Python-geojson, Geojson.io, Geopandas, Fiona
Providing geospatial data: QGIS, GeoServer, Leafmap
Processing meteorological satellite data: Wetterdienst, Wradlib
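As a small illustration of the raster/vector distinction (file names are placeholders; Rasterio and Geopandas are two of the tools discussed in the article), reading both kinds of data in Python might look like this:

```python
import rasterio
import geopandas as gpd

# Raster data: open a (hypothetical) GeoTIFF and read the first band
with rasterio.open("scene.tif") as src:
    band = src.read(1)       # 2D numpy array of pixel values
    print(src.crs)           # coordinate reference system of the raster
    print(src.transform)     # maps pixel indices to geographic coordinates

# Vector data: load a (hypothetical) GeoJSON file of country borders
gdf = gpd.read_file("borders.geojson")
print(gdf.crs)               # CRS stored as meta information
print(gdf.geometry.head())   # points, lines, and polygons
```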

How to implement a labeling tool for image classification in a Jupyter notebook


Felix Brunner


'Hotdog' or 'not hotdog'? That could be the question, at least when performing an image classification task. To address this or a similarly important question by means of a machine learning model, we first need to come up with a labeled dataset for training. That is, we sometimes have to manually look at hundreds or even thousands of images that may or may not contain hotdogs, and decide which ones do. One way to do that would be to open one image at a time and keep track of the image classes in another file, e.g. a spreadsheet. However, such a heavy-handed approach sounds rather tedious and is likely prone to fat-fingering errors. Wouldn't it be great if there were a streamlined solution that makes this labeling process more efficient, even fun? That is exactly what we set out to do in this article: create a simple annotation tool to easily assign class labels to a set of images.
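This is not the article's actual implementation, but a minimal sketch of the idea using ipywidgets (class names and image paths are placeholders): show one image at a time and record a label per button click.

```python
import ipywidgets as widgets
from IPython.display import display

image_paths = ["img_001.jpg", "img_002.jpg"]  # images to label (placeholders)
labels = {}                                    # filename -> assigned class
state = {"index": 0}

image_widget = widgets.Image(format="jpg")

def show_current():
    # Load the raw bytes of the current image into the widget
    with open(image_paths[state["index"]], "rb") as f:
        image_widget.value = f.read()

def make_handler(class_name):
    def on_click(button):
        # Record the chosen class and advance to the next image
        labels[image_paths[state["index"]]] = class_name
        state["index"] += 1
        if state["index"] < len(image_paths):
            show_current()
    return on_click

buttons = []
for class_name in ["hotdog", "not hotdog"]:
    button = widgets.Button(description=class_name)
    button.on_click(make_handler(class_name))
    buttons.append(button)

show_current()
display(image_widget, widgets.HBox(buttons))
```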

The best image labeling tools for Computer Vision


Dmitrii Iakushechkin


Creating a high-quality data set is a crucial part of any machine learning project. In practice, this often takes longer than the actual training and hyperparameter optimization. Thus, choosing an appropriate labeling tool is essential. Here we will have a closer look at some of the best image labeling tools for computer vision tasks:

labelme
labelImg
CVAT
hasty.ai
Labelbox

We will install and configure the tools and illustrate their capabilities by applying them to label real images for an object detection task. We will proceed by looking at the above tools one by one. Our collection of computer vision content also clearly shows how central such labeling tools are to our work as machine learning specialists.
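As a taste of what such tools produce, here is a hedged sketch that reads a labelme-style annotation file (the path is a placeholder, and the keys assume labelme's JSON export format; adjust if your version differs):

```python
import json

# labelme stores one JSON file per image with a "shapes" list of labeled geometries
with open("image_001.json") as f:
    annotation = json.load(f)

print(annotation["imagePath"])
for shape in annotation["shapes"]:
    # For object detection, rectangles carry two corner points
    print(shape["label"], shape["shape_type"], shape["points"])
```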

Understanding and converting MGRS coordinates in Python


Tiago Sanona


When working with satellite data, one needs to understand and possibly convert the coordinates the data is given in. Sometimes, especially if released by official bodies, satellite data is provided in MGRS tiles, which are derived from the UTM coordinate system. For example, this is true for Sentinel-2 tiles. In this post, I want to answer the following three questions using the Python libraries mgrs and pyproj: What is the difference between MGRS and UTM? To which MGRS tile does a point given in latitude and longitude degrees belong? How can I express an MGRS tile in lat/lon coordinates? Before we answer these questions, let's first look into what MGRS is.
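A minimal sketch of the round trip with the mgrs library (the coordinates are just an example point; the exact output string depends on the library version and the chosen precision):

```python
import mgrs

m = mgrs.MGRS()

# Latitude/longitude in WGS84 degrees to an MGRS reference;
# MGRSPrecision=5 corresponds to 1 m resolution within the 100 km square
reference = m.toMGRS(52.520008, 13.404954, MGRSPrecision=5)
print(reference)

# ... and back to latitude/longitude degrees
lat, lon = m.toLatLon(reference)
print(lat, lon)
```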

How to identify duplicate files with Python


Ewelina Fiebig


Suppose you are working on an NLP project. Your input data are probably files in formats like PDF, JPG, XML or TXT, and there are a lot of them. It is not unusual that in large data sets some documents with different names have exactly the same content, i.e. they are duplicates. There can be various reasons for this; probably the most common one is improper storage and archiving of the documents. Regardless of the cause, it is important to find the duplicates and remove them from the data set before you start labeling the documents. In this blog post I will briefly demonstrate how the contents of different files can be compared using the Python module filecmp. After the duplicates have been identified, I will show how they can be deleted automatically.
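As a preview, a minimal sketch of the pairwise comparison (the directory name is a placeholder): filecmp.cmp with shallow=False compares actual file contents rather than just os.stat() metadata.

```python
import filecmp
import os
from itertools import combinations

data_dir = "documents"  # placeholder directory containing the data set
paths = [
    os.path.join(data_dir, name)
    for name in sorted(os.listdir(data_dir))
    if os.path.isfile(os.path.join(data_dir, name))
]

# Compare every pair of files; shallow=False compares contents, not metadata
duplicates = set()
for path_a, path_b in combinations(paths, 2):
    if filecmp.cmp(path_a, path_b, shallow=False):
        duplicates.add(path_b)  # keep the first file, mark the second

# Delete the identified duplicates
for path in duplicates:
    os.remove(path)
```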

How to extract text from PDF files


Lovis Schmidt


In NLP projects the input documents often come as PDFs. Sometimes the PDFs already contain underlying text information, which makes it possible to extract text without the use of OCR tools. In the following I want to present some open-source PDF tools available in Python that can be used to extract text: PyPDF2, pdfminer and PyMuPDF. I will compare their features and point out some drawbacks. There are other Python PDF libraries which are either not able to extract text or are focused on other tasks, and there are tools that can extract text from PDF documents but are not available in Python; neither will be discussed here. You might also want to read about past dida projects where we developed information extraction with AI for product descriptions, information extraction from customer requests and information extraction from PDF invoices.
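As a preview of the kind of code involved, here is a minimal sketch using PyMuPDF (the file name is a placeholder; the method is called get_text in recent PyMuPDF versions):

```python
import fitz  # PyMuPDF

# Open a PDF and extract the embedded text layer page by page
with fitz.open("document.pdf") as doc:
    for page in doc:
        print(page.get_text())
```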

The best free labeling tools for text annotation in NLP


Fabian Gringel


In this blog post I'm going to present the three best free text annotation tools for manually labeling documents in NLP (Natural Language Processing) projects: brat, doccano and INCEpTION. You will learn how to install, configure and use them, and find out which one of them suits your purposes best. The selection is based on this comprehensive scientific review article and our hands-on experience from dida's NLP projects. I will discuss the tools one by one. For each of them, I will first give a general overview of what the tool is suited for, and then provide details (or links) regarding installation, configuration and usage. You might also find it interesting to check out our NLP content collection.
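Once documents are labeled, the annotations typically come back as simple files. As a hedged illustration, the snippet below reads a JSONL export in the style of doccano's sequence-labeling format (the schema varies between doccano versions, and the file name is a placeholder; adjust the keys to your export):

```python
import json

# Each line is one document with its text and a list of span annotations
with open("export.jsonl") as f:
    for line in f:
        record = json.loads(line)
        text = record["text"]
        for start, end, label in record.get("labels", []):
            print(label, text[start:end])
```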

Comparison of OCR tools: how to choose the best tool for your project


Fabian Gringel


Optical character recognition (short: OCR) is the task of automatically extracting text from images (coming in typical image formats such as PNG or JPG, but possibly also as a PDF file). Nowadays, there are a variety of OCR software tools and services for text recognition which are easy to use and make this task a no-brainer. In this blog post, I will compare four of the most popular tools:

Tesseract OCR
ABBYY FineReader
Google Cloud Vision
Amazon Textract

I will show how to use them and assess their strengths and weaknesses based on their performance on a number of tasks. After reading this article you will be able to choose and apply an OCR tool suiting the needs of your project. Note that we restrict our focus to OCR on document images only, as opposed to arbitrary images containing text incidentally. Now let's have a look at the document images we will use to assess the OCR engines.
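To give a sense of how easy the open-source option is to call from Python, here is a minimal Tesseract sketch via the pytesseract wrapper (the image path is a placeholder; the tesseract binary must be installed separately):

```python
from PIL import Image
import pytesseract

# Run Tesseract on a scanned document image and print the recognized text
image = Image.open("document.png")
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```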

How Google Cloud facilitates Machine Learning projects


Johan Dettmar


As not only the complexity of Machine Learning (ML) models but also the size of data sets continues to grow, so does the need for computing power. While most laptops today can handle a significant workload, their performance is often simply not enough for our purposes at dida. In the following article, we walk you through some of the most common bottlenecks and show how cloud services can help to speed things up.