Extracting information from technical drawings

Frank Weilandt (PhD)

Technical drawing

Have you ever needed to combine data about an object from two different sources, say, images and text? We often face such challenges in our work at dida. Here we present an example from the realm of technical drawings. Such drawings are used in many fields by specialists to share information. They follow very specific guidelines so that every specialist can understand what is depicted. Normally, technical drawings are provided in formats that allow indexing, such as svg, html, dwg or dwf, but many, especially older ones, exist only as images (jpeg, png, bmp, etc.), for example from book scans. Drawings of this kind are hard to access automatically, which makes using them difficult and time-consuming. Here, automatic detection tools can facilitate the search.

In this blog post, we will demonstrate how both traditional and deep-learning-based computer vision techniques can be applied to extract information from exploded-view drawings. We assume that such a drawing is given together with some textual information for each object on the drawing. The objects can be identified by the numbers connected to them. Here is a rather simple example of such a drawing: an electric drill.

 Exploded-view image of an electrical drill machine.

There are three key components on each drawing: the numbers, the objects and the auxiliary lines. The auxiliary lines connect the objects to the numbers.

The task at hand will be to find all objects of a certain kind/class over a large number of drawings, e.g. the socket with number 653 in the image above appears in several drawings and even in drawings from other manufacturers. This is a typical classification task, but with a caveat: since there is additional information for each object accessible through the numbers, we first need to assign each number on the image to the corresponding object. Next we describe how this auxiliary task can be solved using traditional computer vision techniques.

Classical Computer Vision Techniques

We mainly use traditional computer vision techniques to find these components and their relationships. A complex machine learning architecture could probably do this as well, but in our experience heuristics already work, and this way we also do not need manually created bounding boxes to learn from. Our algorithm presented below goes through all the drawings, marks the boundaries of each object and finds the number attached to each object.

When considering technical drawings, some of the expected challenges are:

  1. Image crowding - technical drawings usually have several, possibly overlapping, objects.

  2. Object variability - the objects present on the image have high variability in size and shape.

  3. Low detail - the drawings consist mostly of contours without texture, color, etc.

  4. Noise/artifacts - some lines do not belong to any object, such as auxiliary lines.

Given the challenge at hand, we divide our approach into three parts.

1. Number Detection

On technical drawings, numbers usually appear in very predictable ways: same font, similar size and orientation. So one can use template matching - at least within drawings from the same company. Using an OCR tool like Tesseract is also possible and generalizes better over drawings from different companies.
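To illustrate the template matching idea, here is a minimal pure-NumPy sketch of normalized cross-correlation, assuming a binarized drawing and one template image per digit or number. In practice one would rather use an optimized implementation such as OpenCV's `cv2.matchTemplate`, or an OCR tool like Tesseract; the function name and threshold below are our own choices.

```python
import numpy as np

def match_template(image, template, threshold=0.95):
    """Slide a number template over a binarized drawing and return the
    top-left corners (x, y) where the normalized cross-correlation
    exceeds `threshold`."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    hits = []
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.linalg.norm(p) * t_norm
            if denom == 0:          # constant patch, e.g. empty background
                continue
            score = float((p * t).sum() / denom)
            if score >= threshold:
                hits.append((x, y, score))
    return hits
```

A perfect occurrence of the template yields a score of 1.0, which is why this works well when font, size and orientation are as predictable as they are within drawings from one company.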

2. Image Preprocessing

This step aims to transform the image so that possible artifacts are removed or reduced, improving the performance of the final digits-to-object assignment. In short, we first remove the numbers from the image (cf. step 1). Next we perform an image binarization followed by a transformation from contours into "solid" objects. We close "gaps" in the contours using morphological operations like dilation. Additionally, we mark the interior pixels of each object as foreground. We finish the preprocessing by removing the auxiliary lines. This gives us connected components as shown in the image below.

 Connected components of the drill machine.
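The preprocessing steps can be sketched with SciPy's morphology tools. This is a simplified sketch under our own assumptions (fixed binarization threshold, a square structuring element, numbers already masked out, and auxiliary-line removal omitted); the exact operations and parameters in a real pipeline would be tuned per drawing style.

```python
import numpy as np
from scipy import ndimage

def preprocess(drawing, close_size=3):
    """Turn a grayscale drawing (0..255, dark lines on light paper) into
    labeled solid components.  Assumes the part numbers have already
    been removed (step 1); auxiliary-line removal is omitted here."""
    # 1. Binarize: dark line pixels become foreground (True).
    binary = drawing < 128
    # 2. Close small gaps in the contours with morphological closing.
    closed = ndimage.binary_closing(
        binary, structure=np.ones((close_size, close_size)))
    # 3. Mark interior pixels as foreground -> "solid" objects.
    solid = ndimage.binary_fill_holes(closed)
    # 4. Split the foreground into connected components.
    labels, n = ndimage.label(solid)
    return labels, n
```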

3. Number-Object Matching

The strongest hint for matching numbers to objects is by far the distance. By looking at the neighborhood of each number, we find the closest object, which can be considered the match. If we did not correctly split the foreground into connected components before, then two numbers might be close to the same component. In that case we take each of the two numbers, follow their auxiliary lines and split the component such that each number gets assigned to its own part of the connected component.
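The distance-based matching can be sketched as a nearest-foreground-pixel lookup on the labeled image from the preprocessing step. The function name and input format are our own assumptions; the splitting of shared components along auxiliary lines is not shown.

```python
import numpy as np

def match_numbers(labels, number_positions):
    """Assign each detected part number to the nearest connected
    component.  `labels` is a labeled image (0 = background) and
    `number_positions` maps each number to the (row, col) of its
    center on the drawing."""
    fg = np.argwhere(labels > 0)  # coordinates of all object pixels
    assignment = {}
    for number, pos in number_positions.items():
        # Euclidean distance from the number to every object pixel;
        # the label of the closest pixel is the matched component.
        d = np.linalg.norm(fg - np.array(pos), axis=1)
        nearest = fg[np.argmin(d)]
        assignment[number] = int(labels[tuple(nearest)])
    return assignment
```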

After these steps, we can say for every number on the image where the corresponding object is. This is easy for humans, but challenging for the computer. Now we can use this information to feed each object's image into a classifier: we have a set of object images, and for each one we know, via its number, which class it belongs to. It is time for some machine learning.

Deep Learning for Object Classification

We take the connected components from the drawing and draw a bounding box around each of them, which gives us several new, smaller images we want to classify. One could also include the textual information and let a text classifier and the image classifier vote, but text classifiers are outside the scope of this blog post.


We break down the classification of each image into the following two steps, which gives us the freedom to try different approaches for each step:

- the image encoder: Take the object's image and turn it into a much lower-dimensional vector (say, 512 numbers) which contains the meaningful information for classification.

- the vector classifier: Find the type of the object from the vector produced by the image encoder.

The encoder is a function $$E$$ whose input is the extracted image $$Im$$. The output of $$E$$ is then sent to our vector classifier $$C$$. We call the composition of these two functions the model $$M$$, i.e. $$M = C \circ E$$. Hence, $$M(Im) = C(E(Im))$$ is the class which the algorithm outputs. Of course, both of these functions have parameters/weights, and these can be found using training data. Unsurprisingly, we use neural networks, which are very effective for image classification.

The Image Encoder

For all the approaches, we choose our model $$M = C \circ E$$ to be a ResNet18, a function with millions of parameters, where $$E$$ is a convolutional neural network (CNN) and $$C$$ a fully connected layer. We also load parameter values pretrained on the ImageNet classification task. However, we need to discard $$C$$ and replace it with one that suits our needs, because $$C$$ was originally used for a different classification task (animals, objects you see on the street, etc.). It is easy to load $$M$$ using PyTorch, TensorFlow or a similar deep learning library.


One can use one of the following two approaches:

- Transfer learning: Use the loaded encoder $$E$$ and train the parameters of the classifier $$C$$ using training data for our specific task.

- Metric learning: Train a distance function which ensures that $$E(Im_1)$$ and $$E(Im_2)$$ are close / similar if and only if the objects on images $$Im_1$$ and $$Im_2$$ are of the same type.

Direct Classification Using Transfer Learning

Here we simply replace the component $$C$$ of the original ResNet18 model by a new linear function whose output dimension is the number of object classes. The assigned type is then the class whose value in the output vector is highest. We have two options:

- Train only the parameters of $$C$$, keeping the pretrained weights in $$E$$ fixed. This approach is useful even though the encoder $$E$$ was trained on photographs.

- Train the parameters of both $$C$$ and $$E$$. This is only feasible when the number of training examples per object class is high.
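The first option, freezing the encoder and fitting only the new head, can be sketched as follows. This assumes a PyTorch model whose classifier attribute is called `fc` (as in torchvision's ResNet18) and a standard `DataLoader`; the function name and hyperparameters are our own.

```python
import torch

def train_head(model, train_loader, lr=1e-3, epochs=1):
    """Transfer learning, option 1: keep the pretrained encoder E
    frozen and only fit the parameters of the new classifier C
    (`model.fc`)."""
    for name, param in model.named_parameters():
        # Only the classifier head stays trainable.
        param.requires_grad = name.startswith("fc")
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
    return model
```

For the second option one would simply leave all parameters trainable and typically use a lower learning rate for the encoder.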

In order to generate more training images, we also apply image augmentation. This is a standard technique to create new images from old ones: We use shearing (the object could be observed from a slightly different angle), cropping (the bounding box is not always perfect, anyway), differing brightness/contrast and horizontal flipping.

Metric Learning

This approach is useful if the number of training examples per object type is very low (say, if we have lots of object classes, but only three or five images for each object type). For each pair of images $$Im_1, Im_2$$, we apply the encoder $$E$$ to both and then learn a distance function $$d$$ which should yield $$d(E(Im_1), E(Im_2)) = 0$$ if both images belong to the same class and $$d(E(Im_1), E(Im_2)) = 1$$, otherwise. For classification of an image $$Im$$, we look for the closest training image $$Im_{Train}$$ (i.e. $$d(E(Im), E(Im_{Train}))$$ should be minimal) and assign the type of $$Im_{Train}$$ to $$Im$$.

The parameters of the distance function $$d$$ are learned using the training data. This way, the distance function learns to extract what features are significant for distinguishing object classes.

Metric learning has benefits when adding new object classes: First, a substantial number of labels is only necessary for a subset of the object classes. Under ideal circumstances, a model trained on a subset of classes generalizes well to semantically similar classes. This allows for the number of classes to increase without retraining the model. Second, one can do one-shot learning, i.e. classification based on a single reference sample.


We sketched a pipeline for the problem of interpreting technical drawings. It shows that our projects usually cannot be completed by simply training one textbook cats-vs-dogs classifier. If the numbers in the drawings do not matter, one can try more standard object detection methods like YOLO. The problem we presented here is different because we use fewer images, which are, however, more standardized and enriched with numbers.

There are still lots of technical drawings which need to be processed automatically. Of course, it would be more convenient to simply have structured CAD data instead of images - but this would also require some kind of agreement on digital standards between the manufacturing companies.