Anomaly Detection in Track Scenes

Detect anomalous objects from videos taken by cameras on a train.


A large set of unlabelled videos and a small set of labelled videos, all taken by cameras on the train


Masks/bounding boxes of objects in the scene, with an anomaly score for each object


Detect objects within the volume surrounding the upcoming path of the train and estimate how anomalous they are


Within the sector initiative “Digitale Schiene Deutschland”, our client Deutsche Bahn is developing an automated driving system for trains. To ensure that this system operates safely, anomalous or hazardous objects must be detected automatically. The system is deliberately required not to be limited to a given collection of classes (such as people, signals or vehicles); it must be able to detect any object and rank the detected objects by how anomalous they are. In a series of two projects, we have created a pipeline that can detect objects from RGB videos.

Starting Point

Traditionally, train drivers are responsible for detecting anomalous objects while driving. Our goal is to build a machine learning system to perform this task.

As a dataset, Digitale Schiene Deutschland provides us with OSDAR23, an open dataset annotated with twenty classes of objects, which we use both for finetuning our model and for evaluating the final results. We are also granted access to a larger amount of unannotated data, which we use for self-supervised learning.

The OSDAR23 dataset contains 45 scenes. Each scene contains images taken by several RGB cameras and infrared cameras, together with radar and lidar data. The unannotated data contains only images taken by RGB cameras.


While it is a relatively easy task to detect objects from a given collection of classes, our aim is to build a system that can detect any possibly anomalous object. This requirement poses significant difficulty, as we cannot simply train a model using the annotations in the dataset, but rather have to develop other methods to segment objects.

Another difficulty comes from the definition (or interpretation) of "anomalous objects". Although the dataset is annotated with different classes of objects, the majority of them, such as people standing on the platform of a station, are considered "normal" and not hazardous to the train. Therefore we lack examples of anomalous objects and have to find a suitable way to interpret the term.


Our approach to the problem begins with the following crucial observation: an anomalous object always "sticks out" from the flat surface that it stands on (ground, platform, etc.), and thus corresponds to a local minimum in the depth map.

A "depth map" for a scene is an image where the value at each pixel is the distance from the lens of the camera to the scene surface visible at that pixel. In the example above, the right image is an estimated depth map of the left image. Note that the objects (sign, pole and tree) “stick out” from their backgrounds.
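The observation can be illustrated with a minimal sketch on a synthetic depth map. The function name `sticking_out_mask`, the row-median comparison and the `rel_margin` threshold are assumptions made for illustration only; they are not the detection rule used in the project.

```python
import numpy as np

def sticking_out_mask(depth, rel_margin=0.2):
    """Flag pixels clearly closer to the camera than the typical depth of their image row."""
    # Hypothetical rule: compare each pixel with the median depth of its row.
    row_median = np.median(depth, axis=1, keepdims=True)
    return depth < (1.0 - rel_margin) * row_median

# Toy depth map in metres: a ground plane, far at the top row and near at the bottom row.
height, width = 6, 8
depth = np.repeat(np.linspace(50.0, 5.0, height)[:, None], width, axis=1)

# A thin vertical object standing on the ground 14 m away (rows 1 to 4, column 5).
depth[1:5, 5] = 14.0

mask = sticking_out_mask(depth)
print(np.argwhere(mask))  # (row, column) indices of flagged pixels
```

In this toy scene the upper part of the object is flagged, while its lowest pixel is not, because there the object has the same depth as the ground it stands on.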

Based on this idea, we build our solution in three steps:

  1. given a video frame, produce an estimated depth map;

  2. using the depth map as guideline, pass the frame to an object segmentation model to obtain masks of objects;

  3. sort and select the most relevant objects, again with the help of the depth map.

Technical Background

We make use of two main tools: monocular depth estimation and the Segment Anything Model.

1. Monocular depth estimation

For depth estimation, we use MonoViT, a self-supervised model based on a vision transformer.

The original model was trained on the KITTI dataset, which consists of videos taken from a car; to make the model more suitable for our use case, we finetune it using the unannotated data from Digitale Schiene Deutschland.

The finetuning works by training two models at the same time: a depth model, which computes an estimate of the depth map from a single RGB image, and a PoseNet model, which estimates the 3D transformation between two frames in a video. Given two frames A and B (usually consecutive) from a video, we use the estimated transformation between them to transform the 3D point cloud obtained from frame A and its estimated depth map. After reprojecting this point cloud into the view of frame B, we get a reconstruction of frame B. We then train both models to minimize the image reconstruction error.
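The geometry behind this reconstruction can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the intrinsics matrix `K`, the 4x4 pose `T_a_to_b`, and the helper names `backproject` and `project_a_to_b` are hypothetical, depth is taken as the distance along the optical axis, and actual training code would use differentiable image sampling rather than plain arrays.

```python
import numpy as np

def backproject(depth, K):
    """Turn a depth map (z along the optical axis) into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T          # viewing ray per pixel, with z = 1
    return rays * depth.reshape(-1, 1)

def project_a_to_b(depth_a, K, T_a_to_b):
    """For each pixel of frame A, compute where its 3D point lands in frame B."""
    points_a = backproject(depth_a, K)
    points_h = np.hstack([points_a, np.ones((len(points_a), 1))])
    points_b = (points_h @ T_a_to_b.T)[:, :3]   # rigid transform into camera B
    proj = points_b @ K.T
    uv = proj[:, :2] / proj[:, 2:3]             # perspective division
    return uv.reshape(depth_a.shape + (2,))

# Assumed toy values: a 4x4 image, all points 10 m away.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth_a = np.full((4, 4), 10.0)

T_identity = np.eye(4)
T_shift = np.eye(4)
T_shift[0, 3] = -0.5                            # points move 0.5 m along -x in camera B

uv_same = project_a_to_b(depth_a, K, T_identity)
uv_shifted = project_a_to_b(depth_a, K, T_shift)
print(uv_shifted[0, 0] - uv_same[0, 0])         # horizontal shift in pixels
```

The reconstruction error then compares the colours of frame A with the colours of frame B sampled at the projected coordinates; errors in either the depth or the pose estimate show up as a mismatch.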

The performance of the model is visibly improved after finetuning.





We have compared the output of MonoViT with the lidar data available in the OSDAR23 dataset. We found that the lidar data is not accurately aligned with the RGB images, but it can still be used as a sanity check.

2. Segment anything model

In the first project phase, we used a watershed algorithm from classical image processing on the depth map to segment objects. There are two issues with this approach:

  1. The resolution of the depth map is relatively low.

  2. The boundary is not well defined; for example, the foot of a person standing on the ground has the same depth as the ground.

This step is improved in the second project phase, where we use the pretrained Segment Anything Model (SAM) to produce high-quality masks of the objects. The depth map helps us locate "interesting" points in the image, which we pass to SAM as prompts.
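How the "interesting" points are chosen is not spelled out here, so the following is only one possible sketch: score each pixel by how much closer it is than the median of its image row, then greedily keep the strongest pixels that are at least `min_dist` pixels apart. The function name `select_prompt_points` and all thresholds are hypothetical.

```python
import numpy as np

def select_prompt_points(depth, max_points=5, min_dist=3, min_score=1.0):
    """Pick well-separated pixels that stick out most from their row; returns (x, y) pairs."""
    # Hypothetical score: how many metres closer a pixel is than its row median.
    score = np.median(depth, axis=1, keepdims=True) - depth
    order = np.argsort(score, axis=None)[::-1]          # strongest candidates first
    rows, cols = np.unravel_index(order, depth.shape)
    chosen = []
    for r, c in zip(rows, cols):
        if score[r, c] < min_score or len(chosen) == max_points:
            break
        # Greedy suppression: skip candidates too close to an already chosen point.
        if all(max(abs(r - pr), abs(c - pc)) >= min_dist for pr, pc in chosen):
            chosen.append((r, c))
    return np.array([(c, r) for r, c in chosen], dtype=float).reshape(-1, 2)

# Toy depth map in metres: ground plane plus two objects standing on it.
depth = np.repeat(np.linspace(50.0, 5.0, 6)[:, None], 8, axis=1)
depth[1:5, 5] = 14.0   # tall thin object, 14 m away
depth[2:4, 1] = 23.0   # shorter object, 23 m away

points = select_prompt_points(depth)
print(points)          # one (x, y) prompt per object in this toy scene
```

Points in this (x, y) layout, together with an array of foreground labels, match the shape that the `point_coords` and `point_labels` arguments of `SamPredictor.predict` in the `segment_anything` package expect.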

Combining the two steps, we get a large number of masks of segmented objects. The final step consists of sorting and selecting the masks using information from the depth map, which allows us to turn the 2D image into a 3D point cloud. For example, we discard objects that are too far away from the camera or that occupy too large a volume in 3D space.
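A minimal sketch of such a filter is shown below. It back-projects the pixels of each mask into 3D with a pinhole camera model and checks distance and extent. The intrinsics `K`, the thresholds `max_distance` and `max_extent`, and the function names are assumptions for illustration, not the values or rules used in the project.

```python
import numpy as np

def mask_to_points(mask, depth, K):
    """Back-project the pixels of a boolean mask into an (N, 3) point cloud."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def keep_mask(mask, depth, K, max_distance=100.0, max_extent=5.0):
    """Keep a mask only if its object is near enough and not too large in 3D."""
    points = mask_to_points(mask, depth, K)
    if len(points) == 0:
        return False
    distance = np.median(points[:, 2])                    # robust distance estimate
    extent = points.max(axis=0) - points.min(axis=0)      # axis-aligned box size in metres
    return bool(distance <= max_distance and extent.max() <= max_extent)

# Assumed toy camera and depth map (metres).
K = np.array([[500.0, 0.0, 50.0],
              [0.0, 500.0, 50.0],
              [0.0, 0.0, 1.0]])
depth = np.full((100, 100), 40.0)
depth[40:50, 40:50] = 20.0     # small object nearby
depth[10:20, 10:20] = 150.0    # small object far away

near_mask = np.zeros((100, 100), dtype=bool)
near_mask[40:50, 40:50] = True
far_mask = np.zeros((100, 100), dtype=bool)
far_mask[10:20, 10:20] = True
huge_mask = np.ones((100, 100), dtype=bool)   # e.g. a mask covering the whole background

for name, m in [("near", near_mask), ("far", far_mask), ("huge", huge_mask)]:
    print(name, keep_mask(m, depth, K))
```

The remaining masks could then be sorted, for instance by their median distance, before further processing.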
