We make use of two main tools: monocular depth estimation and the Segment Anything Model (SAM).
1. Monocular depth estimation
For depth estimation, we use MonoViT, a self-supervised model built on a vision transformer backbone.
The original model was trained on the KITTI dataset, which consists of videos recorded from a car; to make the model better suited to our setting, we fine-tune it on the unannotated data from Digitale Schiene Deutschland.
The fine-tuning trains two models jointly: a depth network, which estimates a depth map from a single RGB image, and a pose network (PoseNet), which estimates the 3D transformation between two frames of a video. Given two (usually consecutive) frames A and B, we back-project frame A into a 3D point cloud using its estimated depth map, transform that cloud by the estimated relative pose, and reproject it into the view of frame B, yielding a reconstruction of frame B. Both networks are then trained to minimize this image reconstruction error.
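The geometric core of this training signal can be sketched as follows. This is a minimal numpy illustration of the back-project/reproject step, not the actual MonoViT training code: it assumes a simple pinhole camera with a hypothetical intrinsics matrix `K`, and the function names (`backproject`, `reproject`) are ours for illustration.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel of a depth map to a 3D point (pinhole camera model)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays
    return rays * depth.reshape(-1, 1)       # scale rays by depth -> 3D points

def reproject(points, T, K):
    """Transform 3D points by a 4x4 pose T and project into the other view."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ T.T)[:, :3]               # points in frame B's camera frame
    proj = cam @ K.T
    return proj[:, :2] / proj[:, 2:3]        # pixel coordinates in frame B

# Toy example with assumed values: identity pose and constant depth,
# so every pixel should reproject onto itself.
K = np.array([[100.0, 0, 16], [0, 100.0, 16], [0, 0, 1]])
depth_a = np.full((32, 32), 5.0)             # estimated depth for frame A
T_ab = np.eye(4)                             # estimated pose from A to B
pix_b = reproject(backproject(depth_a, K), T_ab, K)
# In training, frame B is sampled at pix_b and compared pixel-wise to
# frame A; that photometric error is the reconstruction loss being minimized.
```

With the depth and pose networks plugged in, `depth_a` and `T_ab` come from their predictions, and the gradient of the photometric loss flows back through both.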
The model's performance is visibly improved after fine-tuning.
Before: