Detecting subtle events in videos with YOLO-in-time

Ma Li (PhD), May 23rd, 2025

In this talk, we present a general-purpose model architecture for detecting and counting subtle events in videos. It combines a convolutional neural network (CNN), a recurrent neural network (RNN), and a YOLO-type head in the time domain. The model can be trained end-to-end with time-resolved labels only, without the need for customization or domain-specific knowledge. We discuss possible use cases and show experimental results on both synthetic and real-world datasets.
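
To give a rough idea of what a YOLO-type head in the time domain might output, the sketch below decodes per-cell predictions into event timestamps and a count. It is an illustration by analogy with YOLO's spatial grid cells, not the architecture presented in the talk: the function name `decode_events`, the (confidence, offset) output format, the cell duration, and the threshold are all assumptions made for this example.

```python
from typing import List, Tuple


def decode_events(
    cells: List[Tuple[float, float]],
    cell_duration: float,
    threshold: float = 0.5,
) -> List[float]:
    """Turn per-cell (confidence, offset) predictions into event timestamps.

    Assumed format (hypothetical): cell i covers the time window
    [i * cell_duration, (i + 1) * cell_duration), and offset in [0, 1)
    locates the event inside that window.
    """
    events = []
    for index, (confidence, offset) in enumerate(cells):
        # Keep only cells whose confidence clears the threshold.
        if confidence >= threshold:
            events.append((index + offset) * cell_duration)
    return events


# Hypothetical head output for a 2-second clip split into four 0.5 s cells.
predictions = [(0.1, 0.3), (0.9, 0.5), (0.2, 0.0), (0.8, 0.25)]
event_times = decode_events(predictions, cell_duration=0.5)
event_count = len(event_times)
print(event_times, event_count)
```

Under these assumptions, counting events reduces to counting the cells that pass the confidence threshold, and the time-resolved labels mentioned above would supply the training targets for each cell.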