Beat Tracking with Deep Neural Networks

Julius Richter

An audio mixer.

This is the last post in the three part series covering machine learning approaches for time series and sequence modeling. In the first post, the basic principles and techniques for serial sequences in artificial neural networks were shown. The second post introduced a recent convolutional approach for time series called temporal convolutional network (TCN), which shows great performance on sequence-to-sequence tasks (Bai, 2018).

In this post, however, I will talk about a real world application which employs a machine learning model for time series analysis. To this end, I will present a beat tracking algorithm, which is a computational method for extracting the beat positions from audio signals. The presented beat tracking system (Davies, 2019) is based on the TCN architecture which captures the sequential structure of audio input.


Rhythm is indispensable for music and builds its central, hierarchical organizing structure. It represents one of the fundamental dimensions in the perception of music and orders the movement of musical patterns in time. Therefore, rhythm analysis is an essential part of understanding music, and represents an innate cognitive ability of humans even without musical education. Especially, the interpretation of meter is a fundamental aspect of musical intelligence, and thus detecting the beat, defined as the most prominent layer of rhythm, is an important initial step in the computer emulation of human music understanding.


From an overall perspective, the beat tracking system comprises three major stages:

In the first stage, the original audio signal is being preprocessed. It includes different methods such as normalization, transformation and feature extraction. The second stage represents the actual machine learning model, which is optimized based on training data. The learned model takes the preprocessed audio as an input and outputs the beat activation which can be seen as a high-level feature of the musical piece. Finally, in the temporal decoding stage a dynamic Bayesian network (DBN) is used to model bars of various lengths and align the predicted beat positions to the global best solution.

To better understand the signal flow of the beat tracking system, the input signals to each stage are visualized in the following figure. For this purpose, the first verse of the track Yesterday by the Beatles is used as an example.

 Signal flow


The dataset used for training is composed of four public available datasets containing 2,124 audio files with 136,897 beat annotations and has a length of approximately 20 hours. The audio files are raw pulse code modulated (PCM) audio signals stored as WAV files. For the sake of consistency and also to reduce computational complexity, every audio signal is resampled at a sampling rate $$f_s = 44.1 \,\text{kHz}$$ with $$16\text{-bit}$$ resolution and converted to a monaural signal by averaging both stereo channels.

The discrete audio signal $$x(n)$$ is segmented into overlapping frames of $$N = 2048$$ samples, which correspond to a length of $$46.4 \,\text{ms}$$. The frames are sampled every $$10 \,\text{ms}$$, resulting in a frame rate $$f_r = 100 \,\text{fps}$$. A standard Hamming window $$w(n)$$ of the same length is applied to the frames before the short-time Fourier transform (STFT) is used to compute the complex spectrogram

$$X(t,\omega) = \sum_{n = 1}^{N} w(n) \, x(n + t\,h) \, e^{-2 \pi j \omega /N},$$

where $$t$$ refers to as the frame index, $$\omega$$ the frequency bin index, and the hop size $$h=441$$, i.e., the time shift in samples between adjacent frames. The complex spectrogram $$X(t,\omega)$$ is converted to the power spectrogram $$|X(t,\omega)|^2$$ by omitting the phase portion of the spectrogram. The power spectrogram is filtered with a bank of overlapping triangular filters $$F(t,\omega)$$ with 12 bands per octave covering a frequency range of $$30$$ to $$17\,000 \, \text{Hz}$$. To better match the human perception of loudness, a logarithmic representation is chosen,

$$S(t,\omega) = \log \left( |X(t,\omega)|^2 \cdot F(t,\omega)^T + 1 \right).$$

Feature Learning

The feature learning stage itself consists of two main blocks. While the filtered log power spectrum could be passed to the TCN directly, the network first seeks to learn some compact intermediate representation by implementing a preceding convolutional block. To capture the sequential structure, a TCN transforms the intermediate representation $$\mathbf z_{1:T}$$ into a sequence of beat activations $$a_{1:T}$$.

The convolutional block is a set of convolution and max pooling layers and it reduces the dimensionality both in time and frequency. All convolutional layers contain $$16$$ filters, with kernel sizes of $$3\times3$$ for the first two, and $$1\times8$$ for the last layer.

The filters move with stride 1 and no zero padding is used. The intermediate max pooling layers apply pooling only in the frequency direction over $$3$$ frequency bins. A dropout (Tompson, 2015) rate of $$0.1$$ is used and the activation function corresponds to an exponential linear unit (ELU) (Clevert, 2015), defined as

$${ f(x)={\begin{cases}e^{x}-1&{\text{for }}x\leq 0\\x&{\text{for }}x>0.\end{cases}}}.$$

The output of the convolutional block is a 16-dimensional feature vector $$\mathbf z$$ and serves as the input to the TCN. The architecture of the TCN consists of 11 stacked residual blocks, as illustrated in the following figure:

Each residual block contains $$16$$ channels with filters of size $$5$$, and the dilation $$d=2^i$$ is geometrically spaced ranging from $$2^0$$ up to $$2^{10}$$.

Thus, the resulting receptive field is approximately $$81.5\,\text{s}$$ long. Again, a spatial dropout with rate $$0.1$$ and the ELU activation function is used. The output layer of the TCN is a fully-connected linear layer with two units, representing the two classes "beat" and "no beat". In order to normalize the output and to ensure computational stability, the log softmax function

$$\log \text{softmax}(\hat{y}_i) = \log\left(\frac{\exp{(\hat{y}_i)}}{\sum_j \exp(\hat{y}_j)}\right)$$

is used which is added to the last layer of the neural network. Hence, the output $$\hat{\mathbf y}_t = (\hat{y}_{1_t}, \hat{y}_{2_t})^T$$ represents the log-probabilities for the two classes at time $$t$$. The network can thus be trained as a binary classifier with the cross entropy loss function

$$L(\mathbf y, \hat{\mathbf y}) = - w_0 \,y_0 \log(\hat{y}_0)-w_1\,y_1 \log(\hat{y}_1).$$

The "beat'' class is weighted with the factor $$w_1=70$$ to compensate the unbalancedness of the two classes. Finally, the beat activation $$a_t$$ can be calculated by

$$a_t = \exp \left({\hat{y}_{1_t}} \right), \quad \forall \, t= 1,\dots,T$$

which represent the probability of a beat at frame $$t$$.


Generally, the metrical structure of a musical piece builds upon a hierarchy of approximate regular pulses. To exploit this sequential structure, a probabilistic dynamic model is used to result in a more robust estimation of the beat instants and to correctly operate under rhythmic fluctuations, such as ritardando and accelarando or metric modulations.

To this end, a DBN is employed which jointly infers the period and phase of the beat. More precisely, the post-processing stage constitutes a hidden Markov model (HMM), in which a sequence of hidden variables that represent the beat structure of an audio piece is inferred from the sequence of observed beat activations. The probabilistic state space consists of two hidden variables, namely the beat period, i.e. the reciprocal of the tempo, and the position inside the beat period (Krebs, 2015).

Given a sequence of beat activations $$a_{1:T}$$, the goal of the hidden Markov model is to identify the most probable hidden state sequence $$\mathbf s_{1:T}^{\ast}$$. For every time frame $$t$$, the hidden state is defined as $$\mathbf s_t = (\phi_t, \tau_t)^T$$ where $$\phi_t \in \{1, 2, \dots, \tau_t\}$$ denotes the position inside a beat period, and $$\tau_t \in \{ \tau_{\text{min}}, \dots, \tau_{\text{max}}\}$$ refers to the length of the current beat period measured in frames. Due to the principle to use exactly one state per frame to indicate the position inside the beat period, the number of position states corresponds exactly to length of the current beat period $$\tau_t$$. Usually, the tempo $$\Theta$$ of a musical piece is measured in beats per minute (BPM), and therefore needs to be converted to the beat period measured in frames, by

$$\tau(\Theta) = \left\lfloor \frac{\Theta}{60}\, f_r \right\rfloor$$

where the frame rate $$fr$$ is defined as $$100$$ frames per second. The limits of the beat period can be easily obtained as $$\tau_\text{min} = \tau(\Theta_\text{max})$$ and $$\tau_\text{max} =\tau(\Theta_\text{min})$$ respectively. Operating on a tempo range from 55 to 215 BPM results in a beat period ranging from $$\tau_\text{min} = 27$$ to $$\tau_\text{max} = 107$$ frames, and thus constructs a state space with 5427 states in total. As an illustration, a toy example with a significantly smaller state space is shown in the following figure.

 State space

With the aforementioned state space discretization, the most likely hidden state sequence $$(\mathbf s_{1:T}^{\ast} = \left((\phi_t^{\ast}, \tau_t^{\ast})^T\right)_{1:T})$$, given a sequence of beat activations $$a_{1:T}$$, is computed by

$$\mathbf s_{1:T}^{\ast} = \underset{\mathbf s_{1:T}}{\mathrm{argmax}} P(\mathbf s_{1:T}\,|\, a_{1:T})$$


$$P(\mathbf s_{1:T}\,|\, a_{1:T}) \propto P(\mathbf s_1) \prod_{t=2}^T P(\mathbf s_t\,|\,\mathbf s_{t-1})\, P( a_t\,|\,\mathbf s_{t}),$$

where $$P(\mathbf s_1)$$ is the initial state distribution, the term $$P(\mathbf s_t\,|\,\mathbf s_{t-1})$$ corresponds to the transition model, and $$P(a_t\,|\,\mathbf s_{t})$$ refers to the observation model. The most likely hidden state sequence $$\mathbf s_{1:T}^{\ast}$$ can be solved using the well-known Viterbi algorithm (Viterbi, 1967). Finally, the set of beat instants $$\mathcal B$$ can be extracted from the sequence of the position within the beat period as

$$\mathcal B = \{ t: \phi_t^{\ast}=1\}.$$

In order to infer the hidden variables from an audio signal, the three entities, the initial distribution, the transition model, and the observation model, need to be specified. The transition model describes the transitions between the hidden variables. The observation model takes the beat activations from the neural network and initial distribution encodes prior knowledge about the hidden variables.

Initial Distribution

With the initial distribution $$P(\mathbf s_1)$$, any prior knowledge about tempo distributions can be incorporated into the model. For instance, if the music to be tracked from one genre, a specific tempo distribution can be used. To make it genre independent, a uniform distribution is used.

Transition Model

The transition model $$P(\mathbf s_t\,|\,\mathbf s_{t-1})$$ can be further decomposed into a distribution for each of the two hidden variables $$\phi_t$$ and $$\tau_t$$, this is

$$P(\mathbf s_t\,|\,\mathbf s_{t-1}) = P(\mathbf \phi_t\,|\,\mathbf \phi_{t-1}, \tau_{t-1}) \, P(\tau_t \, |\, \tau_{t-1}),$$

where the first factor is defined as

$$P(\mathbf \phi_t\,|\,\mathbf \phi_{t-1}, \tau_{t-1}) = \mathbf 1_A$$

with the indicator function $$\mathbf 1_A$$ that equals one if $$\phi_t = (\phi{t-1}+\tau_{t-1}-1)\mod M+1$$. The modulo operator makes the bar position cyclic. If $$\phi_t = \tau_t$$, the second factor is defined by a Laplace distribution

$$P(\tau_t \, |\, \tau_{t-1}) = \frac{\lambda}{2 \, \tau_{t-1}}\exp \left( -\lambda \left| \frac{\tau_{t}}{\tau_{t-1}} -1 \right| \right),$$

otherwise it is

$$P(\tau_t \, |\, \tau_{t-1}) = \begin{cases} 1, &\tau_t = \tau_{t-1}\\ 0, &\text{else.} \end{cases}$$

The parameter $$\lambda \in \mathbb Z_{\geq0}$$ determines the steepness of the distribution and models the probability for a tempo change. A value of $$\lambda = 0$$ means that transitions to all tempos are equally probable. In practice, for music with roughly constant tempo, a value of $$\lambda \in [1, 300]$$ is reasonable. The probability of tempo change is heuristically set to $$p_\omega = 0.002$$. Higher-level or domain specific knowledge could be used to set this parameter. For example in rock or pop music, the beat is usually quite steady, so a small value for $$p_\omega$$ would be appropriate, while for classical music, particularly styles including many tempo changes, a higher value would be more optimal.

Observation Model

The beat activation function produced by the neural network is limited to the range $$[0, 1]$$ and shows high values at beat positions and low values at non-beat positions. Thus, the activation function is used directly as state-conditional observation distributions (Degara, 2012). The observation likelihood is defined as

$$P(a_t\,|\,\phi_t) = \begin{cases} a_t, &1 \leq \phi_t\leq \frac{\lambda}{\Lambda}\\ \frac{1-a_t}{\lambda-1}, &\text{else}. \end{cases}$$


In order to evaluate the performance of the proposed beat tracking system and comparing it against different methods, standard evaluations methods are used which are common in the music information retrieval (MIR) community. In this post, however, I will only refer to the standard reference for beat tracking evaluation (Davies, 2009).

With the development of the Music Information Retrieval Evaluation eXchange (MIREX) the beat tracking task has been addressed as an annual contest for researchers in this field. Check out the latest results of 2019. The code of the proposed beat tracking system can be found here.