### Temporal convolutional networks for sequence modeling

Updated on January 22nd 2020 by Theory & Algorithms

This blog post is the second in a three-part series covering machine learning approaches for time series. In the first post, I talked about how to deal with serial sequences in artificial neural networks. In particular, *recurrent* models such as the LSTM were presented as an approach to process temporal data in order to analyze or predict future events.

In this post, however, I will present a simple but powerful *convolutional* approach for sequences, called the Temporal Convolutional Network (TCN). The network architecture was proposed in (Bai, 2018) and shows great performance on sequence-to-sequence tasks like machine translation or speech synthesis in text-to-speech (TTS) systems. Before I describe the architectural elements in detail, I will give a short introduction to sequence-to-sequence learning and the background of TCNs.

#### Sequence-to-Sequence Learning

Sequence modeling, or more specifically sequence-to-sequence learning (Seq2Seq), is the task of training models to convert sequences from one domain (e.g. written text) to another domain (e.g. the same text synthesized to audio). In contrast, one speaks of sequence labeling if each element of the input sequence is assigned a label drawn from a fixed alphabet (e.g. part-of-speech tagging).

In general, Seq2Seq models have to deal with input and output sequences of different lengths. Thus, the entire input sequence is required in order to predict the target, e.g., in machine translation where an input sequence in one language is converted into a sequence in another language. In the trivial case, however, where input and output sequences have the same length $$T$$, a sequence model could be described as the function $$f: \mathcal X^T \rightarrow \mathcal Y^T$$, such that

$$\mathbf y_{1:T} = \mathbf y_1, \dots, \mathbf y_T = f(\mathbf x_1,\dots, \mathbf x_T)$$

with input sequence $$\mathbf x_{1:T} := \mathbf x_1, \dots, \mathbf x_T$$, where the vector $$\mathbf x_t \in \mathcal X$$ is the input at time step $$t$$, and the output sequence $$\mathbf y_{1:T}$$ defined analogously.

If the model is causal, $$\mathbf y_t \in \mathcal Y$$ only depends on $$\mathbf x_{1:t}$$ and not on $$\mathbf x_{t+1:T}$$, i.e. there is no leakage of information from the future. This causality constraint is essential for autoregressive modeling, e.g. in word-level or character-level language modeling.
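The causality constraint can be checked empirically: perturb the inputs after time step $$t$$ and verify that the outputs up to $$t$$ stay the same. The following is a minimal sketch in plain Python. The names `running_mean` and `is_causal` are hypothetical helpers introduced only for this illustration, and the running mean is a toy stand-in for a sequence model, not a TCN.

```python
def running_mean(xs):
    """Toy causal sequence model: y_t is the mean of x_1, ..., x_t."""
    ys, total = [], 0.0
    for t, x in enumerate(xs, start=1):
        total += x
        ys.append(total / t)
    return ys


def is_causal(f, xs, t, perturbation=100.0):
    """Return True if outputs up to index t (0-based, inclusive)
    are unaffected by changing the inputs after index t."""
    changed = list(xs[: t + 1]) + [x + perturbation for x in xs[t + 1:]]
    return f(xs)[: t + 1] == f(changed)[: t + 1]


def anti_causal(xs):
    """Counterexample: y_t is the mean of x_t, ..., x_T (uses the future)."""
    return running_mean(xs[::-1])[::-1]


xs = [1.0, 2.0, 3.0, 4.0]
print(is_causal(running_mean, xs, t=1))  # True
print(is_causal(anti_causal, xs, t=1))   # False
```

A check like this only probes one perturbation at one time step, so it can expose leakage but does not prove causality in general.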

#### Background

As I have shown in the previous post, recurrent networks are dedicated sequence models that maintain a vector of hidden activations that are propagated through time (Graves, 2012). They have gained tremendous popularity due to prominent applications to language modeling and machine translation.

Convolutional networks, however, have been applied to sequences for decades (Sejnowski, 1987) as well, and they were prominently used for speech recognition in the 80s and 90s. More recently, convolutional networks were applied to sentence/document classification (Zhang, 2015), machine translation (Kalchbrenner, 2016) and audio synthesis (van den Oord, 2016). The results indicate that convolutional architectures can outperform recurrent networks on tasks such as machine translation and audio synthesis.

In the original TCN paper (Bai, 2018), the authors conduct a systematic evaluation of generic convolutional and recurrent networks for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. They ascertain that a simple convolutional architecture like the TCN outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. Thus, they conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks.

#### Temporal Convolutional Network

In the following, you will learn about the TCN structure and its basic architectural elements. It is inspired by recent convolutional architectures for sequential data and combines simplicity, autoregressive prediction, and very long memory. The TCN is designed from two basic principles:

- The convolutions are causal, meaning that there is no information leakage from future to past.
- The architecture can take a sequence of any length and map it to an output sequence of the same length just as with an RNN.

To achieve the first point, the TCN uses causal convolutions, i.e., convolutions where an output at time *t* is convolved only with elements from time *t* and earlier in the previous layer. To accomplish the second point, the TCN uses a 1D fully-convolutional network architecture, where each hidden layer is the same length as the input layer.
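Both principles can be illustrated with a single-channel causal convolution. The sketch below assumes zero padding of length $$k-1$$ on the left only, which is one common way to keep the output the same length as the input while never reading future elements. The function name `causal_conv1d` is made up for this example and does not refer to any library API.

```python
def causal_conv1d(x, h):
    """Causal 1D convolution: y[t] = sum_i h[i] * x[t - i],
    where x[t] is treated as 0 for t < 0 (left zero padding)."""
    k = len(h)
    padded = [0.0] * (k - 1) + list(x)  # pad on the left only
    return [
        sum(h[i] * padded[t + (k - 1) - i] for i in range(k))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0]
y = causal_conv1d(x, h=[1.0, 1.0])
print(y)                 # [1.0, 3.0, 5.0, 7.0]
print(len(y) == len(x))  # True: output has the same length as the input
```

Each output `y[t]` is built from `x[t]` and `x[t-1]` only, so changing a later input cannot change an earlier output.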

##### Dilated Convolution

Simple causal convolutions have the disadvantage that they can only look back at a history whose size is linear in the depth of the network, i.e. the receptive field grows linearly with every additional layer. To circumvent this, the architecture employs dilated convolutions that enable an exponentially large receptive field. More formally, for an input sequence $$\mathbf x \in \mathbb R^T$$ and a filter $$h:\{ 0, \dots, k-1\} \rightarrow \mathbb R$$, the dilated convolution operation $$H$$ on element $$s$$ of the sequence is defined as

$$H(s) = (\mathbf x *_d h)(s) = \sum_{i=0}^{k-1} h(i) \, \mathbf x_{s-d\cdot i}$$

where $$d = 2^\nu$$ is the dilation factor, with $$\nu$$ the level of the network, and $$k$$ is the filter size. The term $$s-d\cdot i$$ accounts for the direction of the past. Dilation is equivalent to introducing a fixed step between every two adjacent filter taps, as can be seen in the following figure:

Dilated Convolution.
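The dilated convolution defined above translates almost literally into code. This is a minimal single-channel sketch in plain Python, assuming that elements before the start of the sequence count as zero. The function name `dilated_causal_conv1d` is chosen for illustration only.

```python
def dilated_causal_conv1d(x, h, d):
    """Dilated causal convolution: out[s] = sum_i h[i] * x[s - d*i].

    x: input sequence (list of floats)
    h: filter taps of size k
    d: dilation factor (d = 1 gives an ordinary causal convolution)
    Indices s - d*i < 0 are treated as zero (implicit left padding).
    """
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(len(h)):
            j = s - d * i  # step d into the past for every filter tap
            if j >= 0:
                acc += h[i] * x[j]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dilated_causal_conv1d(x, h=[1.0, 1.0], d=1))  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
print(dilated_causal_conv1d(x, h=[1.0, 1.0], d=2))  # [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

With `d=2`, each output combines `x[s]` and `x[s-2]`, so the same two-tap filter covers a wider span of the input.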

Using larger dilation enables an output at the top level to represent a wider range of inputs, thus effectively expanding the receptive field of a CNN. There are two ways to increase the receptive field of a TCN: choosing larger filter sizes $$k$$ and increasing the dilation factor $$d$$, since the effective history of one layer is $$(k-1) \, d$$.
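To make the effect of dilation concrete, the per-layer history $$(k-1) \, d$$ can be summed over a stack of layers. The sketch below assumes one dilated convolution per level (a residual block with two convolutions per level would contribute its term twice) and counts the current time step as part of the receptive field. The helper `receptive_field` is hypothetical.

```python
def receptive_field(k, dilations):
    """Receptive field of stacked causal convolutions with filter size k.

    Each layer with dilation d reaches (k - 1) * d steps further into the
    past; the +1 accounts for the current time step itself.
    Assumes one convolution per entry in `dilations`.
    """
    return 1 + sum((k - 1) * d for d in dilations)


levels = 4
exponential = [2 ** level for level in range(levels)]  # d = 1, 2, 4, 8
undilated = [1] * levels                                # d = 1 everywhere

print(receptive_field(3, exponential))  # 31
print(receptive_field(3, undilated))    # 9
```

With the same depth and filter size, the exponentially dilated stack sees 31 time steps, compared with 9 for the undilated one.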

##### Residual Block

Another architectural element of a TCN is the residual connection. In place of a convolutional layer, TCNs employ a generic residual module. Each residual block contains a branch leading out to a series of transformations $$\mathcal F$$, whose outputs are added to the input $$\mathbf x$$ of the block

$$o = \text{Activation} \big(\mathbf x + \mathcal F(\mathbf x)\big).$$

This effectively allows layers to learn modifications to the identity mapping rather than the entire transformation, which has been shown to benefit deep neural networks (He, 2016). Stabilization becomes especially important for very deep networks, for example when the prediction depends on a large history size ($$> 2^{12}$$) with a high-dimensional input sequence.

A residual block has two layers of dilated causal convolutions and rectified linear units (ReLU) as non-linearities as shown in the following figure:

Residual block.

Weight normalization (Salimans, 2016) is applied to the convolutional filters and a spatial dropout (Srivastava, 2014) is added after each dilated convolution for regularization, meaning that at each training step a whole channel is zeroed out.
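The data flow of a residual block can be sketched in a few lines of plain Python. This is a simplified single-channel illustration under stated assumptions: weight normalization and dropout are omitted, both convolutions share the same dilation, and the input and output have the same width, so the input can be added directly. The names `relu`, `dilated_causal_conv1d` and `residual_block` are chosen for this example only.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def dilated_causal_conv1d(x, h, d):
    """out[s] = sum_i h[i] * x[s - d*i], with out-of-range inputs as zero."""
    return [
        sum(h[i] * x[s - d * i] for i in range(len(h)) if s - d * i >= 0)
        for s in range(len(x))
    ]


def residual_block(x, h1, h2, d):
    """o = ReLU(x + F(x)), where F is two dilated causal convolutions,
    each followed by a ReLU (weight norm and dropout left out)."""
    z = relu(dilated_causal_conv1d(x, h1, d))
    z = relu(dilated_causal_conv1d(z, h2, d))
    return relu([a + b for a, b in zip(x, z)])


x = [1.0, 2.0, 3.0, 4.0]
print(residual_block(x, h1=[1.0, 1.0], h2=[1.0, 0.0], d=1))  # [2.0, 5.0, 8.0, 11.0]
print(residual_block(x, h1=[0.0, 0.0], h2=[0.0, 0.0], d=1))  # [1.0, 2.0, 3.0, 4.0]
```

The second call shows the identity behaviour: when the convolutional branch outputs zeros, the block passes its (non-negative) input through unchanged.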

#### Conclusion

As you learned in this blog post, the TCN model is deliberately kept simple, combining some of the best practices of modern convolutional architectures. Therefore, it can serve as a convenient but powerful starting point when dealing with sequential data.

TCNs can be built to have very long effective history sizes, which means they have the ability to look very far into the past to make a prediction. To this end, a combination of very deep networks augmented with residual layers and dilated convolutions is deployed.

The TCN architecture not only appears to be more accurate than canonical recurrent networks such as LSTMs and GRUs, it also has the following properties:

- **Parallelism:** Unlike in RNNs where the predictions for later time steps must wait for their predecessors to complete, convolutions can be calculated in parallel because the same filter is used in each layer. Therefore, in both training and evaluation, a long input sequence can be processed as a whole, instead of sequentially as in RNNs.
- **Flexible receptive field size:** The receptive field size can be changed in multiple ways. For instance, stacking more dilated convolutional layers, using larger dilation factors, or increasing the filter size are all viable options. Thus, TCNs afford better control of the model’s memory size, and are easy to adapt to different domains.
- **Low memory requirement for training:** Especially in the case of a long input sequence, LSTMs and GRUs can easily use up a lot of memory to store the partial results for their multiple cell gates. In TCNs, however, the filters are shared across a layer, with the back-propagation path depending only on the network depth.

TCN implementations for different ML libraries can be found here: