This post is the first part of a series on time series and sequence modeling. To keep the material easy to digest, the series is split into three parts:

- How to deal with time series and serial sequences? A recurrent approach.
- Temporal Convolutional Networks (TCNs) for sequence modeling.
- Beat tracking in audio files as an application of sequence modeling.

#### Introduction

Time underlies many interesting physical behaviors and is certainly one of the most challenging concepts to model. Almost all human activity has temporal extent and reveals a variety of serially ordered action sequences. Consider, for instance, the sequence of human limb movements. Speech, too, involves sequences of events that follow one another in time.

In this post, you will find answers to the following questions:

- How to represent time in artificial neural networks?
- How to process temporal data in order to analyze or predict future events?
- How do recurrent neural networks work?
- What are long-term dependencies and why can LSTM networks deal with them?

#### Motivation

Time series appear in any domain of applied science and engineering which involves temporal measurements, e.g., weather monitoring/forecasting, or measuring annual population data. Temporal sequences also occur in mathematical finance, e.g., the yearly evolution of a stock index.

Evolution of a stock market index.

In business, companies generally want to increase their sales. To this end, data scientists predict future market trends from past data, such as store, promotion, and competitor data. Time series forecasting enables companies to create better schedules that increase efficiency and productivity, and it can also improve employee motivation.

#### Time Representation

By definition, a time series is a series of data points that are listed in time order. Most commonly, consecutive measurements in a time series are equally spaced. Thus, a time series can be regarded as a sequence of discrete-time samples that covers a continuous time interval. Correct ordering within a time series is very important, because the observations depend on time.

When you deal with data that has a temporal extent, one way to represent time is to explicitly associate the serial order with one dimension of the data. Time series data differs from cross-sectional data in that there exists a natural temporal ordering of the observations. It is also distinct from spatial data, where the observations typically relate to geographical locations.

Generally, a distinction is made between *time series analysis* and *time series forecasting*. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting, by contrast, is the use of a model to predict future values based on previously observed values.

Analyzing time series data usually focuses on classification, clustering, or anomaly detection. Time series analysis can be applied to real-valued continuous data, discrete numeric data, or discrete symbolic data, i.e., sequences of characters, such as letters and words in language.

In time series forecasting, historical data is used to forecast the future. A forecast is a prediction that sometimes includes a confidence interval expressing the level of certainty. A stochastic model for time series will generally reflect the fact that observations close together in time are more closely related than observations further apart. However, there might also exist long-term dependencies. In language, for instance, consider the following two examples:

- The **cat**, which already ate, ......... **was** full.
- The **cats**, which already ate, ......... **were** full.

It is clear that the predicate in the sentence depends on whether the subject is singular or plural. To build a model for this case, information from earlier words in the sentence is needed. To get a better insight into how to deal with natural language data, I recommend reading our blog post about natural language processing (NLP).

#### Processing Time Series

In data processing, the serial nature of temporal data appears to be at odds with the parallel nature of digital computation. Therefore, a common approach to processing temporal data is to "parallelize time" by giving it a spatial representation. This basically means that the temporal data is sequentially segmented into vectors of fixed length, where time is *explicitly* represented as one dimension. This approach, however, requires some interface with the world that buffers the input so that it can be presented all at once, and it is not clear that biological systems make use of such shift registers. Furthermore, this approach does not easily distinguish relative temporal position from absolute temporal position. Consider, for instance, a simple time lag, which is represented in the following two sequences.

$$\mathbf a = (0,1,1,1,0,0,0,0), \\
\mathbf b = (0,0,1,1,1,0,0,0).
$$

The vectors **a** and **b** are instances of the same basic pattern, but displaced in space. The geometric interpretation of these vectors makes clear that the two patterns are in fact quite dissimilar and spatially distant. As mentioned before, the spatial representation of time treats time as an explicit part of the input.
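To make the geometric argument concrete, here is a minimal sketch that treats **a** and **b** as points in an 8-dimensional space. The choice of Euclidean distance and cosine similarity as measures is my own assumption for illustration; the argument does not depend on a particular metric.

```python
import numpy as np

a = np.array([0, 1, 1, 1, 0, 0, 0, 0], dtype=float)
b = np.array([0, 0, 1, 1, 1, 0, 0, 0], dtype=float)

# Euclidean distance between the two patterns, seen as points in R^8.
euclidean = np.linalg.norm(a - b)

# Cosine similarity; a value of 1.0 would mean "same direction".
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Undoing the one-step lag recovers a exactly, so the patterns
# only differ in their absolute position.
realigned = np.roll(b, -1)

print(euclidean, cosine, np.array_equal(realigned, a))
```

Although **b** is just **a** delayed by one step, the two vectors overlap in only two of their three active positions, so a purely spatial comparison sees them as clearly different.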

When time is instead represented *implicitly*, by its effects on processing rather than explicitly as in a spatial representation, the relative temporal structure is preserved in the face of absolute temporal displacements (Elman, 1990). For this purpose, the processing system should have dynamic properties that are responsive to temporal sequences. One approach to implementing these properties is to employ recurrent links that provide networks with a dynamic memory (Jordan, 1986).

In machine learning, there are several types of models that can be used for time series analysis and forecasting, e.g., random forests, gradient boosting, or time delay neural networks, in which temporal information is included through a set of delays added to the input, so that the data is represented at different points in time. In the next section, however, I will introduce recurrent neural networks (RNNs). In particular, you will learn how sequences and actions can be learned and performed by this class of artificial neural networks.
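The fixed-length segmentation described above, which is also the kind of delayed input a time delay network consumes, can be sketched as follows. The helper name `sliding_windows`, the window length, and the toy series are assumptions made for illustration only.

```python
import numpy as np

def sliding_windows(series, window):
    """Return an array whose rows are consecutive fixed-length segments."""
    series = np.asarray(series, dtype=float)
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    n = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n)])

series = np.arange(6.0)                 # toy series: 0, 1, 2, 3, 4, 5
lags = sliding_windows(series, window=3)

# Supervised pairs for one-step-ahead forecasting:
# each row of X holds three past values, y holds the value that follows.
X, y = lags[:-1], series[3:]
print(X)
print(y)
```

Any model that accepts fixed-size inputs, such as a random forest or gradient boosting regressor, could then be fitted on `X` and `y`. Note that the window acts as the buffer mentioned earlier: the model never sees anything older than `window` steps.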

#### Recurrent Neural Networks

In deep learning, a recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence (Rumelhart, 1988). Unlike feedforward neural networks, RNNs contain cycles and use an internal state memory *h* to process sequences of inputs. A basic recurrent neural network is described by the propagation equations,

$$\begin{align}
\mathbf h_t &= \sigma (\mathbf U \cdot \mathbf x_t + \mathbf W \cdot \mathbf h_{t-1} + \mathbf b) \\
\mathbf o_t &= \mathbf V \cdot \mathbf h_t + \mathbf c
\end{align}
$$

where the parameters are the bias vectors **b** and **c** along with the weight matrices **U**, **V** and **W** for input-to-hidden, hidden-to-output and hidden-to-hidden connections, respectively. The computational graph and its unfolded version are shown in the following figure:

Computing the gradients involves a forward propagation pass through the unrolled graph followed by a backward propagation pass. The runtime is $$O(T)$$ and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential, i.e., each time step can be computed only after the previous one. Back-propagation applied to the unrolled graph of a recurrent model is called backpropagation through time (BPTT). Recurrent models construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. This gives rise to the vanishing gradient problem, which makes RNNs notoriously difficult to train.
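The propagation equations can be written down almost literally in NumPy. The following is a minimal sketch of the forward pass only, assuming tanh for the nonlinearity $$\sigma$$ and small, arbitrary layer sizes; the function name `rnn_forward` is hypothetical.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c, h0=None):
    """Run the basic RNN over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(W.shape[0]) if h0 is None else h0
    hs, outs = [], []
    for x_t in xs:                          # strictly sequential: O(T) steps
        h = np.tanh(U @ x_t + W @ h + b)    # h_t (tanh for sigma is an assumption)
        hs.append(h)
        outs.append(V @ h + c)              # o_t
    return np.stack(hs), np.stack(outs)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2   # toy sizes (assumptions)
U = rng.normal(scale=0.5, size=(hidden_dim, input_dim))    # input-to-hidden
W = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))   # hidden-to-hidden
V = rng.normal(scale=0.5, size=(output_dim, hidden_dim))   # hidden-to-output
b = np.zeros(hidden_dim)
c = np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))
hs, outs = rnn_forward(xs, U, W, V, b, c)
print(hs.shape, outs.shape)
```

The loop makes the sequential dependency visible: `h` at step *t* cannot be computed before `h` at step *t-1* is known, and the same three matrices are reused at every step.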

To mitigate these difficulties, more elaborate recurrent architectures were developed, such as the long short-term memory (LSTM) (Hochreiter, 1997) and the gated recurrent unit (GRU) (Cho, 2014). These families of architectures have gained tremendous popularity due to prominent applications in language modeling and machine translation. In the next section, I will briefly describe the LSTM architecture and why it keeps gradients from vanishing.
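Before moving on, the vanishing gradient problem of the basic RNN can be illustrated numerically. The sketch below is built on my own assumptions: tanh as the nonlinearity, no bias, toy sizes, and a recurrent matrix **W** rescaled so that its largest singular value is 0.9. It tracks the Jacobian of the hidden state at step *t* with respect to the initial hidden state, which is the product of per-step Jacobians that BPTT has to propagate through.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 50                      # toy sizes (assumptions)

W = rng.normal(size=(hidden_dim, hidden_dim))
W *= 0.9 / np.linalg.norm(W, 2)            # rescale: spectral norm becomes 0.9
U = rng.normal(size=(hidden_dim, hidden_dim))
xs = rng.normal(size=(T, hidden_dim))

h = np.zeros(hidden_dim)
J = np.eye(hidden_dim)                     # accumulates d h_t / d h_0
norms = []
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h)
    # Chain rule through one step: d h_t / d h_{t-1} = diag(1 - h_t^2) W
    J = np.diag(1.0 - h ** 2) @ W @ J
    norms.append(np.linalg.norm(J, 2))

print(norms[0], norms[9], norms[-1])
```

Because the derivative of tanh is at most 1, every step shrinks the Jacobian by at least the factor 0.9 in this setup, so the influence of the initial state decays exponentially with the sequence length. With a larger spectral norm the product could grow instead, which is the related exploding gradient case.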

#### Long Short-Term Memory (LSTM)

The long short-term memory (LSTM) is a special recurrent neural network (RNN) architecture used in the field of deep learning. An LSTM is a very effective solution to the vanishing gradient problem and allows neural networks to capture much longer-range dependencies.

A basic LSTM cell comprises a memory (cell state) that allows the network to accumulate information over a long duration. However, once that information has been used, it might be useful for the network to forget the old state. To this end, the architecture contains three gates that update and control the cell state: the forget gate, the input gate, and the output gate.

Instead of manually deciding when to clear the state, the neural network learns to decide when to do it. The time scale of integration, that is, the temporal receptive field or effective history size, can be changed dynamically by making the weights gated, i.e., controllable by another hidden unit.
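The interplay of the three gates can be sketched as a single LSTM time step. This post does not spell out the LSTM equations, so the code below follows the commonly used formulation (sigmoid gates, a tanh candidate, and an additive cell state update) as an assumption; the function name `lstm_step` and the stacked weight layout are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])              # forget gate: how much old state to keep
    i = sigmoid(z[H:2 * H])          # input gate: how much new content to write
    o = sigmoid(z[2 * H:3 * H])      # output gate: how much state to expose
    g = np.tanh(z[3 * H:4 * H])      # candidate content
    c_t = f * c_prev + i * g         # additive cell state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4                          # toy input and hidden sizes (assumptions)
x_t = rng.normal(size=D)
h_prev = rng.normal(size=H)
c_prev = rng.normal(size=H)

# Randomly initialized cell.
W = rng.normal(scale=0.5, size=(4 * H, H + D))
b = np.zeros(4 * H)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)

# Hand-set cell: forget gate fully open, input gate fully closed.
W_keep = np.zeros((4 * H, H + D))
b_keep = np.zeros(4 * H)
b_keep[0:H] = 20.0                   # f is approximately 1
b_keep[H:2 * H] = -20.0              # i is approximately 0
_, c_keep = lstm_step(x_t, h_prev, c_prev, W_keep, b_keep)
print(np.allclose(c_keep, c_prev, atol=1e-6))
```

The second call shows the intuition behind the memory: when the forget gate is open and the input gate is closed, the cell state passes through the step essentially unchanged, so information (and, in the backward pass, gradient) is not repeatedly squashed. In a trained network, these gate values are of course learned rather than set by hand.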

For an in-depth review of LSTM networks, I recommend reading Colah’s blog. Furthermore, to understand why gradients do not vanish, I recommend reading this blog post.