Inside X-ROCKET: Explaining the explainable ROCKET


Felix Brunner


Welcome to the bridge, pilot! In this second part of our three-part journey, we will take a detailed look at the interior of the X-ROCKET implementation. After setting the stage in part one with an introduction to time series classification and the basics of the ROCKET model, this article provides a tour of the mechanisms needed for explainable embeddings, before part three launches X-ROCKET into a bumpy space race on a real dataset.

The blueprint to explainability

Again, our goal is to add explainability to a potent time series encoder, the ROCKET. One way to achieve this is by tracing each element of the embedding vectors back to its origins and thereby attaching meaning to it. Put differently, if we manage to meaningfully name each embedding element, we effectively transform downstream tasks into tabular problems. With the complexities and nonlinearities of neural networks, this is usually easier said than done. In the case of ROCKET, however, the architecture is shallow enough to shed light on its inner workings with a little bit of engineering and trickery. More precisely, the MiniROCKET of Dempster et al. (2021) will serve as a starting point, to which we add transparency by fully tracing back its encoding mechanisms.

While convolutions do not necessarily need to be implemented in a deep-learning framework, doing so can speed up computation by leveraging GPUs. Accordingly, there already exist good implementations of various ROCKET variants in Python. For example, the original authors’ numpy code is part of the sktime library, and tsai contains a GPU-ready PyTorch version of it. However, while these implementations are already computationally very efficient, our endeavors require a few changes that are more easily achieved after restructuring the model.

Let’s dive deeper into the technical details of the X-ROCKET implementation. As mentioned before, ROCKET architectures resemble very simple CNNs, so why not also structure their implementation like a neural network? That is, let’s treat the steps of the calculation as layer objects and plug them together in line with the ideas behind ROCKET. More precisely, we define a module for each calculation step so that the underlying computational graph becomes easier to understand.

The diagram below schematically presents the full architecture of X-ROCKET. An input time series is fed to several dilation blocks in parallel, each of which consists of a convolutional module, a channel mixing module, and a threshold pooling module. After processing the data sequentially in its submodules, each dilation block outputs a vector of embeddings. Finally, these embeddings are concatenated to form the full X-ROCKET output embedding, which downstream models can pick up to produce a prediction, in our case a classification. Note that the interpretability of the final prediction depends on how explainable the downstream prediction model is. While explainable AI (XAI) is a very active field of research with a whole literature dedicated to making algorithms explainable, we will follow the original authors’ suggestion to use relatively simple prediction heads that are explainable without any additional sophistication.

Full overview of the X-ROCKET architecture.
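To make this modular structure concrete, here is a minimal PyTorch-style sketch of how the pieces could plug together. The class names (DilationBlock, XRocketSketch) and the submodule interfaces are illustrative assumptions for this article, not the actual X-ROCKET API.

```python
import torch
from torch import nn


class DilationBlock(nn.Module):
    """Sketch of one dilation block: convolutions -> channel mixing -> PPV pooling."""

    def __init__(self, convolutions: nn.Module, channel_mixer: nn.Module, ppv_pooling: nn.Module):
        super().__init__()
        self.convolutions = convolutions
        self.channel_mixer = channel_mixer
        self.ppv_pooling = ppv_pooling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, timesteps)
        x = self.convolutions(x)    # scan each channel for fixed patterns
        x = self.channel_mixer(x)   # combine channels up to a chosen order
        return self.ppv_pooling(x)  # collapse the time dimension into features


class XRocketSketch(nn.Module):
    """Sketch of the full encoder: run dilation blocks in parallel, concatenate their outputs."""

    def __init__(self, blocks: list[DilationBlock]):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # each block returns a feature vector; the full embedding is their concatenation
        return torch.cat([block(x) for block in self.blocks], dim=-1)
```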

In what follows, I provide a more detailed look at the various modules that make up X-ROCKET.

ROCKET convolutions

The first step in processing the data is to apply convolutional kernels that scan for fixed patterns in the data. As we are dealing with time series, 1-dimensional kernels are the appropriate choice. The drawing below illustrates how the convolutions are applied. Given a sequence of input data, convolutional kernels are applied by sliding them over the input and summing the element-wise products in the respective window. Effectively, this scans the input for the prevalence of the respective pattern and results in an output that has the same shape as the input. Note how in the image below, the output sequence has large values whenever there is a peak in the input. Conversely, the output is negative where there is a dip in the input. This is because in this example, the input is filtered for the pattern [-1, 2, -1], which has the shape of a spike itself. X-ROCKET uses the same 84 filters of length nine as suggested in Dempster et al. (2021), but in contrast to the original authors, we always pad the inputs to obtain output sequences of identical length. To maintain explainability in this step, it is enough to store the kernel corresponding to each output sequence.

Illustration of a 1D convolution.
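The following toy example reproduces the mechanics of the illustration with PyTorch’s functional 1D convolution. The input values are made up, and the length-three spike filter is taken from the drawing; the actual X-ROCKET kernels are length nine.

```python
import torch
import torch.nn.functional as F

# Made-up input with a peak in the middle and a dip near the end: shape (batch, channels, timesteps).
x = torch.tensor([[[0., 0., 1., 3., 1., 0., 0., -2., 0., 0.]]])

# The spike-shaped filter [-1, 2, -1] from the illustration: shape (out_channels, in_channels, kernel_size).
kernel = torch.tensor([[[-1., 2., -1.]]])

# padding=1 gives "same" padding for a length-three kernel, so the output is as long as the input.
activation = F.conv1d(x, kernel, padding=1)
print(activation)
# The peak in x produces a large positive activation, the dip a negative one.
```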

Channel mixing

When dealing with multivariate time series, that is, time series with multiple channels, one might want to consider correlations of patterns across channels. While the original implementation mainly focuses on the univariate case and suggests naïvely adding together random combinations of ROCKET convolutions, we want to provide a balanced comparison of features. Therefore, X-ROCKET removes the randomness and instead provides the option to expand the feature pool with channel combinations up to a chosen order. As an additional option, channels can be combined multiplicatively instead of additively for a closer resemblance to the concept of a correlation. Explainability in this step is ensured by remembering which channels each mixed output is built from.

Illustration of channel combinations.
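As a minimal sketch of the two combination modes (the tensors here are random stand-ins for convolution outputs, not X-ROCKET internals):

```python
import torch

# Stand-in convolution outputs for two input channels: shape (batch, channels, timesteps).
conv_out = torch.randn(1, 2, 100)
ch_a, ch_b = conv_out[:, 0], conv_out[:, 1]

# Additive combination: activations add up wherever the pattern appears in either channel.
additive = ch_a + ch_b

# Multiplicative combination: closer to the idea of a correlation, since the product is
# only large when the pattern activates both channels at the same time steps.
multiplicative = ch_a * ch_b

# For explainability, each combined output keeps a record of the channel indices it was
# built from, e.g. the tuple (0, 1) for the two combinations above.
```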

PPV threshold pooling

The transformations up to this point have done anything but reduce the size of the data. That is, applying multiple convolutional filters to each channel and adding combinations of the input channels on top of the single-channel convolutional outputs results in a far greater number of equal-length output channels than were originally put in. Therefore, it is time to collapse the time dimension through a pooling mechanism. Following the original paper’s suggestions, X-ROCKET applies proportion-of-positive-values (PPV) pooling. More precisely, the values in each intermediate channel are thresholded at one or more bias values per channel, where the bias values are automatically chosen from representative examples in an initial fitting step. Then, PPV measures the fraction of values along the timeline that surpass the respective threshold. Finally, the resulting percentages directly serve as feature values in the embedding vector. Hence, for explainability, each element in the embedding can be unambiguously linked to a combination of a convolutional kernel, one or more input channels, and a threshold value.

Illustration of proportion-of-positive-values pooling via thresholds.
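In code, PPV pooling boils down to a thresholded mean over the time axis. This sketch uses made-up activations and thresholds; in X-ROCKET, the bias values are fitted from representative examples.

```python
import torch

# Made-up activation sequence for one kernel-channel combination: shape (timesteps,).
activation = torch.randn(100)

# Made-up bias values; X-ROCKET chooses these during the initial fitting step.
thresholds = torch.tensor([-0.5, 0.0, 0.5])

# PPV: for each threshold, the fraction of time steps whose activation exceeds it.
ppv_features = (activation.unsqueeze(0) > thresholds.unsqueeze(1)).float().mean(dim=1)
print(ppv_features)  # three values in [0, 1], one embedding element per threshold
```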

Dilation blocks

With the convolutional kernels spanning only nine observations, the capacity of the model is so far limited to detecting a very narrow set of input characteristics. To change that, multiple dilation values are applied to identical kernels simultaneously to widen their receptive fields. X-ROCKET achieves this in practice by executing the aforementioned sequence of convolution, channel mixing, and PPV thresholding in multiple dilation blocks in parallel. In principle, dilations are a standard procedure in the context of CNNs, but most architectures only use a single value at each step. Having said that, a similar idea has recently shown promise in drastically improving the contextual capabilities of LLMs by enlarging context windows through dilated attention (see Ding et al. (2023)). To better understand how filter dilation works, consider the drawing below. Applying a dilation value spreads the kernel over a longer period of time, thereby scanning lower frequencies for the respective patterns. For example, the resulting activation with a dilation value of two indicates the occurrence of the pattern at half the data frequency. For explainability, it is therefore important to also store the dilation value corresponding to each embedding element.

Illustration of frequency dilations.
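PyTorch’s conv1d exposes dilation directly, so the effect is easy to demonstrate. The input here is random placeholder data; the padding is chosen so that both outputs keep the input length, as X-ROCKET does.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 200)                 # placeholder input: (batch, channels, timesteps)
kernel = torch.tensor([[[-1., 2., -1.]]])  # the same spike-shaped filter as before

# dilation=2 inserts a gap of one time step between the kernel taps, so the filter
# responds to the same pattern occurring at half the data frequency.
# padding = dilation * (kernel_size - 1) // 2 keeps the output length equal to the input length.
dense   = F.conv1d(x, kernel, padding=1, dilation=1)
dilated = F.conv1d(x, kernel, padding=2, dilation=2)
print(dense.shape, dilated.shape)  # both (1, 1, 200)
```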

The full model

Coming back to the full model, we can now put the pieces together. To initialize the encoder, we need to choose a few hyperparameters that determine the exact structure of the model. First, the number of input channels in_channels needs to be specified according to the number of channels in the data. Second, to automatically choose the dilation values to consider, the model requires an upper bound on the width of the convolutional receptive fields, called the max_kernel_span. Typically, X-ROCKET then picks 20–30 distinct frequencies to consider. Next, the combination_order determines how many channels are combined when looking for correlations. By default, this keyword argument is set to 1 for simplicity. Finally, the feature_cap limits the dimensionality of the output, to 10,000 features by default. X-ROCKET then builds the feature pool deterministically, that is, it is careful to include all channel-dilation-kernel combinations. Hence, the resulting number of features needs to be a multiple of the number of possible combinations and is not necessarily close to the specified value. If there is room within the feature cap, multiple thresholds are applied to each channel-dilation-kernel combination in the pooling step to create additional features.
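Putting these hyperparameters together, an initialization could look roughly like the sketch below. The keyword arguments follow the article, but the class name, import path, and example values are placeholders rather than the definitive X-ROCKET API; check the actual code for the exact interface.

```python
from xrocket import XRocket  # placeholder import path, assumed for illustration

# Hypothetical initialization with made-up example values.
encoder = XRocket(
    in_channels=3,         # number of channels in the input time series
    max_kernel_span=512,   # upper bound on the receptive field width, used to pick dilations
    combination_order=1,   # how many channels to combine when looking for correlations (default 1)
    feature_cap=10_000,    # soft limit on the embedding dimensionality (default 10,000)
)
```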

Finally, to turn the embeddings into predictions, the encoder needs to be combined with a prediction model. As we are interested in interpretability, explainable models are the suggested choice here. Since the X-ROCKET encoder has effectively turned the problem into a tabular one, many models for tabular data are valid candidates. For example, scikit-learn offers a large selection of insightful algorithms for tabular data. Similarly, gradient boosting algorithms such as XGBoost are high-performance alternatives. Note that standardizing the embedding vectors may be an essential intermediate processing step to ensure the interpretability of some of these prediction algorithms. Finally, with the X-ROCKET code living in the PyTorch framework, it is also easy to combine the encoder with a deep feed-forward neural network. However, anything beyond a single linear layer might again be difficult to interpret in this case.
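As one concrete option along these lines, a standardized linear model from scikit-learn can sit on top of the embeddings; its coefficients then map one-to-one onto the named embedding elements. The embeddings and labels below are random stand-ins for an already-encoded dataset.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: one row per time series, one column per named embedding element.
embeddings = np.random.rand(100, 9996)
labels = np.random.randint(0, 2, size=100)

# Ridge classification on standardized features, in the spirit of the original ROCKET papers.
clf = make_pipeline(StandardScaler(), RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)))
clf.fit(embeddings, labels)

# Each coefficient corresponds to one named embedding element, e.g. a specific
# kernel-channel-dilation-threshold combination.
coefficients = clf[-1].coef_.ravel()
```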

In the next and final part, I will show a simple usage example of the X-ROCKET implementation that also illustrates what kind of insight one can derive from X-ROCKET besides pure predictive performance.

References

  • Dempster, A., Schmidt, D. F., & Webb, G. I. (2021, August). MiniRocket: A very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (pp. 248–257).

  • Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., & Wei, F. (2023). LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.

  • Drawings were created in excalidraw.

This article was created within the “AI-gent3D — AI-supported, generative 3D-Printing” project, funded by the German Federal Ministry of Education and Research (BMBF) with the funding reference 02P20A501 under the coordination of PTKA Karlsruhe.