Classification of Crop Fields through Satellite Image Time Series

Tiago Sanona

Crop fields can be classified with machine learning

The field of remote sensing has been benefiting from advancements in Machine Learning (ML). In this article we explore a state-of-the-art model architecture, the Transformer, which was initially developed for Natural Language Processing (NLP) problems and is now widely used with many forms of sequential data. Following the paper by Garnot et al., we use an altered version of this architecture to classify crop fields from time series of satellite images. With it, we achieve better results than traditional methods (e.g. random forests) while using fewer resources than recurrent networks.

Crop Fields and Satellite Images

The first challenge consists of combining data from different sources and then shaping it into a format that can be consumed by the ML model. This is needed because the model ingests time series of randomly selected pixels belonging to parcels, which are then used to infer the type of crop grown in each parcel.

The Common Agricultural Policy (CAP) in the EU requires subsidized farmers to declare their cultivations yearly. These declarations state which species were planted, as well as the coordinates of the contour of the parcel that contains each crop. This is the first essential piece of data: it provides the labels for the classification problem and the boundaries used to select the satellite data related to them.

The second part of the data is obtained from satellite images. For this project we opted to use data from Sentinel-2. Its images are composed of 13 spectral bands, of which we used 10 (2-8, 8A, 11, 12). The other bands are used to measure aerosol and water vapor concentrations in the atmosphere and are therefore not very useful for identifying variations in reflectance at ground level. The bands we used differ in resolution, so we decided to convert the 10m resolution images to 20m. Finally, the observed values were normalized by channel.
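The preprocessing described above can be sketched with NumPy. This is a minimal illustration, not our production pipeline: the band names follow the usual Sentinel-2 naming, while the 2x2 block averaging used for downsampling and the z-score normalization are assumptions, since the text only says that 10m images were converted to 20m and that values were normalized by channel. The helper names are hypothetical.

```python
import numpy as np

# The 10 bands kept in the article (2-8, 8A, 11, 12), using Sentinel-2 naming.
KEPT_BANDS = ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"]
# Kept bands that are delivered at 10 m resolution; the others are at 20 m.
BANDS_10M = {"B02", "B03", "B04", "B08"}


def downsample_10m_to_20m(band):
    """Average each 2x2 block of a 10 m band to obtain a 20 m grid (assumed method)."""
    h2, w2 = band.shape[0] // 2, band.shape[1] // 2
    return band[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).mean(axis=(1, 3))


def normalize_per_channel(stack):
    """Standardize a (channels, H, W) stack channel by channel (assumed z-score)."""
    mean = stack.mean(axis=(1, 2), keepdims=True)
    std = stack.std(axis=(1, 2), keepdims=True)
    return (stack - mean) / (std + 1e-8)


def build_stack(bands):
    """bands: dict mapping band name to a 2D array; returns a (10, H, W) stack at 20 m."""
    layers = []
    for name in KEPT_BANDS:
        arr = np.asarray(bands[name], dtype=np.float64)
        if name in BANDS_10M:
            arr = downsample_10m_to_20m(arr)
        layers.append(arr)
    return normalize_per_channel(np.stack(layers))


# Toy usage with random values standing in for reflectances.
rng = np.random.default_rng(0)
toy_bands = {n: rng.random((8, 8) if n in BANDS_10M else (4, 4)) for n in KEPT_BANDS}
stack = build_stack(toy_bands)  # shape (10, 4, 4)
```

In a real pipeline the normalization statistics would more likely be computed once over the whole training set rather than per image.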

We then combined these two data sources by clipping the original images with the parcel boundaries provided with the labels. This way we were able to map pixels to crop labels.
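To illustrate the clipping step, the sketch below builds a boolean mask of the pixels whose centres fall inside a parcel polygon and uses it to extract the parcel's pixels from an image time series. It assumes the polygon has already been converted to pixel coordinates; in practice a geospatial library would handle coordinate reference systems and rasterization. The function names and the (T, C, H, W) layout are hypothetical.

```python
import numpy as np


def polygon_mask(polygon, height, width):
    """Boolean (height, width) mask of pixel centres inside a polygon.

    polygon: list of (x, y) vertices in pixel coordinates.
    Uses the even-odd ray casting rule.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5  # pixel centres
    inside = np.zeros((height, width), dtype=bool)
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        # Does a ray going right from the pixel centre cross this edge?
        crosses = (y1 > py) != (y2 > py)
        x_at_y = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < x_at_y)
    return inside


def clip_parcel(series, polygon):
    """series: (T, C, H, W) image time series; returns (T, C, N) parcel pixels."""
    mask = polygon_mask(polygon, series.shape[-2], series.shape[-1])
    return series[:, :, mask]


# Toy usage: a square parcel covering the four central pixels of a 4x4 image.
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
series = np.arange(2 * 3 * 4 * 4, dtype=np.float64).reshape(2, 3, 4, 4)
parcel_pixels = clip_parcel(series, square)  # shape (2, 3, 4)
```

Each parcel then becomes a set of N pixels observed at T dates, paired with the crop label from the farmer's declaration.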

Classification of Time Series

In this section we briefly explain the architecture of the PSE-TAE model as proposed by Garnot et al. and point out where our implementation differs.

1. Pixel-Set Encoder

As images collected from satellites suffer from low resolution (10-20m), common convolutional methods would not be very useful, because textural information from the crops would not be discernible. In addition, parcels come in a large variety of shapes and sizes, which makes it difficult to structure the data in a consistent way. For these reasons the authors proposed a so-called pixel-set encoder (PSE) architecture.

The idea behind the PSE is to sample a different set of pixels from the satellite images at each training step and pass it through a series of fully connected layers, batch normalizations and ReLUs that together constitute a multi-layer perceptron (MLP). The resulting model is capable of embedding each of those multi-channel images of different sizes into a tensor of constant dimension.
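A rough NumPy sketch of this idea follows. It is not our training code: batch normalization is omitted, the weights are random, and the layer sizes are hypothetical. Pooling the per-pixel features by mean and standard deviation before a second MLP follows our reading of Garnot et al.; the text above does not spell out this step, so treat it as an assumption.

```python
import numpy as np


def sample_pixels(parcel, set_size, rng):
    """parcel: (T, C, N) pixels of one parcel. Draw set_size pixels, shared across dates.

    Sampling is done with replacement only when the parcel has too few pixels.
    """
    n = parcel.shape[-1]
    idx = rng.choice(n, size=set_size, replace=n < set_size)
    return parcel[:, :, idx]


def init_mlp(sizes, rng):
    """Random (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o)) for i, o in zip(sizes[:-1], sizes[1:])]


def mlp(x, layers):
    """Linear layer followed by ReLU, applied along the last axis (no batch norm)."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x


def pixel_set_encode(pixel_set, mlp1, mlp2):
    """pixel_set: (T, C, S) sampled pixels; returns one (T, E) embedding per date."""
    x = np.transpose(pixel_set, (0, 2, 1))   # (T, S, C): one row per pixel
    x = mlp(x, mlp1)                         # (T, S, H): per-pixel features
    pooled = np.concatenate([x.mean(axis=1), x.std(axis=1)], axis=-1)  # (T, 2H)
    return mlp(pooled, mlp2)                 # (T, E)


rng = np.random.default_rng(0)
mlp1 = init_mlp([10, 32, 64], rng)   # 10 spectral channels in, 64 features out
mlp2 = init_mlp([128, 128], rng)     # mean and std concatenated: 2 * 64 = 128

small_parcel = rng.random((24, 10, 5))    # 24 dates, 10 bands, 5 pixels
large_parcel = rng.random((24, 10, 200))  # same dates and bands, 200 pixels
emb_small = pixel_set_encode(sample_pixels(small_parcel, 64, rng), mlp1, mlp2)
emb_large = pixel_set_encode(sample_pixels(large_parcel, 64, rng), mlp1, mlp2)
```

Both parcels end up with an embedding of the same shape, regardless of how many pixels they contain.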

In the paper, the authors also include a vector of pre-computed geometric features as input to the PSE, arguing that performance decreased in experiments without it. In our implementation we decided to skip this step for now.

2. Temporal Attention Encoder (TAE)

Now the problem is reduced to finding an embedding for each parcel's time series. To this end the authors opted to use a state-of-the-art technique for dealing with sequential information. They propose an architecture based on the Transformer from Vaswani et al. with the following alterations:

- The inputs are the embeddings generated by the PSE, which is trained at the same time as the attention mechanism, as opposed to the pre-trained word embeddings used in the original model.

- The positional encoder takes into account the number of days since a set date (the first observation) rather than the index of the observation, which helps to account for inconsistent temporal sampling of the data. Also, because the sequences are shorter, the position is divided by $$1000^{2i/d}$$ instead of the original $$10000^{2i/d}$$, where $$d$$ is the dimension of the embeddings.

- The query tensors produced in each attention head are pooled into a single master query, as the objective of the network is to classify an entire time series rather than to produce an output for each element of the sequence.
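The date-based positional encoding from the second point can be sketched as follows. The sine/cosine interleaving follows the original Transformer formulation; the function name and the example dates are hypothetical, and $$d$$ is assumed to be even.

```python
import numpy as np


def positional_encoding(days, d, base=1000.0):
    """days: (T,) days since the first observation; returns a (T, d) encoding.

    Even columns hold sin(day / base^(2i/d)), odd columns the matching cosine.
    """
    days = np.asarray(days, dtype=np.float64)[:, None]  # (T, 1)
    i = np.arange(d // 2)                                # index of each sin/cos pair
    angles = days / base ** (2 * i / d)                  # (T, d/2)
    pe = np.empty((days.shape[0], d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


# Irregularly spaced acquisitions, e.g. after dropping cloudy dates.
pe = positional_encoding([0, 5, 35, 40], d=8)  # shape (4, 8)
```

Because the encoding depends on the day offset rather than the position in the sequence, two observations five days apart always differ by the same phase shift, however many acquisitions lie between them.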

 Image 3 - Diagram of TAE. The positional encoding tensor is added to the input embedding (in this case generated by the PSE), and the resulting tensor is passed to the attention heads. Each head computes a query ($$q_h$$) and a key ($$k_h$$) through $$FC_1$$. The queries of each head are then averaged and passed through $$FC_2$$, generating a master query ($$\hat{q}_h$$). The scaled dot-product attention is computed in each head from the master query, the key, and the value tensor, which is the embedding with the positional encoding added. The results are then concatenated and passed through $$MLP_3$$.
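The data flow in the diagram can be written down in a few lines of NumPy. This is a sketch under assumptions: biases and $$MLP_3$$ are left out, the weights are random, and the queries are averaged over the time steps within each head, which is how we read Garnot et al. The argument names `w_q`, `w_k` and `w_fc2` are hypothetical stand-ins for $$FC_1$$ and $$FC_2$$.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def tae_attention(x, w_q, w_k, w_fc2):
    """Master-query attention over one time series.

    x:      (T, d) PSE embeddings with the positional encoding already added
    w_q:    (H, d, d_k) per-head query projection (part of FC_1)
    w_k:    (H, d, d_k) per-head key projection (part of FC_1)
    w_fc2:  (H, d_k, d_k) per-head master-query projection (FC_2)
    Returns the concatenated head outputs (H * d,) and the weights (H, T).
    """
    d_k = w_q.shape[-1]
    q = np.einsum("td,hdk->htk", x, w_q)                # (H, T, d_k)
    k = np.einsum("td,hdk->htk", x, w_k)                # (H, T, d_k)
    q_mean = q.mean(axis=1)                             # (H, d_k), averaged over time
    q_master = np.einsum("hk,hkj->hj", q_mean, w_fc2)   # (H, d_k)
    scores = np.einsum("hk,htk->ht", q_master, k) / np.sqrt(d_k)
    attn = softmax(scores)                              # one weight per date and head
    heads = attn @ x                                    # (H, d): x itself is the value
    return heads.reshape(-1), attn


rng = np.random.default_rng(0)
H, T, d, d_k = 4, 6, 16, 8
x = rng.normal(size=(T, d))
w_q = rng.normal(size=(H, d, d_k))
w_k = rng.normal(size=(H, d, d_k))
w_fc2 = rng.normal(size=(H, d_k, d_k))
out, attn = tae_attention(x, w_q, w_k, w_fc2)  # out: (64,), attn: (4, 6)
```

Each head yields a single weighted average of the sequence instead of one output per date, which is what makes the output size independent of the number of observations.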

3. Decoding into a class

Finally, after all images have been embedded in parallel by the PSE and processed by the TAE, a final decoder MLP produces the class logits.
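As a small illustration of this last step, the sketch below maps a parcel embedding to logits and then to a predicted class. The layer shapes, the weights and the crop names are made up for the example and do not correspond to our trained model.

```python
import numpy as np


def decode(embedding, layers):
    """Decoder MLP: ReLU between layers, no activation after the last one."""
    x = embedding
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x  # class logits


def predict(logits, class_names):
    """Turn logits into probabilities with a softmax and pick the most likely class."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    return class_names[int(probs.argmax())], probs


# Toy usage: a single identity layer, so the embedding is used directly as logits.
class_names = ["wheat", "potato", "maize"]  # hypothetical labels
layers = [(np.eye(3), np.zeros(3))]
logits = decode(np.array([0.1, 2.0, -1.0]), layers)
label, probs = predict(logits, class_names)
```

During training the logits would typically be fed to a cross-entropy loss, so that the PSE, the TAE and the decoder are optimized together.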


The results that we got from our implementation are detailed by metric and crop in Table 1.

 Table 1 - Validation metrics detailed by crop

From this we can see that the difficulty of classifying crop fields with this method varies by crop. For example, the F1 scores are lower for potatoes, which might be explained by the fact that several types of potatoes are often grown in the same parcel, making it harder for the model to learn a consistent way to predict them.
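For readers who want to compute per-crop metrics of this kind on their own predictions, a minimal pure-Python version is sketched below. The labels in the usage example are invented to show the calculation and are not our validation data.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall and F1 for every class that appears in either list."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Invented toy labels: one potato parcel is mistaken for wheat.
y_true = ["wheat", "wheat", "potato", "potato", "maize"]
y_pred = ["wheat", "wheat", "wheat", "potato", "maize"]
metrics = per_class_metrics(y_true, y_pred)
```

In this toy case the single confusion lowers recall for potato and precision for wheat, so both F1 scores drop while maize is unaffected.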

Would you like to read about related projects? Together with GFZ, we have already completed a project on the automatic classification of crop types.