GeoGrapher: An open-source Python library for building object-centric machine learning datasets from remote sensing data
Rustam Antia (PhD)

We’re excited to announce the release of GeoGrapher, an open-source Python library for building object-centric machine learning datasets from remote sensing data. In this blog post, we explore the challenges remote sensing specialists and ML engineers face when constructing such datasets and how GeoGrapher streamlines the process, making it easier to create well-structured, machine-learning-ready datasets.
1. Introduction
Satellite imagery combined with modern machine learning (ML) techniques has transformed remote sensing applications. From environmental monitoring to infrastructure assessment, ML models trained on remote sensing data unlock valuable insights. However, before these models can be built, an essential first step is dataset creation - often a surprisingly complex challenge.
In particular, building object-centric remote sensing datasets - where data is organized around specific geographic objects - presents unique difficulties. Unlike what we might call area-centric datasets, which start with remote sensing images and annotate objects afterward, an object-centric dataset starts with a predefined set of objects and their locations, then constructs a dataset by gathering relevant remote sensing imagery. This approach is particularly useful for applications where the focus is on detecting or analyzing specific types of objects, such as buildings, roads, or mines, rather than performing broad land cover classification.

However, going from a set of objects - which in the simplest case is just a list of latitude/longitude coordinates, but may also be given as segmentation masks or bounding boxes - to a dataset can be surprisingly challenging. This is where GeoGrapher comes in. GeoGrapher is an open-source Python library designed to simplify the process of constructing object-centric datasets from remote sensing data. By tracking spatial relationships between objects and satellite imagery, it automates key steps in dataset preparation, making it easier for ML and remote sensing specialists to focus on model development rather than data wrangling.
In the next section, we discuss the challenges of building an ML dataset from remote sensing data starting from a list of objects and their locations (or, more generally, their bounding boxes or segmentation masks), and how the Python library GeoGrapher solves them.
2. Challenges in constructing object-centric datasets
Example: Building a dataset of sports stadiums
To illustrate the challenges of object-centric dataset creation, let's consider the task of semantic segmentation for sports stadiums. Our goal is to build a dataset in which each satellite image patch is paired with a segmentation mask identifying which pixels correspond to a stadium.
At the core of such a task, we are working with two fundamental types of remote sensing data:
Vector data: Used to represent discrete geographic objects, such as points, lines, or polygons. In this case, each stadium is represented as a polygon, which is a list of latitude/longitude coordinates defining its boundary. These objects are referred to as vector features.
Raster data: Grid-based data where each pixel corresponds to a geographic location. The satellite images we use in the dataset are examples of raster data, where pixel values represent RGB intensities. Beyond optical imagery, rasters can also store other geospatial information such as elevation, humidity, or population density.
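To make the two data types concrete, here is a minimal sketch (using shapely, geopandas, and rasterio rather than GeoGrapher itself) that builds a stadium polygon as a vector feature and opens a raster file; the coordinates and file name are illustrative placeholders.

```python
import geopandas as gpd
import rasterio
from shapely.geometry import Polygon

# Vector data: a stadium footprint as a polygon of (longitude, latitude) pairs.
stadium = Polygon([
    (13.2395, 52.5145),  # hypothetical corner coordinates
    (13.2415, 52.5145),
    (13.2415, 52.5155),
    (13.2395, 52.5155),
])
stadiums = gpd.GeoDataFrame({"name": ["example_stadium"]},
                            geometry=[stadium], crs="EPSG:4326")

# Raster data: a grid of pixel values plus a geotransform and CRS.
with rasterio.open("sentinel2_tile.tif") as src:  # hypothetical file
    band = src.read(1)                            # 2D array of pixel values
    print(src.crs, src.transform, band.shape)
```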
While segmentation is our primary focus, the same dataset could also support other computer vision tasks, such as object detection (bounding boxes around stadiums1). Regardless of the task, constructing the dataset requires precise spatial alignment between the stadium polygons and the satellite imagery to ensure accurate labels.
At first glance, creating a segmentation dataset from a list of stadiums seems straightforward: retrieve satellite images and generate segmentation masks. However, in practice, several challenges arise. If we were to build the dataset manually - without GeoGrapher - the process would involve multiple steps:
Downloading rasters: We need to obtain rasters covering the stadium locations, such as imagery from the European Space Agency's (ESA's) Sentinel-2 satellite mission.
Cutting the dataset: Satellite rasters are often very large. For example, ESA's Sentinel-2 tiles are 10980 by 10980 pixels, covering roughly 100 km by 100 km, and depending on the number of spectral bands a single tile can be up to 1 GB in size. Modern deep learning models are trained on GPUs, and rasters of this size are far too large to fit into GPU memory, so we need to create a dataset of smaller rasters on which we can train our ML model. How exactly to cut the rasters depends on the application. In this example, stadiums are tiny compared to a full 100 km by 100 km tile, so simply cutting each tile into a regular grid of smaller rasters would still leave a strong foreground/background class imbalance. Moreover, in many cases we may not know whether the full tiles contain additional, unlabeled stadiums. These false negatives - stadiums present in the imagery but missing from the labels - could mislead the model during training. To mitigate false negatives, a better approach is to extract targeted cutouts centered around each stadium (sketched in the example after these steps).
Creating labels: To create labels, we need to convert stadium polygons (lists of latitude/longitude coordinates) into segmentation masks. A segmentation mask is a raster image where each pixel indicates whether it belongs to a stadium or not.
For binary segmentation, the mask simply distinguishes stadium (foreground) from non-stadium (background). In multi-class segmentation, we can further differentiate between object types, such as football vs. track and field stadiums.
Ensuring that these masks are accurately aligned with the satellite imagery is crucial, as misalignment could degrade model performance.
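The following sketch makes the cutting and label-creation steps concrete using rasterio and geopandas (not GeoGrapher's API): it extracts a fixed-size cutout centered on a stadium from a large raster and rasterizes the stadium polygons into a matching binary segmentation mask. The file names and patch size are placeholders.

```python
import geopandas as gpd
import rasterio
from rasterio.features import rasterize
from rasterio.windows import Window, transform as window_transform

PATCH = 512  # cutout size in pixels (illustrative)

stadiums = gpd.read_file("stadiums.geojson")       # hypothetical polygon file

with rasterio.open("sentinel2_tile.tif") as src:   # hypothetical large raster
    stadiums = stadiums.to_crs(src.crs)

    # Cutting: a patch centered on the first stadium.
    centroid = stadiums.geometry.iloc[0].centroid
    row, col = src.index(centroid.x, centroid.y)
    window = Window(col - PATCH // 2, row - PATCH // 2, PATCH, PATCH)
    patch = src.read(window=window)
    patch_transform = window_transform(window, src.transform)

    # Labels: burn all stadium polygons into a mask aligned with the patch.
    mask = rasterize(
        ((geom, 1) for geom in stadiums.geometry),
        out_shape=(PATCH, PATCH),
        transform=patch_transform,
        fill=0,
        dtype="uint8",
    )
```

Because the mask is rasterized with the cutout's own geotransform, it stays pixel-aligned with the imagery, which is exactly the alignment requirement described above.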
The challenges
When constructing an object-centric remote sensing dataset, several issues arise. In all cases, the key challenge is managing the spatial relationships between vector features (stadiums) and rasters (satellite images). Specifically, we need to track which vector features are contained in or intersect with each raster to optimize dataset construction.
Downloading rasters: A naive approach downloads a separate raster for each stadium. However, stadiums often cluster together, such as in an Olympic village with multiple sports venues. If each stadium triggers a new raster download, we end up with redundant images, where the same region appears in multiple rasters.
This causes two major problems: inefficiency, as unnecessary downloads increase storage and processing costs, and class imbalance, since stadiums in clusters are overrepresented, skewing the dataset.
To avoid this, we should check each downloaded raster for additional stadiums before downloading new ones. This ensures that each raster captures as many relevant objects as possible, reducing redundancy and maintaining dataset balance.
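Such a coverage check can be sketched with geopandas and rasterio (independently of GeoGrapher): collect the footprints of the rasters already on disk and only request imagery for stadiums not yet covered. The file names below are placeholders, and the sketch ignores practical concerns such as cloud cover or acquisition dates.

```python
import geopandas as gpd
import rasterio
from shapely.geometry import box
from shapely.ops import unary_union

stadiums = gpd.read_file("stadiums.geojson")       # hypothetical inputs
downloaded = ["tile_a.tif", "tile_b.tif"]          # rasters we already have

# Collect the footprint of each downloaded raster in the stadiums' CRS.
footprints = []
for path in downloaded:
    with rasterio.open(path) as src:
        fp = gpd.GeoSeries([box(*src.bounds)], crs=src.crs).to_crs(stadiums.crs)
        footprints.append(fp.iloc[0])

covered = unary_union(footprints)

# Only stadiums that are not already covered trigger a new download.
to_download = stadiums[~stadiums.geometry.within(covered)]
print(f"{len(to_download)} of {len(stadiums)} stadiums still need imagery")
```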


Cutting rasters: If we naively extract a separate cutout for each stadium, clusters of nearby stadiums will result in overrepresentation - stadiums in a cluster will repeatedly appear in multiple cutouts. This skews the dataset, causing class imbalance and making model training less effective.
To avoid this class imbalance, we need to know the containment relations between vector features and rasters.
Generating labels: When converting stadium polygons into segmentation masks, we must ensure that each raster correctly aligns with the relevant vector features. This means we need to determine which stadiums intersect with each raster to generate accurate pixel-wise labels.
Updating the dataset: As new stadiums are added, we need to determine whether existing rasters already cover them or if new data downloads are required. If we naively download new rasters for every additional stadium, we introduce the same redundancy and class imbalance issues caused by clustering.
By tracking which vector features are already contained within existing rasters, we can avoid unnecessary downloads and ensure efficient dataset expansion without bias.
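All three points above boil down to knowing, for every raster, which stadiums it contains and which it merely intersects. Once raster footprints are available as polygons, a geopandas spatial join computes exactly these relations; the input files here are illustrative placeholders.

```python
import geopandas as gpd

# Hypothetical inputs: stadium polygons and raster footprints.
stadiums = gpd.read_file("stadiums.geojson")
rasters = gpd.read_file("raster_footprints.geojson").to_crs(stadiums.crs)

# Which stadiums does each raster fully contain?
contained = gpd.sjoin(rasters, stadiums, how="inner", predicate="contains")

# Which stadiums does each raster intersect (fully or partially)?
intersecting = gpd.sjoin(rasters, stadiums, how="inner", predicate="intersects")

# e.g. list the stadium indices per raster
print(contained.groupby(level=0)["index_right"].apply(list))
```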

Avoiding data leakage: To train a reliable ML model, we must split the dataset into training, validation, and test sets. However, naive raster-based or vector-feature-based splits can lead to data leakage, where parts of the same stadium appear in multiple splits.
This happens because stadiums may intersect multiple rasters, resulting in different views of the same object being included in both training and validation sets. Additionally, overlapping rasters can lead to entire stadiums being duplicated across splits.
To prevent data leakage, we need to track which stadiums are contained within each raster and ensure that all rasters containing the same stadium belong to the same dataset split. This guarantees that the model is evaluated on truly unseen data, leading to more reliable performance metrics.
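One standard way to enforce this constraint - sketched here with plain networkx rather than GeoGrapher's own API - is to connect each stadium to every raster that contains or intersects it and then assign whole connected components of this graph to splits, so no stadium is shared across splits. The relations and the naive round-robin assignment below are purely illustrative.

```python
import networkx as nx

# Hypothetical stadium-raster relations (e.g. derived from a spatial join).
relations = [
    ("stadium_1", "raster_a"),
    ("stadium_2", "raster_a"),   # two stadiums share raster_a
    ("stadium_2", "raster_b"),   # stadium_2 also appears in raster_b
    ("stadium_3", "raster_c"),
]

G = nx.Graph()
G.add_edges_from(relations)

# Each connected component must end up in exactly one split.
components = list(nx.connected_components(G))

splits = {"train": [], "val": [], "test": []}
for i, component in enumerate(components):
    rasters = [n for n in component if n.startswith("raster")]
    target = ["train", "train", "val", "test"][i % 4]  # naive 50/25/25 assignment
    splits[target].extend(rasters)

print(splits)
```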

Tracking which rasters contain which objects is useful beyond dataset creation. After training a model, we might inspect all rasters containing a specific stadium for further analysis, debugging, or updates.
Bipartite Graphs to the rescue!
All key operations in dataset construction rely on tracking which vector features (stadiums) are contained in or intersect with which rasters (satellite images). A natural way to represent these relationships is through a bipartite graph.
What is a Bipartite Graph?
A bipartite graph consists of two distinct sets of nodes connected by edges. In our case:
One set of nodes represents vector features (stadiums).
The other represents rasters (satellite images).
Edges represent either containment (a stadium fully inside a raster) or intersection (a stadium partially overlapping a raster).

Each stadium connects only to rasters, and each raster only to stadiums - never to other rasters or stadiums. Since we track two types of relationships (containment and intersection), the edges can be visualized using different colors (e.g., blue and green in the figure below).
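The sketch below shows how such a bipartite graph can be represented and queried with plain networkx; the node names and relations are made up, and GeoGrapher's internal representation may differ.

```python
import networkx as nx

G = nx.Graph()

# Two node sets: vector features (stadiums) and rasters (satellite images).
G.add_nodes_from(["stadium_1", "stadium_2"], kind="vector")
G.add_nodes_from(["raster_a", "raster_b"], kind="raster")

# Edges carry the type of spatial relationship.
G.add_edge("stadium_1", "raster_a", relation="contains")    # raster_a contains stadium_1
G.add_edge("stadium_2", "raster_a", relation="intersects")  # stadium_2 only overlaps raster_a
G.add_edge("stadium_2", "raster_b", relation="contains")

# Query: which rasters fully contain stadium_2?
rasters = [n for n, data in G["stadium_2"].items() if data["relation"] == "contains"]
print(rasters)  # ['raster_b']
```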
Representing these relationships as a bipartite graph lets us sidestep the problems described above. However, maintaining such a graph by hand is cumbersome, and this is exactly the bookkeeping that the new GeoGrapher library automates.
3. Introducing GeoGrapher
We’re excited to introduce GeoGrapher, an open-source Python library for building object-centric remote sensing ML datasets. It simplifies the process of transforming a set of known objects into a structured, labeled dataset ready for machine learning training and inference, while addressing key challenges related to object clustering, such as artificial class imbalance and redundant data processing.
At its core, GeoGrapher maintains a bipartite graph that tracks containment and intersection relationships between vector features and rasters. This graph-driven approach helps minimize2 excessive clustering, prevent redundant downloads, optimize raster selection, and ensure objects are assigned correctly across dataset splits.
GeoGrapher is designed to be easy to use while offering the flexibility needed for more complex workflows.
Key features:
GeoGrapher simplifies dataset preparation and provides tools for efficient dataset management and analysis:
Downloading remote sensing imagery: Supports most remote sensing data sources via eodag, avoiding redundant downloads by tracking object coverage. It is easily extendable to new sources via a general downloader interface3. (A brief sketch of plain eodag usage follows after this list.)
Customizable cutting of datasets: Provides highly flexible raster-cutting functionality while using graph-based tracking to minimize clustering-related issues, such as class imbalance and unnecessary duplication.
Creating labels from vector features: Converts polygon vector features into segmentation masks for machine learning tasks.
Updating datasets without redundancy: Ensures efficient updates by checking existing raster coverage, preventing unnecessary downloads.
Preventing data leakage: Provides functionality to split datasets while ensuring that objects remain within a single partition (train/validate/test), preventing data leakage and ensuring proper model evaluation.
Graph-based querying: Allows users to retrieve all rasters that contain or intersect a given vector feature (e.g., a stadium) or find all vector features contained in or intersecting a specific raster. This enables flexible dataset inspection, debugging, and analysis.
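For readers unfamiliar with eodag, the sketch below shows roughly what a plain eodag search and download looks like. Provider credentials must be configured separately, the product type, bounding box, and dates are illustrative, and GeoGrapher wraps this behind its own downloader interface (see the introductory notebook for the actual API).

```python
from eodag import EODataAccessGateway

dag = EODataAccessGateway()  # reads provider credentials from the eodag config

# Search for Sentinel-2 L2A products over an (illustrative) bounding box and period.
products = dag.search_all(
    productType="S2_MSI_L2A",
    geom={"lonmin": 13.20, "latmin": 52.49, "lonmax": 13.28, "latmax": 52.53},
    start="2023-06-01",
    end="2023-06-30",
)

# Download everything that was found.
paths = dag.download_all(products)
print(paths)
```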

Built for remote sensing ML workflows
By handling complex spatial relationships automatically, GeoGrapher allows remote sensing specialists and ML engineers to focus on model development rather than dataset wrangling. We’re excited to share GeoGrapher and hope it proves useful for your projects.
GeoGrapher in action
Want to see GeoGrapher in action? Check out our introductory notebook, where we show you how to use GeoGrapher to:
Download Sentinel-2 satellite images for stadiums.
Generate segmentation labels automatically.
Cut the raw images into an ML-ready dataset.
We look forward to seeing how you use GeoGrapher in your projects. Let us know your thoughts, and feel free to contribute or provide feedback!
1 While GeoGrapher supports arbitrary vector features in its bipartite graph, label generation for bounding boxes for object detection is not currently implemented. The challenge is that bounding boxes are typically defined relative to a fixed image grid, whereas satellite rasters are oriented based on the sensor's position and viewing angle at the time of capture. As a result, raster boundaries may not align with the drawn bounding boxes, which makes object detection workflows less straightforward.
2 The minimization is a simple heuristic which seems to work well in practice. An optimal minimization would involve a variant of the NP-complete geometric set cover problem and is intractable. In future releases we might improve the heuristic.
3 Processing requires additional processor implementations tailored to the data source or product type.