GeoGrapher: An open-source Python library for building object-centric machine learning datasets from remote sensing data
Rustam Antia (PhD)

We’re excited to announce the release of GeoGrapher, an open-source Python library for building object-centric machine learning datasets from remote sensing data. In this blog post, we explore the challenges remote sensing specialists and ML engineers face when constructing such datasets and how GeoGrapher streamlines the process, making it easier to create well-structured, machine-learning-ready datasets.
1. Introduction
Satellite imagery combined with modern machine learning (ML) techniques has transformed remote sensing applications. From environmental monitoring to infrastructure assessment, ML models trained on remote sensing data unlock valuable insights. However, before these models can be built, an essential first step is dataset creation - often a surprisingly complex challenge.
In particular, building object-centric remote sensing datasets - where data is organized around specific geographic objects - presents unique difficulties. Unlike what we might call area-centric datasets, which start with remote sensing images and annotate objects afterward, an object-centric dataset starts with a predefined set of objects and their locations, then constructs a dataset by gathering relevant remote sensing imagery. This approach is particularly useful for applications where the focus is on detecting or analyzing specific types of objects, such as buildings, roads, or mines, rather than performing broad land cover classification.

However, going from a set of objects - which in the simplest case is just a list of latitude/longitude coordinates, but may also be given as segmentation masks or bounding boxes - to a dataset can be surprisingly challenging. This is where GeoGrapher comes in. GeoGrapher is an open-source Python library designed to simplify the process of constructing object-centric datasets from remote sensing data. By tracking spatial relationships between objects and satellite imagery, it automates key steps in dataset preparation, making it easier for ML and remote sensing specialists to focus on model development rather than data wrangling.
In the next section, we discuss the challenges of building an ML dataset from remote sensing data starting from a list of objects and their locations (or, more generally, their bounding boxes or segmentation masks), and how the Python library GeoGrapher solves them.
2. Challenges in constructing object-centric datasets
Example: Building a dataset of sports stadiums
To illustrate the challenges of object-centric dataset creation, let's consider the task of semantic segmentation for sports stadiums. Our goal is to build a dataset in which each satellite image patch is paired with a segmentation mask identifying which pixels correspond to a stadium.
At the core of such a task, we are working with two fundamental types of remote sensing data:
Vector data: Used to represent discrete geographic objects, such as points, lines, or polygons. In this case, each stadium is represented as a polygon, which is a list of latitude/longitude coordinates defining its boundary. These objects are referred to as vector features.
Raster data: Grid-based data where each pixel corresponds to a geographic location. The satellite images we use in the dataset are examples of raster data, where pixel values represent RGB intensities. Beyond optical imagery, rasters can also store other geospatial information such as elevation, humidity, or population density.
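To make the two data types concrete, here is a minimal sketch (using shapely, geopandas, and rasterio rather than GeoGrapher itself) that builds a stadium polygon as a vector feature and opens a raster file; the coordinates and file name are illustrative placeholders.

```python
import geopandas as gpd
import rasterio
from shapely.geometry import Polygon

# Vector data: a stadium footprint as a polygon of (longitude, latitude) pairs.
stadium = Polygon([
    (13.2395, 52.5145),  # hypothetical corner coordinates
    (13.2415, 52.5145),
    (13.2415, 52.5155),
    (13.2395, 52.5155),
])
stadiums = gpd.GeoDataFrame({"name": ["example_stadium"]},
                            geometry=[stadium], crs="EPSG:4326")

# Raster data: a grid of pixel values plus a geotransform and CRS.
with rasterio.open("sentinel2_tile.tif") as src:  # hypothetical file
    band = src.read(1)                            # 2D array of pixel values
    print(src.crs, src.transform, band.shape)
```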
While segmentation is our primary focus, the same dataset could also support other computer vision tasks, such as object detection (bounding boxes around stadiums1). Regardless of the task, constructing the dataset requires precise spatial alignment between the stadium polygons and the satellite imagery to ensure accurate labels.
At first glance, creating a segmentation dataset from a list of stadiums seems straightforward: retrieve satellite images and generate segmentation masks. However, in practice, several challenges arise. If we were to build the dataset manually - without GeoGrapher - the process would involve multiple steps:
Downloading rasters: We need to obtain rasters covering the stadium locations, such as imagery from the European Space Agency's (ESA's) Sentinel-2 satellite mission.
Cutting the dataset: Satellite rasters are often very large. For example, ESA's Sentinel-2 tiles are 10980 by 10980 pixels, covering roughly 100 km by 100 km, and depending on the number of spectral bands a single tile can be up to 1 GB in size. Modern deep learning models are trained on GPUs, and rasters of this size are far too large to fit into GPU memory, so we need to create a dataset of smaller rasters on which we can train our ML model. How exactly to cut the rasters depends on the application. In this example, stadiums are tiny compared to a full 100 km by 100 km tile, so simply cutting each tile into a regular grid of smaller rasters would still leave a strong foreground/background class imbalance. Moreover, in many cases we may not know whether the full tiles contain additional, unlabeled stadiums. These false negatives - stadiums present in the imagery but missing from the labels - could mislead the model during training. To mitigate false negatives, a better approach is to extract targeted cutouts centered around each stadium (sketched in the example after these steps).
Creating labels: To create labels, we need to convert stadium polygons (lists of latitude/longitude coordinates) into segmentation masks. A segmentation mask is a raster image where each pixel indicates whether it belongs to a stadium or not.
For binary segmentation, the mask simply distinguishes stadium (foreground) from non-stadium (background). In multi-class segmentation, we can further differentiate between object types, such as football vs. track and field stadiums.
Ensuring that these masks are accurately aligned with the satellite imagery is crucial, as misalignment could degrade model performance.
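The following sketch makes the cutting and label-creation steps concrete using rasterio and geopandas (not GeoGrapher's API): it extracts a fixed-size cutout centered on a stadium from a large raster and rasterizes the stadium polygons into a matching binary segmentation mask. The file names and patch size are placeholders.

```python
import geopandas as gpd
import rasterio
from rasterio.features import rasterize
from rasterio.windows import Window, transform as window_transform

PATCH = 512  # cutout size in pixels (illustrative)

stadiums = gpd.read_file("stadiums.geojson")       # hypothetical polygon file

with rasterio.open("sentinel2_tile.tif") as src:   # hypothetical large raster
    stadiums = stadiums.to_crs(src.crs)

    # Cutting: a patch centered on the first stadium.
    centroid = stadiums.geometry.iloc[0].centroid
    row, col = src.index(centroid.x, centroid.y)
    window = Window(col - PATCH // 2, row - PATCH // 2, PATCH, PATCH)
    patch = src.read(window=window)
    patch_transform = window_transform(window, src.transform)

    # Labels: burn all stadium polygons into a mask aligned with the patch.
    mask = rasterize(
        ((geom, 1) for geom in stadiums.geometry),
        out_shape=(PATCH, PATCH),
        transform=patch_transform,
        fill=0,
        dtype="uint8",
    )
```

Because the mask is rasterized with the cutout's own geotransform, it stays pixel-aligned with the imagery, which is exactly the alignment requirement described above.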
The challenges
When constructing an object-centric remote sensing dataset, several issues arise. In all cases, the key challenge is managing the spatial relationships between vector features (stadiums) and rasters (satellite images). Specifically, we need to track which vector features are contained in or intersect with each raster to optimize dataset construction.
Downloading rasters: A naive approach downloads a separate raster for each stadium. However, stadiums often cluster together, such as in an Olympic village with multiple sports venues. If each stadium triggers a new raster download, we end up with redundant images, where the same region appears in multiple rasters.
This causes two major problems: inefficiency, as unnecessary downloads increase storage and processing costs, and class imbalance, since stadiums in clusters are overrepresented, skewing the dataset.
To avoid this, we should check each downloaded raster for additional stadiums before downloading new ones. This ensures that each raster captures as many relevant objects as possible, reducing redundancy and maintaining dataset balance.
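Such a coverage check can be sketched with geopandas and rasterio (independently of GeoGrapher): collect the footprints of the rasters already on disk and only request imagery for stadiums not yet covered. The file names below are placeholders, and the sketch ignores practical concerns such as cloud cover or acquisition dates.

```python
import geopandas as gpd
import rasterio
from shapely.geometry import box
from shapely.ops import unary_union

stadiums = gpd.read_file("stadiums.geojson")       # hypothetical inputs
downloaded = ["tile_a.tif", "tile_b.tif"]          # rasters we already have

# Collect the footprint of each downloaded raster in the stadiums' CRS.
footprints = []
for path in downloaded:
    with rasterio.open(path) as src:
        fp = gpd.GeoSeries([box(*src.bounds)], crs=src.crs).to_crs(stadiums.crs)
        footprints.append(fp.iloc[0])

covered = unary_union(footprints)

# Only stadiums that are not already covered trigger a new download.
to_download = stadiums[~stadiums.geometry.within(covered)]
print(f"{len(to_download)} of {len(stadiums)} stadiums still need imagery")
```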


Cutting rasters: If we naively extract a separate cutout for each stadium, clusters of nearby stadiums will result in overrepresentation - stadiums in a cluster will repeatedly appear in multiple cutouts. This skews the dataset, causing class imbalance and making model training less effective.
To avoid this class imbalance, we need to know the containment relations between vector features and rasters.
Generating labels: When converting stadium polygons into segmentation masks, we must ensure that each raster correctly aligns with the relevant vector features. This means we need to determine which stadiums intersect with each raster to generate accurate pixel-wise labels.
Updating the dataset: As new stadiums are added, we need to determine whether existing rasters already cover them or if new data downloads are required. If we naively download new rasters for every additional stadium, we introduce the same redundancy and class imbalance issues caused by clustering.
By tracking which vector features are already contained within existing rasters, we can avoid unnecessary downloads and ensure efficient dataset expansion without bias.
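All three points above boil down to knowing, for every raster, which stadiums it contains and which it merely intersects. Once raster footprints are available as polygons, a geopandas spatial join computes exactly these relations; the input files here are illustrative placeholders.

```python
import geopandas as gpd

# Hypothetical inputs: stadium polygons and raster footprints.
stadiums = gpd.read_file("stadiums.geojson")
rasters = gpd.read_file("raster_footprints.geojson").to_crs(stadiums.crs)

# Which stadiums does each raster fully contain?
contained = gpd.sjoin(rasters, stadiums, how="inner", predicate="contains")

# Which stadiums does each raster intersect (fully or partially)?
intersecting = gpd.sjoin(rasters, stadiums, how="inner", predicate="intersects")

# e.g. list the stadium indices per raster
print(contained.groupby(level=0)["index_right"].apply(list))
```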

Avoiding data leakage: To train a reliable ML model, we must split the dataset into training, validation, and test sets. However, naive raster-based or vector-feature-based splits can lead to data leakage, where parts of the same stadium appear in multiple splits.
This happens because stadiums may intersect multiple rasters, resulting in different views of the same object being included in both training and validation sets. Additionally, overlapping rasters can lead to entire stadiums being duplicated across splits.
To prevent data leakage, we need to track which stadiums are contained within each raster and ensure that all rasters containing the same stadium belong to the same dataset split. This guarantees that the model is evaluated on truly unseen data, leading to more reliable performance metrics.
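One standard way to enforce this constraint - sketched here with plain networkx rather than GeoGrapher's own API - is to connect each stadium to every raster that contains or intersects it and then assign whole connected components of this graph to splits, so no stadium is shared across splits. The relations and the naive round-robin assignment below are purely illustrative.

```python
import networkx as nx

# Hypothetical stadium-raster relations (e.g. derived from a spatial join).
relations = [
    ("stadium_1", "raster_a"),
    ("stadium_2", "raster_a"),   # two stadiums share raster_a
    ("stadium_2", "raster_b"),   # stadium_2 also appears in raster_b
    ("stadium_3", "raster_c"),
]

G = nx.Graph()
G.add_edges_from(relations)

# Each connected component must end up in exactly one split.
components = list(nx.connected_components(G))

splits = {"train": [], "val": [], "test": []}
for i, component in enumerate(components):
    rasters = [n for n in component if n.startswith("raster")]
    target = ["train", "train", "val", "test"][i % 4]  # naive 50/25/25 assignment
    splits[target].extend(rasters)

print(splits)
```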

Tracking which rasters contain which objects is useful beyond dataset creation. After training a model, we might inspect all rasters containing a specific stadium for further analysis, debugging, or updates.
Bipartite Graphs to the rescue!
All key operations in dataset construction rely on tracking which vector features (stadiums) are contained in or intersect with which rasters (satellite images). A natural way to represent these relationships is through a bipartite graph.
What is a Bipartite Graph?
A bipartite graph consists of two distinct sets of nodes connected by edges. In our case:
One set of nodes represents vector features (stadiums).
The other represents rasters (satellite images).
Edges represent either containment (a stadium fully inside a raster) or intersection (a stadium partially overlapping a raster).

Each stadium connects only to rasters, and each raster only to stadiums - never to other rasters or stadiums. Since we track two types of relationships (containment and intersection), the edges can be visualized using different colors (e.g., blue and green in the figure below).
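The sketch below shows how such a bipartite graph can be represented and queried with plain networkx; the node names and relations are made up, and GeoGrapher's internal representation may differ.

```python
import networkx as nx

G = nx.Graph()

# Two node sets: vector features (stadiums) and rasters (satellite images).
G.add_nodes_from(["stadium_1", "stadium_2"], kind="vector")
G.add_nodes_from(["raster_a", "raster_b"], kind="raster")

# Edges carry the type of spatial relationship.
G.add_edge("stadium_1", "raster_a", relation="contains")    # raster_a contains stadium_1
G.add_edge("stadium_2", "raster_a", relation="intersects")  # stadium_2 only overlaps raster_a
G.add_edge("stadium_2", "raster_b", relation="contains")

# Query: which rasters fully contain stadium_2?
rasters = [n for n, data in G["stadium_2"].items() if data["relation"] == "contains"]
print(rasters)  # ['raster_b']
```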
Representing these relationships as a bipartite graph lets us sidestep the problems described above. However, maintaining such a graph by hand is cumbersome, and this is exactly the bookkeeping that the new GeoGrapher library automates.
3. Introducing GeoGrapher
We’re excited to introduce GeoGrapher, an open-source Python library for building object-centric remote sensing ML datasets. It simplifies the process of transforming a set of known objects into a structured, labeled dataset ready for machine learning training and inference, while addressing key challenges related to object clustering, such as artificial class imbalance and redundant data processing.
At its core, GeoGrapher maintains a bipartite graph that tracks containment and intersection relationships between vector features and rasters. This graph-driven approach helps minimize2 excessive clustering, prevent redundant downloads, optimize raster selection, and ensure objects are assigned correctly across dataset splits.
GeoGrapher is designed to be easy to use while offering the flexibility needed for more complex workflows.
Key features:
GeoGrapher simplifies dataset preparation and provides tools for efficient dataset management and analysis:
Downloading remote sensing imagery: Supports most remote sensing data sources via eodag, avoiding redundant downloads by tracking object coverage. It is easily extendable to new sources via a general downloader interface3. (A brief sketch of plain eodag usage follows after this list.)
Customizable cutting of datasets: Provides highly flexible raster-cutting functionality while using graph-based tracking to minimize clustering-related issues, such as class imbalance and unnecessary duplication.
Creating labels from vector features: Converts polygon vector features into segmentation masks for machine learning tasks.
Updating datasets without redundancy: Ensures efficient updates by checking existing raster coverage, preventing unnecessary downloads.
Preventing data leakage: Provides functionality to split datasets while ensuring that objects remain within a single partition (train/validate/test), preventing data leakage and ensuring proper model evaluation.
Graph-based querying: Allows users to retrieve all rasters that contain or intersect a given vector feature (e.g., a stadium) or find all vector features contained in or intersecting a specific raster. This enables flexible dataset inspection, debugging, and analysis.
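For readers unfamiliar with eodag, the sketch below shows roughly what a plain eodag search and download looks like. Provider credentials must be configured separately, the product type, bounding box, and dates are illustrative, and GeoGrapher wraps this behind its own downloader interface (see the introductory notebook for the actual API).

```python
from eodag import EODataAccessGateway

dag = EODataAccessGateway()  # reads provider credentials from the eodag config

# Search for Sentinel-2 L2A products over an (illustrative) bounding box and period.
products = dag.search_all(
    productType="S2_MSI_L2A",
    geom={"lonmin": 13.20, "latmin": 52.49, "lonmax": 13.28, "latmax": 52.53},
    start="2023-06-01",
    end="2023-06-30",
)

# Download everything that was found.
paths = dag.download_all(products)
print(paths)
```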

Built for remote sensing ML workflows
By handling complex spatial relationships automatically, GeoGrapher allows remote sensing specialists and ML engineers to focus on model development rather than dataset wrangling. We’re excited to share GeoGrapher and hope it proves useful for your projects.
GeoGrapher in action
Want to see GeoGrapher in action? Check out our introductory notebook, where we show you how to use GeoGrapher to:
Download Sentinel-2 satellite images for stadiums.
Generate segmentation labels automatically.
Cut the raw images into an ML-ready dataset.
We look forward to seeing how you use GeoGrapher in your projects. Let us know your thoughts, and feel free to contribute or provide feedback!
1 While GeoGrapher supports arbitrary vector features in its bipartite graph, label generation for bounding boxes for object detection is not currently implemented. The challenge is that bounding boxes are typically defined relative to a fixed image grid, whereas satellite rasters are oriented based on the sensor's position and viewing angle at the time of capture. As a result, raster boundaries may not align with the drawn bounding boxes, which makes object detection workflows less straightforward.
2 The minimization is a simple heuristic which seems to work well in practice. An optimal minimization would involve a variant of the NP-complete geometric set cover problem and is intractable. In future releases we might improve the heuristic.
3 Processing requires additional processor implementations tailored to the data source or product type.