We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
We are honored to present TerraMind at ICCV 2025, one of the most prestigious conferences in computer vision. Stay tuned for more details!
TerraMind is the first any-to-any generative foundation model for Earth Observation, built by IBM, ESA Φ-lab, and the FAST-EO project. We pre-trained tiny, small, base, and large versions of TerraMind, all open-sourced on Hugging Face. The models are fully integrated into the fine-tuning toolkit TerraTorch, and we provide documentation for TerraMind here.
This repo provides code examples for fine-tuning TerraMind, for using the Thinking-in-Modalities approach, and for any-to-any generation. We refer to Hugging Face and arXiv for more detailed information.

Download or clone this repo and create a new environment with the latest version of TerraTorch.
python -m venv venv # use python 3.10 or higher
source venv/bin/activate
pip install --upgrade pip
pip install terratorch==1.1
pip install jupyter gdown tensorboard # required for notebook examples
pip install diffusers==0.30.0 # required for TerraMind generations
You can verify the setup by running terratorch --help.
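If you want to double-check from Python that the TerraMind backbones are available, the short sketch below builds one through TerraTorch's backbone registry. The build arguments (pretrained, modalities) follow the TerraMind documentation, but treat them as assumptions and check the docs linked above if anything has changed.

```python
# Sanity check: build a TerraMind backbone via the TerraTorch registry.
# The keyword arguments below are assumptions based on the TerraMind docs.
from terratorch.registry import BACKBONE_REGISTRY

model = BACKBONE_REGISTRY.build(
    "terramind_v1_base",            # also available: terramind_v1_tiny / _small / _large
    pretrained=True,                # downloads the pre-trained weights from Hugging Face
    modalities=["S2L2A", "S1GRD"],  # Sentinel-2 L2A (optical) and Sentinel-1 GRD (radar) inputs
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```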
You can fine-tune TerraMind without any code using a Lightning config and TerraTorch:
terratorch fit -c <terramind_config.yaml>
For testing the fine-tuned TerraMind model, run:
terratorch test -c <terramind_config.yaml> --ckpt_path <path/to/your/checkpoint.ckpt>
We provide some config examples for Sen1Floods11, HLS Burn Scars, and Multitemporal Crop:
We use the GenericMultiModalDataModule in the Sen1Floods11 example and the standard GenericNonGeoSegmentationDataModule for the single-modal Burn Scars dataset.
We simplified the dataset folder structure compared to the original datasets. You can either adjust the paths in the config for the original datasets or download the updated version with the code in the notebooks.
The relevant parts of the config are explained in more detail in this notebook example:
If you plan to use TerraMind with multitemporal data, you can use the temporal wrapper provided by TerraTorch; see this example:
We provide an intentionally unfinished notebook for HLS Burn Scars with several TODOs, so you can practice adapting the config and notebook to new datasets.
TerraMind introduces a new Thinking-in-Modalities (TiM) approach, in which other modalities are predicted as intermediate steps.
The fine-tuned encoder then uses both the raw inputs and the generated modalities.
You simply need to add the suffix _tim to the model name and optionally define the TiM modalities:
backbone: terramind_v1_small_tim
backbone_tim_modalities:
- LULC # default TiM modality
We share an example config for TiM fine-tuning here: terramind_v1_base_tim_lulc_sen1floods11.yaml. We refer to our paper for a more detailed explanation of the TiM approach.
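If you prefer to experiment outside of a Lightning config, the same naming convention can be used with the TerraTorch backbone registry. In the sketch below, the tim_modalities argument is an assumption: it mirrors the backbone_tim_modalities key above (TerraTorch forwards backbone_-prefixed config keys to the backbone), so double-check it against the example config.

```python
# Build a Thinking-in-Modalities (TiM) backbone programmatically.
# Assumption: `tim_modalities` corresponds to the `backbone_tim_modalities`
# key shown in the config snippet above.
from terratorch.registry import BACKBONE_REGISTRY

tim_backbone = BACKBONE_REGISTRY.build(
    "terramind_v1_small_tim",   # the _tim suffix enables Thinking-in-Modalities
    pretrained=True,
    modalities=["S2L2A"],       # raw input modality
    tim_modalities=["LULC"],    # modality generated as an intermediate step (default)
)
```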
TerraMind can perform any-to-any generation based on varying combinations of inputs. You can test the generation capabilities with this notebook: terramind_any_to_any_generation.ipynb.
If you are only interested in generating a single modality from another one, terramind_generation.ipynb (Open in Colab) provides a simplified version of the generation code. We provide some example images from the TerraMesh validation split in examples/.
For larger tiles, you can use the tiled inference provided by TerraTorch, which we demonstrate in large_tile_generation.ipynb (Open in Colab).
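To give a rough idea of what the generation notebooks do, here is a minimal sketch that builds a generation model through the TerraTorch registry. The registry name terramind_v1_base_generate, the output_modalities and timesteps arguments, and the dict-shaped output are assumptions on our side; the notebooks above are the authoritative reference.

```python
# Minimal any-to-any generation sketch (names and signatures are assumptions;
# see terramind_generation.ipynb for the reference implementation).
import torch
from terratorch.registry import FULL_MODEL_REGISTRY

generator = FULL_MODEL_REGISTRY.build(
    "terramind_v1_base_generate",          # assumed registry name of the generation model
    pretrained=True,
    modalities=["S2L2A"],                  # input modality
    output_modalities=["S1GRD", "LULC"],   # modalities to generate
    timesteps=10,                          # assumed number of diffusion decoding steps
)

# Dummy Sentinel-2 L2A tile (12 bands, 224x224); use the tiles in examples/
# for meaningful outputs.
s2 = torch.randn(1, 12, 224, 224)
with torch.no_grad():
    generated = generator(s2)              # assumed to return a dict keyed by modality
print({name: out.shape for name, out in generated.items()})
```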
TerraMind uses six tokenizers for pre-training and generation. We provide some example code for using the tokenizers in terramind_tokenizer_reconstruction.ipynb.
Already working with TerraMind? Submit your use case to the TerraMind Blue-Sky Challenge, a bi-monthly award spotlighting the boldest, most imaginative ways of using TerraMind.
If you use TerraMind in your research, please cite the TerraMind paper.
@article{jakubik2025terramind,
title={TerraMind: Large-Scale Generative Multimodality for Earth Observation},
author={Jakubik, Johannes and Yang, Felix and Blumenstiel, Benedikt and Scheurer, Erik and Sedona, Rocco and Maurogiovanni, Stefano and Bosmans, Jente and Dionelis, Nikolaos and Marsocci, Valerio and Kopp, Niklas and others},
journal={IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}
Notes: 3.12e14 FLOP/GPU/sec × 9216 GPU-hours × 3600 sec/hour × 0.3 (assumed utilization) = 3.10542336e+21 FLOP
Size Notes: "The model was pre-trained on 500B tokens from 9M spatiotemporally aligned multimodal samples from the TerraMesh dataset."