We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
FLOPs: 3.11e+21
Notes: 3.12e+14 FLOP/GPU/sec * 9216 GPU-hours * 3600 sec/hour * 0.3 [assumed utilization] = 3.10542336e+21 FLOP
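For reference, a minimal Python sketch of the same back-of-the-envelope calculation; the 0.3 utilization factor is the assumption noted above, and the 32 GPUs x 288 hours split is only implied by the hardware quantity.

```python
# Sketch of the FLOP estimate above (assumptions: 0.3 utilization, A100 dense BF16 peak).
PEAK_FLOPS_PER_GPU = 312e12      # NVIDIA A100 peak dense BF16 throughput, FLOP/s
GPU_HOURS = 32 * 288             # 9216 GPU-hours, as implied by 32 GPUs
SECONDS_PER_HOUR = 3600
ASSUMED_UTILIZATION = 0.3

total_flop = PEAK_FLOPS_PER_GPU * GPU_HOURS * SECONDS_PER_HOUR * ASSUMED_UTILIZATION
print(f"{total_flop:.2e}")       # 3.11e+21
```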
Training Code Accessibility: Models and code have been open-sourced at https://huggingface.co/ibm-esa-geospatial and https://github.com/ibm/terramind ("This repo presents code examples for fine-tuning TerraMind"). License: Apache 2.0 (https://huggingface.co/ibm-esa-geospatial/TerraMind-1.0-base).
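As a hedged illustration (not the authors' official workflow), the published checkpoint can be fetched with the standard huggingface_hub client; fine-tuning itself follows the examples in the https://github.com/ibm/terramind repository and is not shown here.

```python
# Minimal sketch: download the open-sourced TerraMind-1.0-base checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ibm-esa-geospatial/TerraMind-1.0-base")
print(local_dir)  # local path containing the downloaded model files
```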
Hardware: NVIDIA A100
Hardware Quantity: 32
Size Notes: "The model was pre-trained on 500B tokens from 9M spatiotemporally aligned multimodal samples from the TerraMesh dataset."