|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Major-TOM/Core-S2L2A |
|
- Major-TOM/Core-S2L1C |
|
- Major-TOM/Core-S1RTC |
|
tags: |
|
- Earth Observation |
|
- Foundation Model |
|
- Remote Sensing |
|
--- |
|
# TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation |
|
<p align="center"> |
|
<img src="https://i.imgur.com/waxVImv.png" alt="Oryx TerraFM"> |
|
</p> |
|
|
|
[](https://arxiv.org/abs/2506.06281) |
|
[](https://github.com/mbzuai-oryx/TerraFM) |
|
[](#🧠-model-zoo) |
|
|
|
--- |
|
|
|
## 📢 Latest Updates |
|
- **Jun-09-25**: 🚀 Initial release of **TerraFM codebase** and **pretrained models** |
|
- **Jun-09-25**: 📄 Paper released on arXiv: [arxiv link](https://arxiv.org/abs/2506.06281). 🔥🔥 |
|
|
|
--- |
|
|
|
## 🌍 Overview |
|
|
|
**TerraFM** is a scalable foundation model designed for unified processing of multisensor Earth Observation (EO) data. Built on a ViT backbone and trained over **18.7M tiles (~23T pixels)** from Sentinel-1 SAR and Sentinel-2 optical imagery, TerraFM unifies modality-specific inputs using: |
|
|
|
- 🧩 Modality-specific patch embeddings |
|
- 🌀 Adaptive cross-attention fusion |
|
- 🎯 Dual-centering regularization for long-tailed distributions |
|
|
|
TerraFM sets a new benchmark on **GEO-Bench** and **Copernicus-Bench**, demonstrating strong generalization across geographies, modalities, and tasks — including classification, segmentation, and landslide detection. |
|
|
|
--- |
|
|
|
|
|
## 🔬 Key Features |
|
|
|
<p align="center"> |
|
<img src="images/spider_gb.jpg" alt="TerraFM Architecture" width="500"/> |
|
</p> |
|
|
|
- **Multimodal Pretraining**: Uses Sentinel-1 (SAR) and Sentinel-2 (L1C, L2A) as natural augmentations. |
|
- **Large-Scale Dataset**: Trained on 18.7M global tiles from the [Major-TOM](https://huggingface.co/Major-TOM) dataset. |
|
- **Cross-Attention Fusion**: Dynamically aggregates information across sensors at patch level. |
|
- **Dual-Centering**: Mitigates long-tailed land cover bias using ESA WorldCover statistics. |
|
- **Benchmark SOTA**: Outperforms prior FMs (Galileo, Prithvi, DOFA) across multiple EO tasks. |
|
|
|
--- |
|
## 🧱 Architecture |
|
|
|
<p align="center"> |
|
<img src="images/arch.jpg" alt="TerraFM Architecture" width="700"/> |
|
</p> |
|
|
|
Overall architecture of TerraFM. It unifies student-teacher contrastive framework with modality augmentation with cross-attention fusion, and a new dual centering regularization. TerraFM is founded on ViT backbone and is trained on 18.7M globally distributed samples for pre-training and utilizes large-tile inputs for encoding broader spatial context. For illustration, RGB channels from S2-L2A and S2-L1C are selected, and S1 is visualized using a false-color RGB composite. |
|
|
|
--- |
|
## 🧠 Model Zoo |
|
|
|
| Model | Modality | Input Size | Backbone | Link | |
|
|-------|----------|------------|--------|------| |
|
| TerraFM-B | Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C | 224×224 | ViT-Base | [Download](https://huggingface.co/MBZUAI/TerraFM) | |
|
| TerraFM-L | Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C | 224×224 | ViT-Large | [Download](https://huggingface.co/MBZUAI/TerraFM) | |
|
|
|
--- |
|
|
|
## 🛠 Usage |
|
|
|
TerraFM can be used directly via the `terrafm.py` module, which provides standalone implementations of the TerraFM-Base and TerraFM-Large models for easy integration into any codebase. |
|
|
|
```python |
|
from terrafm import terrafm_base, terrafm_large |
|
import torch |
|
|
|
# Simulated input: 1 sample, 12 channels, 224×224 resolution (e.g., Sentinel-2 L2A) |
|
x = torch.randn(1, 12, 224, 224) |
|
|
|
# Load TerraFM-Base model |
|
model = terrafm_base() |
|
|
|
# Load pretrained weights (e.g., TerraFM-B.pth) |
|
state_dict = torch.load("TerraFM-B.pth", map_location="cpu") |
|
msg = model.load_state_dict(state_dict, strict=False) |
|
|
|
# Forward pass |
|
y = model(x) |
|
print(f"Output shape: {y.shape}") |
|
``` |
|
--- |
|
|
|
|
|
## 📊 Results |
|
|
|
### 🔍 k-NN Classification Results |
|
|
|
We evaluate image classification using k-nearest neighbors (kNN) and report Top-1 accuracy for all single-label tasks. For the multilabel BigEarthNet benchmark, we report the F1 score. |
|
|
|
| Model | Backbone | m-EuroSat (100%) | m-EuroSat (1%) | m-BigEarthNet (100%) | m-BigEarthNet (1%) | m-So2Sat (100%) | m-So2Sat (1%) | m-Brick-Kiln (100%) | m-Brick-Kiln (1%) | |
|
|----------------|------------|------------------|----------------|------------------------|--------------------|------------------|----------------|----------------------|--------------------| |
|
| SatMAE | ViT-Base | 84.1 | 34.8 | 50.6 | 29.0 | 36.0 | 23.1 | 86.1 | 73.5 | |
|
| SatMAE++ | ViT-Large | 82.7 | 48.5 | 50.8 | 31.6 | 34.7 | 23.4 | 89.6 | 76.7 | |
|
| CROMA | ViT-Base | 85.6 | 51.3 | 58.8 | 44.7 | 48.8 | 33.8 | 92.6 | 85.1 | |
|
| SoftCon | ViT-Small | 89.8 | 27.2 | 64.7 | 43.3 | 51.1 | 31.4 | 89.2 | 77.8 | |
|
| DOFA | ViT-Base | 82.8 | 49.6 | 49.4 | 29.9 | 41.4 | 29.4 | 88.3 | 78.3 | |
|
| Satlas | Swin-Tiny | 81.7 | 35.8 | 51.9 | 29.6 | 36.6 | 27.1 | 88.2 | 73.0 | |
|
| MMEarth | CNN-atto | 81.7 | 30.0 | 58.3 | 39.6 | 39.8 | 25.1 | 89.4 | 79.7 | |
|
| DeCUR | ViT-Small | 89.0 | 46.6 | 63.8 | 49.6 | 45.8 | 30.9 | 83.7 | 74.2 | |
|
| AnySat | ViT-Base | 82.2 | 47.1 | 54.9 | 33.7 | 39.8 | 29.0 | 85.3 | 72.0 | |
|
| Galileo | ViT-Base | 93.0 | 56.6 | 59.0 | 36.5 | 54.8 | **43.2** | 90.7 | 78.0 | |
|
| Prithvi-2.0 | ViT-Large | 80.2 | 48.0 | 49.4 | 28.8 | 29.5 | 26.1 | 87.9 | 80.6 | |
|
| Copernicus-FM | ViT-Base | 76.0 | 47.4 | 53.8 | 33.3 | 38.4 | 23.3 | 93.0 | 83.2 | |
|
| **TerraFM** | ViT-Base | _94.2_ | _59.3_ | _68.7_ | 49.4 | _55.1_ | _41.6_ | **94.5** | **85.6** | |
|
|**TerraFM**| ViT-Large | **95.1** | **62.1** | **69.4** | **50.6** | **55.9** | 41.1 | _93.0_ | 82.2 | |
|
|
|
|
|
### 🛰 Copernicus-Bench |
|
|
|
Comparison of TerraFM with existing supervised and self-supervised methods on **Copernicus-Bench**. |
|
Metrics include **OA** (Overall Accuracy), **mAP** (mean Average Precision), and **mIoU** (mean Intersection over Union). |
|
|
|
| Dataset | Metric | Supervised | Random | SoftCon | CROMA | DOFA | Copernicus-FM | **TerraFM** | |
|
|----------------|--------|------------|--------|---------|--------|------|----------------|-------------| |
|
| **Backbone** | -- | ViT-B/16 | ViT-B/16 | ViT-B/14 | ViT-B/8 | ViT-B/16 | ViT-B/16 | ViT-B/16 | |
|
| **Cloud-S2** | mIoU | 59.4 | 60.4 | 66.9 | 65.0 | 65.0 | 66.7 | **67.9** | |
|
| **EuroSAT-S1** | OA | 81.5 | 75.4 | 83.6 | 83.9 | 81.7 | 87.2 | **87.8** | |
|
| **EuroSAT-S2** | OA | 97.6 | 92.5 | 96.7 | 97.0 | 97.2 | 97.9 | **99.1** | |
|
| **BigEarthNet-S1** | mAP | 70.6 | 63.8 | **78.7**| 70.8 | 70.5 | 77.9 | 76.9 | |
|
| **BigEarthNet-S2** | mAP | 80.1 | 71.6 | 83.6 | 76.4 | 75.5 | 79.0 | **84.4** | |
|
| **DFC2020-S1** | mIoU | 50.8 | 45.4 | 52.8 | 52.7 | 49.7 | 52.4 | **55.4** | |
|
| **DFC2020-S2** | mIoU | 66.2 | 62.3 | 64.1 | **66.5**| 61.8 | 64.5 | 63.8 | |
|
| **LCZ-S2** | OA | 85.3 | 77.4 | 83.6 | 84.1 | 83.0 | 84.4 | **87.0** | |
|
|
|
### 🧪 GEO-Bench Performance |
|
|
|
Performance comparison on GEO-Bench for both **classification** (Top-1 Accuracy), **segmentation** (mIoU), and **F1 score** (for m-BigEarthNet). |
|
TerraFM achieves state-of-the-art results across multiple datasets, outperforming previous foundation models. |
|
|
|
| Method | Backbone | m-EuroSat | m-BigEarthNet | m-So2Sat | m-Brick-Kiln | m-Cashew-Plant | m-SA-Crop-Type | |
|
|--------------|------------|-----------|----------------|----------|----------------|------------------|------------------| |
|
| SatMAE | ViT-Large | 96.6 | 68.3 | 57.2 | 98.4 | 30.8 | 24.8 | |
|
| SatMAE++ | ViT-Large | 96.5 | 67.9 | 56.0 | 98.6 | 29.6 | 25.7 | |
|
| CROMA | ViT-Large | 96.6 | 71.9 | 60.6 | 98.7 | 31.8 | 32.0 | |
|
| SoftCon | ViT-Base | 97.5 | 70.3 | 61.7 | 98.7 | 29.6 | 30.8 | |
|
| DOFA | ViT-Large | 96.9 | 68.0 | 58.7 | 98.6 | 27.7 | 25.4 | |
|
| Satlas | Swin-Base | 97.5 | 72.8 | 61.9 | **98.9** | 25.1 | 23.4 | |
|
| MMEarth | CNN-atto | 95.7 | 70.0 | 57.2 | 98.9 | 24.2 | 22.2 | |
|
| DeCUR | ViT-Small | 97.9 | 70.9 | 61.7 | 98.7 | 26.2 | 21.5 | |
|
| Prithvi 2.0 | ViT-Large | 96.5 | 69.0 | 54.6 | 98.6 | 26.7 | 22.9 | |
|
| AnySat | ViT-Base | 95.9 | 70.3 | 51.8 | 98.6 | 26.1 | 27.1 | |
|
| Galileo | ViT-Base | 97.7 | 70.7 | 63.3 | 98.7 | 33.0 | 30.1 | |
|
| **TerraFM** | ViT-Base | *98.1* | 72.6 | *64.9* | 98.7 | *34.1* | *33.0* | |
|
| **TerraFM** | ViT-Large | **98.6** | **73.1** | **66.6** | **99.0** | **37.2** | **34.5** | |
|
|
|
|
|
### 🌋 Landslide Detection (Landslide4Sense) |
|
|
|
Landslide detection performance on the **Landslide4Sense** test set. |
|
Despite having significantly fewer parameters (120M vs. 300M), **TerraFM** achieves higher overall segmentation performance, especially for landslide regions. |
|
| Model | mIoU | IoU (Landslide) | |
|
|------------------------|------|-----------------| |
|
| Prithvi-EO-2.0 (300M) | 65.0 | 31.5 | |
|
| **TerraFM (120M)** | **70.8** | **43.1** | |
|
|
|
<p align="center"> |
|
<img src="images/ls4s_qual.jpg" alt="Landslide Detection" width="700"/> |
|
</p> |
|
--- |
|
|
|
## 📜 Citation |
|
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows: |
|
```bibtex |
|
@article{danish2025terrafmscalablefoundationmodel, |
|
title={TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation}, |
|
author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Muhammad Haris Khan and Rao Muhammad Anwer and Jorma Laaksonen and Fahad Shahbaz Khan and Salman Khan}, |
|
year={2025}, |
|
eprint={2506.06281}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2506.06281}, |
|
} |
|
``` |
|
|
|
|
|
|
|
|
|
## 📨 Contact |
|
If you have any questions, please create an issue on this repository or contact at [email protected]. |