---
license: apache-2.0
datasets:
- Major-TOM/Core-S2L2A
- Major-TOM/Core-S2L1C
- Major-TOM/Core-S1RTC
tags:
- Earth Observation
- Foundation Model
- Remote Sensing
---
# TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
<p align="center">
<img src="https://i.imgur.com/waxVImv.png" alt="Oryx TerraFM">
</p>
[Paper (arXiv)](https://arxiv.org/abs/2506.06281)
[Code (GitHub)](https://github.com/mbzuai-oryx/TerraFM)
[Model Zoo](#🧠-model-zoo)
---
## 📢 Latest Updates
- **Jun-09-25**: 🚀 Initial release of **TerraFM codebase** and **pretrained models**
- **Jun-09-25**: 📄 Paper released on arXiv: [arxiv link](https://arxiv.org/abs/2506.06281). 🔥🔥
---
## 🌍 Overview
**TerraFM** is a scalable foundation model designed for unified processing of multisensor Earth Observation (EO) data. Built on a ViT backbone and trained over **18.7M tiles (~23T pixels)** from Sentinel-1 SAR and Sentinel-2 optical imagery, TerraFM unifies modality-specific inputs using:
- 🧩 Modality-specific patch embeddings
- 🌀 Adaptive cross-attention fusion
- 🎯 Dual-centering regularization for long-tailed distributions
TerraFM achieves state-of-the-art results on **GEO-Bench** and **Copernicus-Bench**, generalizing across geographies, modalities, and tasks, including classification, segmentation, and landslide detection.
---
## 🔬 Key Features
<p align="center">
<img src="images/spider_gb.jpg" alt="TerraFM Architecture" width="500"/>
</p>
- **Multimodal Pretraining**: Uses Sentinel-1 (SAR) and Sentinel-2 (L1C, L2A) as natural augmentations.
- **Large-Scale Dataset**: Trained on 18.7M global tiles from the [Major-TOM](https://huggingface.co/Major-TOM) dataset.
- **Cross-Attention Fusion**: Dynamically aggregates information across sensors at patch level.
- **Dual-Centering**: Mitigates long-tailed land cover bias using ESA WorldCover statistics.
- **Benchmark SOTA**: Outperforms prior FMs (Galileo, Prithvi, DOFA) across multiple EO tasks.
---
## 🧱 Architecture
<p align="center">
<img src="images/arch.jpg" alt="TerraFM Architecture" width="700"/>
</p>
Overall architecture of TerraFM. It combines a student-teacher contrastive framework with modality augmentation, cross-attention fusion, and a new dual-centering regularization. TerraFM builds on a ViT backbone, is pre-trained on 18.7M globally distributed samples, and uses large-tile inputs to encode broader spatial context. For illustration, RGB channels from S2-L2A and S2-L1C are shown, and S1 is visualized as a false-color RGB composite.
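
To make the data flow concrete, here is a minimal PyTorch sketch of modality-specific patch embeddings followed by patch-level cross-attention fusion. It is an illustration under stated assumptions, not the TerraFM implementation: the class names, the embedding width, the channel counts (2 for Sentinel-1, 12 for Sentinel-2 L2A), and the use of a single learned query per patch are all hypothetical choices.

```python
import torch
import torch.nn as nn


class ModalityPatchEmbed(nn.Module):
    """One patch-embedding layer per modality (hypothetical names and sizes)."""

    def __init__(self, in_chans: dict, embed_dim: int = 64, patch_size: int = 16):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Conv2d(c, embed_dim, kernel_size=patch_size, stride=patch_size)
            for name, c in in_chans.items()
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # (B, C, H, W) -> (B, D, H/P, W/P) -> (B, N, D)
        return self.proj[modality](x).flatten(2).transpose(1, 2)


class CrossAttentionFusion(nn.Module):
    """Fuse the modality tokens of each patch using one learned query (assumed design)."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, M, N, D) with M modalities and N patches
        b, m, n, d = tokens.shape
        # Each patch position attends over its own M modality tokens.
        kv = tokens.permute(0, 2, 1, 3).reshape(b * n, m, d)
        q = self.query.expand(b * n, -1, -1)
        fused, _ = self.attn(q, kv, kv)  # (B*N, 1, D)
        return fused.reshape(b, n, d)


embed = ModalityPatchEmbed({"s1": 2, "s2_l2a": 12})
fusion = CrossAttentionFusion()

s1 = torch.randn(2, 2, 224, 224)    # simulated Sentinel-1 batch
s2 = torch.randn(2, 12, 224, 224)   # simulated Sentinel-2 L2A batch
tokens = torch.stack([embed(s1, "s1"), embed(s2, "s2_l2a")], dim=1)
fused = fusion(tokens)
print(fused.shape)  # torch.Size([2, 196, 64])
```

The fused sequence has one token per patch regardless of how many modalities were supplied, so a standard ViT encoder could consume it unchanged.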
---
## 🧠 Model Zoo
| Model | Modality | Input Size | Backbone | Link |
|-------|----------|------------|--------|------|
| TerraFM-B | Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C | 224×224 | ViT-Base | [Download](https://huggingface.co/MBZUAI/TerraFM) |
| TerraFM-L | Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C | 224×224 | ViT-Large | [Download](https://huggingface.co/MBZUAI/TerraFM) |
---
## 🛠 Usage
TerraFM can be used directly via the `terrafm.py` module, which provides standalone implementations of the TerraFM-Base and TerraFM-Large models for easy integration into any codebase.
```python
import torch

from terrafm import terrafm_base, terrafm_large

# Simulated input: 1 sample, 12 channels, 224×224 resolution (e.g., Sentinel-2 L2A)
x = torch.randn(1, 12, 224, 224)

# Build the TerraFM-Base model (use terrafm_large() for TerraFM-L)
model = terrafm_base()

# Load pretrained weights (e.g., TerraFM-B.pth)
state_dict = torch.load("TerraFM-B.pth", map_location="cpu")
msg = model.load_state_dict(state_dict, strict=False)
print(msg)  # lists missing/unexpected keys; strict=False skips mismatches silently

# Forward pass in inference mode
model.eval()
with torch.no_grad():
    y = model(x)
print(f"Output shape: {y.shape}")
```
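
Because the snippet above loads weights with `strict=False`, a checkpoint whose keys do not match the model is accepted without an error. The sketch below shows how the returned `msg` object can be inspected to catch that case. A small `nn.Sequential` stands in for `terrafm_base()`, and the extra `head.weight` key is a made-up example, so only the `load_state_dict` behavior carries over.

```python
import torch
import torch.nn as nn

# Stand-in for terrafm_base(); any nn.Module behaves the same way here.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Simulate a checkpoint that lacks the final layer and carries one extra key.
state_dict = {k: v for k, v in model.state_dict().items() if not k.startswith("2.")}
state_dict["head.weight"] = torch.zeros(2, 8)  # hypothetical leftover key

msg = model.load_state_dict(state_dict, strict=False)
print("missing:", msg.missing_keys)        # parameters left at their initial values
print("unexpected:", msg.unexpected_keys)  # checkpoint entries that were ignored

if msg.missing_keys:
    print("Warning: some parameters were not loaded from the checkpoint.")
```

An empty `missing_keys` list is a reasonable sanity check before running evaluation with pretrained weights.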
---
## 📊 Results
### 🔍 k-NN Classification Results
We evaluate image classification using k-nearest neighbors (kNN) and report Top-1 accuracy for all single-label tasks. For the multilabel BigEarthNet benchmark, we report the F1 score.
| Model | Backbone | m-EuroSat (100%) | m-EuroSat (1%) | m-BigEarthNet (100%) | m-BigEarthNet (1%) | m-So2Sat (100%) | m-So2Sat (1%) | m-Brick-Kiln (100%) | m-Brick-Kiln (1%) |
|----------------|------------|------------------|----------------|------------------------|--------------------|------------------|----------------|----------------------|--------------------|
| SatMAE | ViT-Base | 84.1 | 34.8 | 50.6 | 29.0 | 36.0 | 23.1 | 86.1 | 73.5 |
| SatMAE++ | ViT-Large | 82.7 | 48.5 | 50.8 | 31.6 | 34.7 | 23.4 | 89.6 | 76.7 |
| CROMA | ViT-Base | 85.6 | 51.3 | 58.8 | 44.7 | 48.8 | 33.8 | 92.6 | 85.1 |
| SoftCon | ViT-Small | 89.8 | 27.2 | 64.7 | 43.3 | 51.1 | 31.4 | 89.2 | 77.8 |
| DOFA | ViT-Base | 82.8 | 49.6 | 49.4 | 29.9 | 41.4 | 29.4 | 88.3 | 78.3 |
| Satlas | Swin-Tiny | 81.7 | 35.8 | 51.9 | 29.6 | 36.6 | 27.1 | 88.2 | 73.0 |
| MMEarth | CNN-atto | 81.7 | 30.0 | 58.3 | 39.6 | 39.8 | 25.1 | 89.4 | 79.7 |
| DeCUR | ViT-Small | 89.0 | 46.6 | 63.8 | 49.6 | 45.8 | 30.9 | 83.7 | 74.2 |
| AnySat | ViT-Base | 82.2 | 47.1 | 54.9 | 33.7 | 39.8 | 29.0 | 85.3 | 72.0 |
| Galileo | ViT-Base | 93.0 | 56.6 | 59.0 | 36.5 | 54.8 | **43.2** | 90.7 | 78.0 |
| Prithvi-2.0 | ViT-Large | 80.2 | 48.0 | 49.4 | 28.8 | 29.5 | 26.1 | 87.9 | 80.6 |
| Copernicus-FM | ViT-Base | 76.0 | 47.4 | 53.8 | 33.3 | 38.4 | 23.3 | 93.0 | 83.2 |
| **TerraFM** | ViT-Base | _94.2_ | _59.3_ | _68.7_ | 49.4 | _55.1_ | _41.6_ | **94.5** | **85.6** |
|**TerraFM**| ViT-Large | **95.1** | **62.1** | **69.4** | **50.6** | **55.9** | 41.1 | _93.0_ | 82.2 |
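
For readers unfamiliar with the protocol, the following is a minimal pure-Python sketch of k-NN classification on frozen features with Top-1 accuracy. The cosine similarity measure, `k=3`, majority voting, and the toy 2-D features are assumptions for illustration; the exact settings behind the table above are not stated here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_feats, train_labels, k=3):
    """Majority vote over the k most similar training features."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine(query, train_feats[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def top1_accuracy(test_feats, test_labels, train_feats, train_labels, k=3):
    """Percentage of test samples whose k-NN prediction matches the label."""
    correct = sum(
        knn_predict(f, train_feats, train_labels, k) == y
        for f, y in zip(test_feats, test_labels)
    )
    return 100.0 * correct / len(test_labels)


# Toy features standing in for encoder embeddings of labeled images.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
train_labels = [0, 0, 0, 1, 1, 1]
test_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3], [0.3, 0.7]]
test_labels = [0, 1, 0, 1]

acc = top1_accuracy(test_feats, test_labels, train_feats, train_labels)
print(f"Top-1 accuracy: {acc:.1f}%")  # Top-1 accuracy: 100.0%
```

Since the encoder stays frozen, this kind of evaluation measures the quality of the pretrained representation rather than the effect of fine-tuning.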
### 🛰 Copernicus-Bench
Comparison of TerraFM with existing supervised and self-supervised methods on **Copernicus-Bench**.
Metrics include **OA** (Overall Accuracy), **mAP** (mean Average Precision), and **mIoU** (mean Intersection over Union).
| Dataset | Metric | Supervised | Random | SoftCon | CROMA | DOFA | Copernicus-FM | **TerraFM** |
|----------------|--------|------------|--------|---------|--------|------|----------------|-------------|
| **Backbone** | -- | ViT-B/16 | ViT-B/16 | ViT-B/14 | ViT-B/8 | ViT-B/16 | ViT-B/16 | ViT-B/16 |
| **Cloud-S2** | mIoU | 59.4 | 60.4 | 66.9 | 65.0 | 65.0 | 66.7 | **67.9** |
| **EuroSAT-S1** | OA | 81.5 | 75.4 | 83.6 | 83.9 | 81.7 | 87.2 | **87.8** |
| **EuroSAT-S2** | OA | 97.6 | 92.5 | 96.7 | 97.0 | 97.2 | 97.9 | **99.1** |
| **BigEarthNet-S1** | mAP | 70.6 | 63.8 | **78.7**| 70.8 | 70.5 | 77.9 | 76.9 |
| **BigEarthNet-S2** | mAP | 80.1 | 71.6 | 83.6 | 76.4 | 75.5 | 79.0 | **84.4** |
| **DFC2020-S1** | mIoU | 50.8 | 45.4 | 52.8 | 52.7 | 49.7 | 52.4 | **55.4** |
| **DFC2020-S2** | mIoU | 66.2 | 62.3 | 64.1 | **66.5**| 61.8 | 64.5 | 63.8 |
| **LCZ-S2** | OA | 85.3 | 77.4 | 83.6 | 84.1 | 83.0 | 84.4 | **87.0** |
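
As a reference for two of the metrics above, OA and mIoU can both be derived from a confusion matrix. The sketch below uses standard definitions on toy labels; the function names are illustrative, and mAP is omitted because it requires per-class confidence scores rather than hard predictions.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy per-pixel (or per-image) labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(f"OA={overall_accuracy(cm):.3f}, mIoU={mean_iou(cm):.3f}")  # OA=0.667, mIoU=0.500
```

The example shows why the two metrics can diverge: OA weights every sample equally, while mIoU averages per class and therefore penalizes errors on rare classes more heavily.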
### 🧪 GEO-Bench Performance
Performance comparison on GEO-Bench, covering **classification** (Top-1 accuracy, with **F1 score** for the multilabel m-BigEarthNet) and **segmentation** (mIoU).
TerraFM achieves state-of-the-art results across multiple datasets, outperforming previous foundation models.
| Method | Backbone | m-EuroSat | m-BigEarthNet | m-So2Sat | m-Brick-Kiln | m-Cashew-Plant | m-SA-Crop-Type |
|--------------|------------|-----------|----------------|----------|----------------|------------------|------------------|
| SatMAE | ViT-Large | 96.6 | 68.3 | 57.2 | 98.4 | 30.8 | 24.8 |
| SatMAE++ | ViT-Large | 96.5 | 67.9 | 56.0 | 98.6 | 29.6 | 25.7 |
| CROMA | ViT-Large | 96.6 | 71.9 | 60.6 | 98.7 | 31.8 | 32.0 |
| SoftCon | ViT-Base | 97.5 | 70.3 | 61.7 | 98.7 | 29.6 | 30.8 |
| DOFA | ViT-Large | 96.9 | 68.0 | 58.7 | 98.6 | 27.7 | 25.4 |
| Satlas | Swin-Base | 97.5 | 72.8 | 61.9 | **98.9** | 25.1 | 23.4 |
| MMEarth | CNN-atto | 95.7 | 70.0 | 57.2 | 98.9 | 24.2 | 22.2 |
| DeCUR | ViT-Small | 97.9 | 70.9 | 61.7 | 98.7 | 26.2 | 21.5 |
| Prithvi 2.0 | ViT-Large | 96.5 | 69.0 | 54.6 | 98.6 | 26.7 | 22.9 |
| AnySat | ViT-Base | 95.9 | 70.3 | 51.8 | 98.6 | 26.1 | 27.1 |
| Galileo | ViT-Base | 97.7 | 70.7 | 63.3 | 98.7 | 33.0 | 30.1 |
| **TerraFM** | ViT-Base | *98.1* | 72.6 | *64.9* | 98.7 | *34.1* | *33.0* |
| **TerraFM** | ViT-Large | **98.6** | **73.1** | **66.6** | **99.0** | **37.2** | **34.5** |
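
For the multilabel m-BigEarthNet column, an F1 score replaces Top-1 accuracy because each image can carry several labels at once. The sketch below computes a micro-averaged F1 over binary label matrices. The choice of micro averaging and the toy labels are assumptions; the averaging mode behind the reported numbers is not specified here.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all samples and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Three images, three candidate labels each (1 = label present).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

print(f"micro-F1 = {micro_f1(y_true, y_pred):.2f}")  # micro-F1 = 0.80
```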
### 🌋 Landslide Detection (Landslide4Sense)
Landslide detection performance on the **Landslide4Sense** test set.
Despite having significantly fewer parameters (120M vs. 300M), **TerraFM** achieves higher overall segmentation performance (+5.8 mIoU), with the largest gain on the landslide class (+11.6 IoU).
| Model | mIoU | IoU (Landslide) |
|------------------------|------|-----------------|
| Prithvi-EO-2.0 (300M) | 65.0 | 31.5 |
| **TerraFM (120M)** | **70.8** | **43.1** |
<p align="center">
<img src="images/ls4s_qual.jpg" alt="Landslide Detection" width="700"/>
</p>
---
## 📜 Citation
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
```bibtex
@article{danish2025terrafmscalablefoundationmodel,
title={TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation},
author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Muhammad Haris Khan and Rao Muhammad Anwer and Jorma Laaksonen and Fahad Shahbaz Khan and Salman Khan},
year={2025},
eprint={2506.06281},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.06281},
}
```
## 📨 Contact
If you have any questions, please create an issue on this repository or contact us at [email protected].