TerraFM / README.md

Update README.md

3631173 verified 4 months ago

12.2 kB

	---
	license: apache-2.0
	datasets:
	- Major-TOM/Core-S2L2A
	- Major-TOM/Core-S2L1C
	- Major-TOM/Core-S1RTC
	tags:
	- Earth Observation
	- Foundation Model
	- Remote Sensing
	---
	# TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
	<p align="center">
	<img src="https://i.imgur.com/waxVImv.png" alt="Oryx TerraFM">
	</p>

	[![paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2506.06281)
	[![code](https://img.shields.io/badge/GitHub-Code-blue.svg)](https://github.com/mbzuai-oryx/TerraFM)
	[![Model Zoo](https://img.shields.io/badge/Model%20Zoo-HuggingFace-blue)](#🧠-model-zoo)

	---

	## 📢 Latest Updates
	- Jun-09-25: 🚀 Initial release of TerraFM codebase and pretrained models
	- Jun-09-25: 📄 Paper released on arXiv: [arxiv link](https://arxiv.org/abs/2506.06281). 🔥🔥

	---

	## 🌍 Overview

	TerraFM is a scalable foundation model designed for unified processing of multisensor Earth Observation (EO) data. Built on a ViT backbone and trained over 18.7M tiles (~23T pixels) from Sentinel-1 SAR and Sentinel-2 optical imagery, TerraFM unifies modality-specific inputs using:

	- 🧩 Modality-specific patch embeddings
	- 🌀 Adaptive cross-attention fusion
	- 🎯 Dual-centering regularization for long-tailed distributions

	TerraFM sets a new benchmark on GEO-Bench and Copernicus-Bench, demonstrating strong generalization across geographies, modalities, and tasks — including classification, segmentation, and landslide detection.

	---


	## 🔬 Key Features

	<p align="center">
	<img src="images/spider_gb.jpg" alt="TerraFM Architecture" width="500"/>
	</p>

	- Multimodal Pretraining: Uses Sentinel-1 (SAR) and Sentinel-2 (L1C, L2A) as natural augmentations.
	- Large-Scale Dataset: Trained on 18.7M global tiles from the [Major-TOM](https://huggingface.co/Major-TOM) dataset.
	- Cross-Attention Fusion: Dynamically aggregates information across sensors at patch level.
	- Dual-Centering: Mitigates long-tailed land cover bias using ESA WorldCover statistics.
	- Benchmark SOTA: Outperforms prior FMs (Galileo, Prithvi, DOFA) across multiple EO tasks.

	---
	## 🧱 Architecture

	<p align="center">
	<img src="images/arch.jpg" alt="TerraFM Architecture" width="700"/>
	</p>

	Overall architecture of TerraFM. It unifies student-teacher contrastive framework with modality augmentation with cross-attention fusion, and a new dual centering regularization. TerraFM is founded on ViT backbone and is trained on 18.7M globally distributed samples for pre-training and utilizes large-tile inputs for encoding broader spatial context. For illustration, RGB channels from S2-L2A and S2-L1C are selected, and S1 is visualized using a false-color RGB composite.

	---
	## 🧠 Model Zoo

	\| Model \| Modality \| Input Size \| Backbone \| Link \|
	\|-------\|----------\|------------\|--------\|------\|
	\| TerraFM-B \| Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C \| 224×224 \| ViT-Base \| [Download](https://huggingface.co/MBZUAI/TerraFM) \|
	\| TerraFM-L \| Sentinel-1 RTC + Sentinel-2 Level 2A + Sentinel-2 Level 1C \| 224×224 \| ViT-Large \| [Download](https://huggingface.co/MBZUAI/TerraFM) \|

	---

	## 🛠 Usage

	TerraFM can be used directly via the `terrafm.py` module, which provides standalone implementations of the TerraFM-Base and TerraFM-Large models for easy integration into any codebase.

	```python
	from terrafm import terrafm_base, terrafm_large
	import torch

	# Simulated input: 1 sample, 12 channels, 224×224 resolution (e.g., Sentinel-2 L2A)
	x = torch.randn(1, 12, 224, 224)

	# Load TerraFM-Base model
	model = terrafm_base()

	# Load pretrained weights (e.g., TerraFM-B.pth)
	state_dict = torch.load("TerraFM-B.pth", map_location="cpu")
	msg = model.load_state_dict(state_dict, strict=False)

	# Forward pass
	y = model(x)
	print(f"Output shape: {y.shape}")
	```
	---


	## 📊 Results

	### 🔍 k-NN Classification Results

	We evaluate image classification using k-nearest neighbors (kNN) and report Top-1 accuracy for all single-label tasks. For the multilabel BigEarthNet benchmark, we report the F1 score.

	\| Model \| Backbone \| m-EuroSat (100%) \| m-EuroSat (1%) \| m-BigEarthNet (100%) \| m-BigEarthNet (1%) \| m-So2Sat (100%) \| m-So2Sat (1%) \| m-Brick-Kiln (100%) \| m-Brick-Kiln (1%) \|
	\|----------------\|------------\|------------------\|----------------\|------------------------\|--------------------\|------------------\|----------------\|----------------------\|--------------------\|
	\| SatMAE \| ViT-Base \| 84.1 \| 34.8 \| 50.6 \| 29.0 \| 36.0 \| 23.1 \| 86.1 \| 73.5 \|
	\| SatMAE++ \| ViT-Large \| 82.7 \| 48.5 \| 50.8 \| 31.6 \| 34.7 \| 23.4 \| 89.6 \| 76.7 \|
	\| CROMA \| ViT-Base \| 85.6 \| 51.3 \| 58.8 \| 44.7 \| 48.8 \| 33.8 \| 92.6 \| 85.1 \|
	\| SoftCon \| ViT-Small \| 89.8 \| 27.2 \| 64.7 \| 43.3 \| 51.1 \| 31.4 \| 89.2 \| 77.8 \|
	\| DOFA \| ViT-Base \| 82.8 \| 49.6 \| 49.4 \| 29.9 \| 41.4 \| 29.4 \| 88.3 \| 78.3 \|
	\| Satlas \| Swin-Tiny \| 81.7 \| 35.8 \| 51.9 \| 29.6 \| 36.6 \| 27.1 \| 88.2 \| 73.0 \|
	\| MMEarth \| CNN-atto \| 81.7 \| 30.0 \| 58.3 \| 39.6 \| 39.8 \| 25.1 \| 89.4 \| 79.7 \|
	\| DeCUR \| ViT-Small \| 89.0 \| 46.6 \| 63.8 \| 49.6 \| 45.8 \| 30.9 \| 83.7 \| 74.2 \|
	\| AnySat \| ViT-Base \| 82.2 \| 47.1 \| 54.9 \| 33.7 \| 39.8 \| 29.0 \| 85.3 \| 72.0 \|
	\| Galileo \| ViT-Base \| 93.0 \| 56.6 \| 59.0 \| 36.5 \| 54.8 \| 43.2 \| 90.7 \| 78.0 \|
	\| Prithvi-2.0 \| ViT-Large \| 80.2 \| 48.0 \| 49.4 \| 28.8 \| 29.5 \| 26.1 \| 87.9 \| 80.6 \|
	\| Copernicus-FM \| ViT-Base \| 76.0 \| 47.4 \| 53.8 \| 33.3 \| 38.4 \| 23.3 \| 93.0 \| 83.2 \|
	\| TerraFM \| ViT-Base \| _94.2_ \| _59.3_ \| _68.7_ \| 49.4 \| _55.1_ \| _41.6_ \| 94.5 \| 85.6 \|
	\|TerraFM\| ViT-Large \| 95.1 \| 62.1 \| 69.4 \| 50.6 \| 55.9 \| 41.1 \| _93.0_ \| 82.2 \|


	### 🛰 Copernicus-Bench

	Comparison of TerraFM with existing supervised and self-supervised methods on Copernicus-Bench.
	Metrics include OA (Overall Accuracy), mAP (mean Average Precision), and mIoU (mean Intersection over Union).

	\| Dataset \| Metric \| Supervised \| Random \| SoftCon \| CROMA \| DOFA \| Copernicus-FM \| TerraFM \|
	\|----------------\|--------\|------------\|--------\|---------\|--------\|------\|----------------\|-------------\|
	\| Backbone \| -- \| ViT-B/16 \| ViT-B/16 \| ViT-B/14 \| ViT-B/8 \| ViT-B/16 \| ViT-B/16 \| ViT-B/16 \|
	\| Cloud-S2 \| mIoU \| 59.4 \| 60.4 \| 66.9 \| 65.0 \| 65.0 \| 66.7 \| 67.9 \|
	\| EuroSAT-S1 \| OA \| 81.5 \| 75.4 \| 83.6 \| 83.9 \| 81.7 \| 87.2 \| 87.8 \|
	\| EuroSAT-S2 \| OA \| 97.6 \| 92.5 \| 96.7 \| 97.0 \| 97.2 \| 97.9 \| 99.1 \|
	\| BigEarthNet-S1 \| mAP \| 70.6 \| 63.8 \| 78.7\| 70.8 \| 70.5 \| 77.9 \| 76.9 \|
	\| BigEarthNet-S2 \| mAP \| 80.1 \| 71.6 \| 83.6 \| 76.4 \| 75.5 \| 79.0 \| 84.4 \|
	\| DFC2020-S1 \| mIoU \| 50.8 \| 45.4 \| 52.8 \| 52.7 \| 49.7 \| 52.4 \| 55.4 \|
	\| DFC2020-S2 \| mIoU \| 66.2 \| 62.3 \| 64.1 \| 66.5\| 61.8 \| 64.5 \| 63.8 \|
	\| LCZ-S2 \| OA \| 85.3 \| 77.4 \| 83.6 \| 84.1 \| 83.0 \| 84.4 \| 87.0 \|

	### 🧪 GEO-Bench Performance

	Performance comparison on GEO-Bench for both classification (Top-1 Accuracy), segmentation (mIoU), and F1 score (for m-BigEarthNet).
	TerraFM achieves state-of-the-art results across multiple datasets, outperforming previous foundation models.

	\| Method \| Backbone \| m-EuroSat \| m-BigEarthNet \| m-So2Sat \| m-Brick-Kiln \| m-Cashew-Plant \| m-SA-Crop-Type \|
	\|--------------\|------------\|-----------\|----------------\|----------\|----------------\|------------------\|------------------\|
	\| SatMAE \| ViT-Large \| 96.6 \| 68.3 \| 57.2 \| 98.4 \| 30.8 \| 24.8 \|
	\| SatMAE++ \| ViT-Large \| 96.5 \| 67.9 \| 56.0 \| 98.6 \| 29.6 \| 25.7 \|
	\| CROMA \| ViT-Large \| 96.6 \| 71.9 \| 60.6 \| 98.7 \| 31.8 \| 32.0 \|
	\| SoftCon \| ViT-Base \| 97.5 \| 70.3 \| 61.7 \| 98.7 \| 29.6 \| 30.8 \|
	\| DOFA \| ViT-Large \| 96.9 \| 68.0 \| 58.7 \| 98.6 \| 27.7 \| 25.4 \|
	\| Satlas \| Swin-Base \| 97.5 \| 72.8 \| 61.9 \| 98.9 \| 25.1 \| 23.4 \|
	\| MMEarth \| CNN-atto \| 95.7 \| 70.0 \| 57.2 \| 98.9 \| 24.2 \| 22.2 \|
	\| DeCUR \| ViT-Small \| 97.9 \| 70.9 \| 61.7 \| 98.7 \| 26.2 \| 21.5 \|
	\| Prithvi 2.0 \| ViT-Large \| 96.5 \| 69.0 \| 54.6 \| 98.6 \| 26.7 \| 22.9 \|
	\| AnySat \| ViT-Base \| 95.9 \| 70.3 \| 51.8 \| 98.6 \| 26.1 \| 27.1 \|
	\| Galileo \| ViT-Base \| 97.7 \| 70.7 \| 63.3 \| 98.7 \| 33.0 \| 30.1 \|
	\| TerraFM \| ViT-Base \| 98.1 \| 72.6 \| 64.9 \| 98.7 \| 34.1 \| 33.0 \|
	\| TerraFM \| ViT-Large \| 98.6 \| 73.1 \| 66.6 \| 99.0 \| 37.2 \| 34.5 \|


	### 🌋 Landslide Detection (Landslide4Sense)

	Landslide detection performance on the Landslide4Sense test set.
	Despite having significantly fewer parameters (120M vs. 300M), TerraFM achieves higher overall segmentation performance, especially for landslide regions.
	\| Model \| mIoU \| IoU (Landslide) \|
	\|------------------------\|------\|-----------------\|
	\| Prithvi-EO-2.0 (300M) \| 65.0 \| 31.5 \|
	\| TerraFM (120M) \| 70.8 \| 43.1 \|

	<p align="center">
	<img src="images/ls4s_qual.jpg" alt="Landslide Detection" width="700"/>
	</p>
	---

	## 📜 Citation
	If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
	```bibtex
	@article{danish2025terrafmscalablefoundationmodel,
	title={TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation},
	author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Muhammad Haris Khan and Rao Muhammad Anwer and Jorma Laaksonen and Fahad Shahbaz Khan and Salman Khan},
	year={2025},
	eprint={2506.06281},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2506.06281},
	}
	```




	## 📨 Contact
	If you have any questions, please create an issue on this repository or contact at [email protected].