|
--- |
|
library_name: transformers |
|
tags: |
|
- keypoint-matching |
|
license: apache-2.0 |
|
--- |
|
|
|
# MatchAnything-ELOFTR |
|
|
|
The MatchAnything-ELOFTR model was proposed in **"MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training"** by Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, and Xiaowei Zhou from Zhejiang University and Shandong University. |
|
|
|
This model is a version of **ELOFTR** enhanced by the MatchAnything pre-training framework. This framework enables the model to achieve universal cross-modality image matching capabilities, overcoming the significant challenge of matching images with drastic appearance changes due to different imaging principles (e.g., thermal vs. visible, CT vs. MRI). This is achieved by pre-training on a massive, diverse dataset synthesized with cross-modal stimulus signals, teaching the model to recognize fundamental, appearance-insensitive structures. |
|
|
|
The abstract from the paper is the following: |
|
|
|
"Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence (AI) analysis and beyond." |
|
|
|
|
|
 |
|
|
|
This model was contributed by [stevenbucaille](https://huggingface.co/stevenbucaille). |
|
The original code for the MatchAnything project can be found [here](https://github.com/zju3dv/MatchAnything). |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
**MatchAnything-ELOFTR** is a semi-dense feature matcher that has been pre-trained using the novel MatchAnything framework to give it powerful generalization capabilities for cross-modality tasks. The core innovations stem from the training framework, not the model architecture itself, which remains that of ELOFTR. |
|
|
|
The key innovations of the MatchAnything framework include: |
|
- A **multi-resource dataset mixture training engine** that combines various data sources to ensure diversity. This includes multi-view images with 3D reconstructions, large-scale unlabelled video sequences, and vast single-image datasets. |
|
- A **cross-modality stimulus data generator** that uses image generation techniques (like style transfer and depth estimation) to create synthetic, pixel-aligned cross-modal training pairs (e.g., visible-to-thermal, visible-to-depth); a toy illustration of this idea is sketched after this list.
|
- This process trains the model to learn **appearance-insensitive, fundamental image structures**, allowing a single set of model weights to perform robustly on over eight different and completely unseen cross-modal matching tasks. |
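To make the stimulus-data idea concrete, the toy sketch below produces a pixel-aligned (visible, depth) pair from a single photo using a monocular depth model. The `depth-estimation` pipeline and the `depth-anything/Depth-Anything-V2-Small-hf` checkpoint are choices made for this illustration, not necessarily what the MatchAnything data engine uses.

```python
from transformers import pipeline
from transformers.image_utils import load_image

# Monocular depth prediction is pixel-aligned with its input image, so the
# (visible, depth) pair comes with exact ground-truth correspondences "for free".
depth_estimator = pipeline(
    "depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf"
)

visible = load_image(
    "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg"
)
pseudo_depth = depth_estimator(visible)["depth"]  # PIL image at the input resolution

# `visible` and `pseudo_depth` now form one synthetic cross-modal training pair
```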
|
|
|
- **Developed by:** ZJU3DV at Zhejiang University & Shandong University |
|
- **Model type:** Image Matching |
|
- **License:** Apache 2.0 |
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/zju3dv/MatchAnything |
|
- **Project page:** https://zju3dv.github.io/MatchAnything/ |
|
- **Paper:** https://huggingface.co/papers/2501.07556 |
|
|
|
## Uses |
|
|
|
MatchAnything-ELOFTR is designed for a vast array of applications requiring robust image matching, especially between different sensor types or imaging modalities. Its direct uses include: |
|
- **Medical Image Analysis**: Aligning CT-MR, PET-MR, and SPECT-MR scans. |
|
- **Histopathology**: Registering tissue images with different stains (e.g., H&E and IHC). |
|
- **Remote Sensing**: Matching satellite/aerial images from different sensors (e.g., Visible-SAR, Thermal-Visible). |
|
- **Autonomous Systems**: Enhancing localization and navigation for UAVs and autonomous vehicles by matching thermal or visible images to vectorized maps. |
|
- **Single-Modality Tasks**: The model also retains strong performance on standard single-modality matching, such as retina image registration.
|
|
|
### Direct Use |
|
|
|
Here is a quick example of using the model for matching a pair of images. |
|
|
|
_Make sure to install transformers from the following commit: a fix for this model has been merged on main but is not yet part of a released version._
|
```bash
|
uv pip install "git+https://github.com/huggingface/transformers@22e89e538529420b2ddae6af70865655bc5c22d8" |
|
``` |
|
|
|
```python |
|
from transformers import AutoImageProcessor, AutoModelForKeypointMatching |
|
from transformers.image_utils import load_image |
|
import torch |
|
|
|
# Load a pair of images |
|
image1 = load_image("https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg") |
|
image2 = load_image("https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg") |
|
|
|
images = [image1, image2] |
|
|
|
# Load the processor and model from the Hugging Face Hub |
|
processor = AutoImageProcessor.from_pretrained("zju-community/matchanything_eloftr") |
|
model = AutoModelForKeypointMatching.from_pretrained("zju-community/matchanything_eloftr") |
|
|
|
# Process images and get model outputs |
|
inputs = processor(images, return_tensors="pt") |
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
``` |
|
|
|
You can use the `post_process_keypoint_matching` method from the `EfficientLoFTRImageProcessor` to get the keypoints and matches in a readable format:
|
```python |
|
image_sizes = [[(image.height, image.width) for image in images]] |
|
outputs = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2) |
|
for i, output in enumerate(outputs): |
|
print("For the image pair", i) |
|
for keypoint0, keypoint1, matching_score in zip( |
|
output["keypoints0"], output["keypoints1"], output["matching_scores"] |
|
): |
|
print( |
|
f"Keypoint at coordinate {keypoint0.numpy()} in the first image matches with keypoint at coordinate {keypoint1.numpy()} in the second image with a score of {matching_score}." |
|
) |
|
``` |
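The matched keypoints can feed directly into downstream geometry estimation. As an example reusing `outputs` from the post-processing step above, the sketch below estimates a homography with OpenCV's RANSAC; the use of OpenCV and the reprojection threshold are choices made for this illustration, not part of the model's API.

```python
import cv2
import numpy as np

# Take the matches of the first (and only) image pair
match = outputs[0]
keypoints0 = match["keypoints0"].numpy().astype(np.float32)
keypoints1 = match["keypoints1"].numpy().astype(np.float32)

# Estimate a homography between the two images with RANSAC
if len(keypoints0) >= 4:
    H, inlier_mask = cv2.findHomography(keypoints0, keypoints1, cv2.RANSAC, 3.0)
    print(f"Estimated homography with {int(inlier_mask.sum())} / {len(keypoints0)} inlier matches")
```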
|
|
|
You can also visualize the matches between the images: |
|
|
|
```python |
|
plot_images = processor.visualize_keypoint_matching(images, outputs) |
|
``` |
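The returned plots can then be saved or displayed. A minimal sketch, assuming `plot_images` is a list with one PIL-like image per pair (as the variable name suggests):

```python
# Hedged sketch: assumes `plot_images` contains one PIL image per image pair
for pair_index, plot_image in enumerate(plot_images):
    plot_image.save(f"matched_pair_{pair_index}.png")
```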
|
|
|
 |
|
|
|
## Training Details |
|
MatchAnything-ELOFTR is trained end-to-end using the large-scale, cross-modality pre-training framework. |
|
|
|
### Training Data |
|
The model was not trained on a single dataset but on a massive collection generated by the Multi-Resources Data Mixture Training framework, totaling approximately 800 million image pairs. This framework leverages: |
|
- **Multi-View Images with Geometry**: Datasets like MegaDepth, ScanNet++, and BlendedMVS provide realistic viewpoint changes with ground-truth depth.

- **Video Sequences**: The DL3DV-10k dataset is used, with pseudo ground-truth matches generated between distant frames via a novel coarse-to-fine strategy.

- **Single-Image Datasets**: Large datasets like GoogleLandmark and SA-1B are used with synthetic homography warping to maximize data diversity (see the sketch after this list).

- **Cross-Modality Stimulus Data**: A key component where training pairs are augmented by generating synthetic modalities (thermal, nighttime, depth maps) from visible light images using models like CycleGAN and DepthAnything, encouraging the matcher to learn appearance-invariant features.
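As an illustration of the single-image homography strategy, the sketch below warps one photo with a random homography so that the resulting pair comes with exact ground-truth correspondences. This is only a toy version of the idea: the `example.jpg` path is a placeholder, and the corner-perturbation range and OpenCV-based implementation are choices made for the example, not the project's actual data engine.

```python
import cv2
import numpy as np

def random_homography_pair(image, max_offset=0.15):
    """Warp `image` with a random homography to create a synthetic training pair.

    The returned homography maps pixel coordinates in the original image to
    pixel coordinates in the warped image, giving exact ground-truth matches.
    """
    h, w = image.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Randomly perturb each corner by up to `max_offset` of the image size
    offsets = (np.random.rand(4, 2) - 0.5) * 2 * max_offset * np.float32([w, h])
    warped_corners = corners + offsets.astype(np.float32)
    H = cv2.getPerspectiveTransform(corners, warped_corners)
    warped = cv2.warpPerspective(image, H, (w, h))
    return warped, H

image = cv2.imread("example.jpg")  # placeholder: any single image from an unlabelled dataset
warped_image, H_gt = random_homography_pair(image)

# Ground-truth correspondence for any pixel (x, y) in the original image:
pt = np.array([100.0, 200.0, 1.0])
x_w, y_w, z_w = H_gt @ pt
print("(100, 200) maps to", (x_w / z_w, y_w / z_w))
```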
|
|
|
### Training Procedure |
|
#### Training Hyperparameters |
|
|
|
- **Optimizer:** AdamW

- **Initial Learning Rate:** 8×10⁻³

- **Batch Size:** 64

- **Training Hardware:** 16 NVIDIA A100-80G GPUs

- **Training Time:** Approximately 4.3 days for the ELOFTR variant
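For orientation, these settings map onto a standard PyTorch optimizer configuration roughly as follows. This is only a sketch of the reported hyperparameters, not the project's training code: the loss, learning-rate schedule, distributed setup, and `mixed_dataset` placeholder are omitted or hypothetical.

```python
import torch
from transformers import AutoModelForKeypointMatching

model = AutoModelForKeypointMatching.from_pretrained("zju-community/matchanything_eloftr")

# Reported settings: AdamW with an initial learning rate of 8e-3 and a batch size of 64
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-3)
# train_loader = torch.utils.data.DataLoader(mixed_dataset, batch_size=64, shuffle=True)
```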
|
|
|
#### Speeds, Sizes, Times |
|
Since the MatchAnything framework only changes the training process and weights, the model's architecture and running time are identical to the original ELOFTR model. |
|
- **Speed:** For a 640x480 resolution image pair on a single NVIDIA RTX 3090 GPU, the model takes 40 ms to process.
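A rough way to measure latency yourself with the Hugging Face checkpoint is sketched below. It reuses `processor`, `model`, and `images` from the earlier example and assumes a CUDA-capable GPU; since the processor resizes inputs to its own default resolution, the numbers will not exactly match the paper's 640x480 benchmark.

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
inputs = processor(images, return_tensors="pt").to(device)

with torch.inference_mode():
    # Warm-up so lazy CUDA initialization does not pollute the measurement
    for _ in range(5):
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / 20

print(f"Average forward pass: {elapsed * 1000:.1f} ms")
```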
|
|
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
|
@article{he2025matchanything, |
|
title={MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training}, |
|
author={Xingyi He and Hao Yu and Sida Peng and Dongli Tan and Zehong Shen and Hujun Bao and Xiaowei Zhou}, |
|
year={2025}, |
|
eprint={2501.07556}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV} |
|
} |
|
``` |
|
|
|
## Model Card Authors |
|
|
|
[Steven Bucaille](https://github.com/sbucaille) |