---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---
# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

<div align="center">

**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)**

[arXiv:2507.20388](https://arxiv.org/abs/2507.20388)

</div>

### Abstract

*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer*
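
To make the cross-modal fusion idea more concrete, the sketch below shows a generic cross-attention block in PyTorch in which queries come from the RGB stream and keys/values come from auxiliary-modality tokens. This is only an illustration of the general mechanism; the class, tensor shapes, and fusion strategy are assumptions made for exposition and do not reproduce the paper's actual CM-MSA.

```python
# Illustrative only: a generic cross-attention fusion block in PyTorch.
# This is NOT the paper's CM-MSA implementation; names, shapes, and the
# fusion strategy are assumptions made for exposition.
import torch
import torch.nn as nn

class CrossModalAttentionSketch(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Queries come from the RGB stream; keys/values from auxiliary modalities.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens:      (B, N, dim) tokens from the low-light RGB image
        # modality_tokens: (B, M, dim) tokens from auxiliary modalities (segmentation,
        #                  geometric cues, color information, ...), projected to `dim`
        fused, _ = self.attn(query=rgb_tokens, key=modality_tokens, value=modality_tokens)
        return self.norm(rgb_tokens + fused)  # residual fusion of the two streams

if __name__ == "__main__":
    block = CrossModalAttentionSketch(dim=64, heads=4)
    rgb = torch.randn(1, 256, 64)      # e.g. 16x16 patch tokens from the RGB image
    aux = torch.randn(1, 9 * 256, 64)  # tokens pooled from nine auxiliary modalities
    print(block(rgb, aux).shape)       # torch.Size([1, 256, 64])
```
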
## 🆕 Updates

- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! ArXiv paper version and HuggingFace demo coming soon; stay tuned!

## ⚙️ Setup and Testing

For best results, use a Linux machine with CUDA-capable GPUs.

To set up the environment, first run the provided setup script:

```bash
./environment_setup.sh
# or
bash environment_setup.sh
```

Note: in case of difficulties, make sure `environment_setup.sh` is executable by running:

```bash
chmod +x environment_setup.sh
```

The setup takes a couple of minutes to complete.

Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.
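
For orientation, here is a minimal sketch of the tensor plumbing for single-image inference, assuming `model` is an already-constructed and loaded ModalFormer network. The function name and file paths are placeholders, not part of the repository's API; use the official test scripts from the GitHub repository for reproducible results.

```python
# Minimal inference-plumbing sketch. `model` is assumed to be an already-loaded
# ModalFormer network (see the repository's test scripts); `enhance` and the file
# paths below are placeholders, not part of the official codebase.
import numpy as np
import torch
from PIL import Image

def enhance(model: torch.nn.Module, in_path: str, out_path: str, device: str = "cuda") -> None:
    # Read the low-light image as a (1, 3, H, W) float tensor in [0, 1].
    img = np.asarray(Image.open(in_path).convert("RGB"), dtype=np.float32) / 255.0
    inp = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        out = model(inp).clamp(0, 1)  # assumes the network maps low-light RGB to enhanced RGB
    result = (out[0].permute(1, 2, 0).cpu().numpy() * 255.0).astype(np.uint8)
    Image.fromarray(result).save(out_path)

# Example (placeholder paths):
# enhance(model, "low_light.png", "enhanced.png")
```
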

## 📚 Citation

```bibtex
@misc{brateanu2025modalformer,
  title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
  author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
  year={2025},
  eprint={2507.20388},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20388},
}
```

## 🙏 Acknowledgements

We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.

Paper: https://arxiv.org/pdf/2507.20388