---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---
# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

<div align="center">

**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)**

[arXiv:2507.20388](https://arxiv.org/abs/2507.20388)

</div>

### Abstract

*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer*
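
To make the cross-modal fusion idea more concrete, the sketch below shows a generic cross-attention block in PyTorch in which queries come from the RGB stream and keys/values come from auxiliary-modality tokens. This is only an illustration of the general mechanism; the class, tensor shapes, and fusion strategy are assumptions made for exposition and do not reproduce the paper's actual CM-MSA.

```python
# Illustrative only: a generic cross-attention fusion block in PyTorch.
# This is NOT the paper's CM-MSA implementation; names, shapes, and the
# fusion strategy are assumptions made for exposition.
import torch
import torch.nn as nn

class CrossModalAttentionSketch(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Queries come from the RGB stream; keys/values from auxiliary modalities.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens:      (B, N, dim) tokens from the low-light RGB image
        # modality_tokens: (B, M, dim) tokens from auxiliary modalities (segmentation,
        #                  geometric cues, color information, ...), projected to `dim`
        fused, _ = self.attn(query=rgb_tokens, key=modality_tokens, value=modality_tokens)
        return self.norm(rgb_tokens + fused)  # residual fusion of the two streams

if __name__ == "__main__":
    block = CrossModalAttentionSketch(dim=64, heads=4)
    rgb = torch.randn(1, 256, 64)      # e.g. 16x16 patch tokens from the RGB image
    aux = torch.randn(1, 9 * 256, 64)  # tokens pooled from nine auxiliary modalities
    print(block(rgb, aux).shape)       # torch.Size([1, 256, 64])
```
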
## 🆕 Updates

- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! ArXiv paper version and HuggingFace demo coming soon; stay tuned!

## ⚙️ Setup and Testing

For best results, use a Linux machine with CUDA-capable GPUs.

To set up the environment, first run the provided setup script:

```bash
./environment_setup.sh
# or
bash environment_setup.sh
```

Note: in case of difficulties, make sure `environment_setup.sh` is executable by running:

```bash
chmod +x environment_setup.sh
```

The setup takes a couple of minutes to complete.

Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.
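
For orientation, here is a minimal sketch of the tensor plumbing for single-image inference, assuming `model` is an already-constructed and loaded ModalFormer network. The function name and file paths are placeholders, not part of the repository's API; use the official test scripts from the GitHub repository for reproducible results.

```python
# Minimal inference-plumbing sketch. `model` is assumed to be an already-loaded
# ModalFormer network (see the repository's test scripts); `enhance` and the file
# paths below are placeholders, not part of the official codebase.
import numpy as np
import torch
from PIL import Image

def enhance(model: torch.nn.Module, in_path: str, out_path: str, device: str = "cuda") -> None:
    # Read the low-light image as a (1, 3, H, W) float tensor in [0, 1].
    img = np.asarray(Image.open(in_path).convert("RGB"), dtype=np.float32) / 255.0
    inp = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        out = model(inp).clamp(0, 1)  # assumes the network maps low-light RGB to enhanced RGB
    result = (out[0].permute(1, 2, 0).cpu().numpy() * 255.0).astype(np.uint8)
    Image.fromarray(result).save(out_path)

# Example (placeholder paths):
# enhance(model, "low_light.png", "enhanced.png")
```
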

## 📚 Citation

```bibtex
@misc{brateanu2025modalformer,
  title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
  author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
  year={2025},
  eprint={2507.20388},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20388},
}
```

## 🙏 Acknowledgements

We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.

Paper: https://arxiv.org/pdf/2507.20388