license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - name: PSNR
      type: PSNR
      value: 27.97
    - name: SSIM
      type: SSIM
      value: 0.897
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - name: PSNR
      type: PSNR
      value: 29.33
    - name: SSIM
      type: SSIM
      value: 0.915
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - name: PSNR
      type: PSNR
      value: 30.15
    - name: SSIM
      type: SSIM
      value: 0.951
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - name: PSNR
      type: PSNR
      value: 31.37
    - name: SSIM
      type: SSIM
      value: 0.917
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - name: PSNR
      type: PSNR
      value: 31.73
    - name: SSIM
      type: SSIM
      value: 0.904
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - name: NIQE
      type: NIQE
      value: 3.44
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - name: NIQE
      type: NIQE
      value: 3.82
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - name: NIQE
      type: NIQE
      value: 3.64
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - name: NIQE
      type: NIQE
      value: 3.55
✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
Abstract
Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer
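The abstract does not spell out how CM-MSA combines the RGB and auxiliary streams, so the following is only a generic single-head cross-attention sketch, not the authors' mechanism. It assumes that queries come from RGB tokens and that keys and values come from auxiliary-modality tokens, and it omits learned projections. The function name `cross_attention` and the toy inputs are hypothetical; plain Python lists are used so the example runs without dependencies, whereas a real implementation would use a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(rgb_tokens, modality_tokens):
    """Generic single-head cross-attention (illustrative, not the paper's CM-MSA).

    Assumption: queries are the RGB tokens; keys and values are the
    auxiliary-modality tokens. Learned projections are left out (identity).
    """
    dim = len(rgb_tokens[0])
    scale = math.sqrt(dim)
    # Similarity of every RGB token to every auxiliary token.
    scores = matmul(rgb_tokens, transpose(modality_tokens))
    # One attention distribution per RGB token.
    attn = [softmax([s / scale for s in row]) for row in scores]
    # Each RGB token receives a weighted mix of auxiliary features.
    fused = matmul(attn, modality_tokens)
    return attn, fused

# Toy data: 3 RGB tokens and 2 auxiliary tokens, feature dimension 2.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aux = [[1.0, 0.0], [0.0, 1.0]]
attn, fused = cross_attention(rgb, aux)
```

Here `attn` holds one row of weights per RGB token, and `fused` holds the auxiliary features each RGB token attends to. Consult the paper and repository for the actual CM-MSA design.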
🆕 Updates
29.07.2025
🎉 The ModalFormer paper is now available! Check it out and explore our results and methodology.
28.07.2025
📦 Pre-trained models and test data published! ArXiv paper version and HuggingFace demo coming soon, stay tuned!
⚙️ Setup and Testing
Please check out the GitHub repository (https://github.com/albrateanu/ModalFormer) for setup instructions and implementation details.
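PSNR is one of the metrics listed in this model card's metadata. As a rough aid for sanity-checking outputs, here is a minimal sketch of the standard PSNR definition, 10 · log10(MAX² / MSE). It assumes 8-bit pixel values and images flattened into equal-length lists; the function name `psnr` and the sample values are illustrative. The repository's evaluation script may differ in details such as color space or cropping, so scores computed this way are not guaranteed to match the reported numbers.

```python
import math

def psnr(reference, enhanced, max_value=255.0):
    """Return PSNR in dB between two images given as flat pixel lists.

    Assumption: both inputs share the same value range (0..max_value).
    """
    if not reference or len(reference) != len(enhanced):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 grayscale patch, flattened row by row.
reference = [52, 55, 61, 59]
enhanced = [54, 55, 60, 58]
score = psnr(reference, enhanced)
```

Higher values indicate that the enhanced image is closer to the reference; identical images yield infinity.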
📚 Citation
@misc{brateanu2025modalformer,
title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
year={2025},
eprint={2507.20388},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.20388},
}
🙏 Acknowledgements
We use this codebase as the foundation for our implementation.