# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

This repository contains the MLLMSeg_InternVL2_5_1B_RES model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://arxiv.org/abs/2508.04107).

MLLMSeg targets referring expression segmentation: segmenting the image region specified by a natural-language referring expression. While Multimodal Large Language Models (MLLMs) excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction tasks such as segmentation. To address this, MLLMSeg proposes a framework that fully exploits the visual detail features already encoded by the MLLM's vision encoder, removing the need for an extra visual encoder. It further introduces a detail-enhanced and semantic-consistent feature fusion (DSFF) module that integrates these visual details with the semantic features produced by the Large Language Model (LLM). Finally, a lightweight mask decoder (only 34M parameters) turns the fused features into precise mask predictions. This design strikes a better balance between performance and computational cost than existing SAM-based and SAM-free methods.

The official code is available on GitHub: https://github.com/jcwang0602/MLLMSeg

## Model Architecture
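The snippet below is a minimal conceptual sketch (PyTorch) of the data flow described above: detail features from the MLLM's vision encoder and semantic features from the LLM are fused by a stand-in for the DSFF module and decoded into a mask by a small decoder. Module names, dimensions, and layer choices are illustrative assumptions, not the authors' implementation; see the GitHub repository for the actual architecture.

```python
import torch
import torch.nn as nn

class DetailSemanticFusion(nn.Module):
    """Conceptual stand-in for the DSFF module: fuses detail features from the
    MLLM's vision encoder with semantic features taken from the LLM."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.detail_proj = nn.Linear(dim, dim)    # hypothetical projections
        self.semantic_proj = nn.Linear(dim, dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, detail_feats, semantic_feats):
        # detail_feats, semantic_feats: [B, N, dim] token sequences on the same grid
        fused = torch.cat([self.detail_proj(detail_feats),
                           self.semantic_proj(semantic_feats)], dim=-1)
        return self.fuse(fused)

class LightMaskDecoder(nn.Module):
    """Toy lightweight decoder: upsamples fused tokens into a mask logit map."""
    def __init__(self, dim: int = 256, grid: int = 32):
        super().__init__()
        self.grid = grid
        self.head = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2), nn.GELU(),
            nn.Conv2d(dim // 4, 1, 1),
        )

    def forward(self, fused_tokens):
        b, n, c = fused_tokens.shape                              # n == grid * grid
        x = fused_tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return self.head(x)                                       # [B, 1, 4*grid, 4*grid] mask logits

# Example with a 32x32 token grid and 256-dim features
detail = torch.randn(1, 32 * 32, 256)
semantic = torch.randn(1, 32 * 32, 256)
mask_logits = LightMaskDecoder()(DetailSemanticFusion()(detail, semantic))
print(mask_logits.shape)  # torch.Size([1, 1, 128, 128])
```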

## Quick Start / How to Use

This section shows how to use the pre-trained model for inference. The model accepts images of any size as input. Predicted coordinates are normalized to a relative 0-1000 range (e.g., a bounding box given by its top-left and bottom-right corners), so for visualization you need to map them back to the original image dimensions, as in the sketch below.
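For example, a small helper like the following (the function name and the [x1, y1, x2, y2] box order are our own assumptions for illustration) converts a predicted box back to pixel coordinates:

```python
def denormalize_box(box, image_width, image_height):
    """Map a [x1, y1, x2, y2] box from the model's 0-1000 relative scale
    back to pixel coordinates in the original image."""
    x1, y1, x2, y2 = box
    return [
        x1 / 1000 * image_width,
        y1 / 1000 * image_height,
        x2 / 1000 * image_width,
        y2 / 1000 * image_height,
    ]

# Example: a predicted box on a 1280x720 image
print(denormalize_box([250, 100, 750, 900], 1280, 720))
# [320.0, 72.0, 960.0, 648.0]
```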

### Installation

First, install the transformers library and the other required dependencies. Note that flash-attn requires a GPU to build.

```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # adjust for your CUDA version
pip install -r requirements.txt  # requirements.txt from the cloned GitHub repository
pip install flash-attn==2.3.6 --no-build-isolation  # note: requires a GPU to build
```
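Optionally, a quick sanity check (our own snippet, not part of the repository) confirms that PyTorch sees the GPU and that flash-attn built correctly:

```python
import torch

print(torch.__version__, torch.cuda.is_available())  # expect True on a GPU machine
try:
    import flash_attn  # only importable if the GPU build above succeeded
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; check the repository for supported attention backends")
```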

### Usage

For the complete inference and evaluation scripts, refer to the GitHub README. The model's response contains the segmentation coordinates in the normalized 0-1000 format described above; parse them and visualize the mask following the paper's methodology or the repository's example scripts.
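As a starting point, the checkpoint can be loaded with the standard transformers API. The sketch below assumes the InternVL2.5-style remote-code interface of the base model; the exact image preprocessing and segmentation entry point are defined by the repository's scripts, so follow the GitHub README for the authoritative pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "jcwang0602/MLLMSeg_InternVL2_5_1B_RES"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,  # the released weights are BF16/F16
    trust_remote_code=True,
).eval().cuda()

# Hypothetical call pattern, modeled on InternVL-style chat interfaces;
# consult the repository's inference scripts for the real preprocessing and API:
# pixel_values = preprocess(image).to(torch.bfloat16).cuda()
# response = model.chat(tokenizer, pixel_values,
#                       "Please segment the dog on the left.",
#                       dict(max_new_tokens=256))
# The response contains coordinates in the normalized 0-1000 format described above.
```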


## Performance Metrics

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">

## Visualization

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">

## Citation
If our work is useful for your research, please consider citing:

```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
      title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder}, 
      author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
      year={2025},
      eprint={2508.04107},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04107}, 
}
```