nielsr (HF Staff) committed
Commit cd225e8 · verified · 1 parent: 1e22131

Improve model card for MLLMSeg_InternVL2_5_1B_RES


This PR significantly enhances the model card for the `MLLMSeg_InternVL2_5_1B_RES` model by:

- Adding the `pipeline_tag: image-segmentation`, which makes the model discoverable on the Hugging Face Hub under this pipeline (https://huggingface.co/models?pipeline_tag=image-segmentation).
- Specifying `library_name: transformers`, enabling the "How to use" widget on the model page for easier integration.
- Linking directly to the paper: [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).
- Including a direct link to the official GitHub repository for access to the code.
- Adding a comprehensive "How to Use" section with a runnable Python code snippet for inference, along with important notes regarding input/output processing.
- Incorporating performance metrics and visualization examples from the original GitHub repository to provide a clearer understanding of the model's capabilities.

This update aims to greatly improve the model's visibility and usability for the community.

Files changed (1)

README.md (+181 -3)

The previous README.md contained only the `license: mit` front matter; the updated file is shown in full below.
---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
---

# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

This repository contains the `MLLMSeg_InternVL2_5_1B_RES` model, presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).

**MLLMSeg** segments the image region specified by a referring expression. While Multimodal Large Language Models (MLLMs) are strong at semantic understanding, their token-generation interface struggles with pixel-level dense prediction tasks such as segmentation. MLLMSeg therefore proposes a framework that fully leverages the visual detail features already encoded by the MLLM's vision encoder, eliminating the need for an extra visual encoder. A detail-enhanced and semantic-consistent feature fusion (DSFF) module integrates these visual details with the semantic features produced by the Large Language Model (LLM), and a lightweight mask decoder (only 34M parameters) turns the fused features into precise mask predictions. This strikes a better balance between performance and computational cost than existing SAM-based and SAM-free methods.

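The DSFF module and the 34M-parameter mask decoder are implemented in the official repository; the sketch below is only a rough, hypothetical illustration of the idea (module names, channel sizes, and the fusion scheme are assumptions, not the released code). It fuses grid-shaped detail features from the vision encoder with a pooled semantic embedding from the LLM and upsamples the result into mask logits.

```python
import torch
import torch.nn as nn

class ToyDetailSemanticDecoder(nn.Module):
    """Illustrative only: fuse vision-encoder detail features with an LLM semantic
    embedding, then predict a segmentation mask. Not the released MLLMSeg decoder."""

    def __init__(self, detail_dim=1024, semantic_dim=896, hidden_dim=256):
        super().__init__()
        self.detail_proj = nn.Conv2d(detail_dim, hidden_dim, kernel_size=1)
        self.semantic_proj = nn.Linear(semantic_dim, hidden_dim)
        self.fuse = nn.Sequential(                     # toy stand-in for the DSFF fusion step
            nn.Conv2d(2 * hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.up = nn.Sequential(                       # two 2x upsampling stages
            nn.ConvTranspose2d(hidden_dim, hidden_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden_dim // 2, hidden_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
        )
        self.mask_head = nn.Conv2d(hidden_dim // 4, 1, kernel_size=1)

    def forward(self, detail_feats, semantic_emb):
        # detail_feats: (B, C_v, H, W) grid features from the MLLM's vision encoder
        # semantic_emb: (B, C_l) pooled segmentation embedding from the LLM
        d = self.detail_proj(detail_feats)             # (B, hidden, H, W)
        s = self.semantic_proj(semantic_emb)           # (B, hidden)
        s = s[:, :, None, None].expand_as(d)           # broadcast the semantic vector over the grid
        x = self.fuse(torch.cat([d, s], dim=1))        # detail + semantic fusion
        return self.mask_head(self.up(x))              # (B, 1, 4H, 4W) mask logits

# Shape check with dummy tensors
decoder = ToyDetailSemanticDecoder()
print(decoder(torch.randn(2, 1024, 32, 32), torch.randn(2, 896)).shape)  # torch.Size([2, 1, 128, 128])
```

For the actual architecture and trained decoder weights, refer to the repository linked below.
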
The official code is available on GitHub: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg)

## Model Architecture
<p align="center">
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/method.png" width="800">
</p>

## Quick Start / How to Use

This section shows how to run the pre-trained model for inference. The model accepts images of any size as input. Its outputs are normalized to relative coordinates in the 0-1000 range (e.g., a bounding box defined by its top-left and bottom-right corners); for visualization, you need to convert these relative coordinates back to the original image dimensions, as in the sketch below.

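As a minimal illustration of that conversion (a helper written for this card, not part of the released code), a coordinate normalized to 0-1000 maps back to pixels by scaling with `width / 1000` and `height / 1000`:

```python
def denormalize_box(box_0_1000, image_width, image_height):
    """Map an (x1, y1, x2, y2) box given in 0-1000 relative coordinates
    back to integer pixel coordinates on the original image."""
    x1, y1, x2, y2 = box_0_1000
    return (
        int(x1 / 1000 * image_width),
        int(y1 / 1000 * image_height),
        int(x2 / 1000 * image_width),
        int(y2 / 1000 * image_height),
    )

# Example: a box covering the central quarter of a 1920x1080 image
print(denormalize_box((250, 250, 750, 750), 1920, 1080))  # (480, 270, 1440, 810)
```
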
### Installation

First, create an environment, clone the repository, and install the dependencies (including `transformers`). Note that `flash-attn` requires a GPU for installation.

```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # adjust for your CUDA version
git clone https://github.com/jcwang0602/MLLMSeg.git && cd MLLMSeg  # provides requirements.txt
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation  # note: requires a GPU to install
```

### Inference Code Example

```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tiling grids (i x j tiles)
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the grid whose aspect ratio is closest to the input image
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image and split it into image_size x image_size tiles
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load the model and tokenizer.
# Note: trust_remote_code=True is required for this model architecture.
model_path = 'jcwang0602/MLLMSeg_InternVL2_5_1B_RES'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# Example image (replace with your image path).
# You can find example images in the MLLMSeg GitHub repository, e.g., in the 'examples/images' directory.
image_path = './path/to/your/image.png'
pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# Example query for referring expression segmentation
question = "Please segment the person in the image."  # replace with your own referring expression
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# The response contains the segmentation coordinates normalized to the 0-1000 range.
# Parse these coordinates and visualize the result as per the paper's methodology or
# example scripts (a parsing sketch follows below).
```
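The exact textual format of the reply depends on the checkpoint's prompt template, so treat the following as a sketch only: assuming the reply embeds bracketed quadruples such as `[[x1, y1, x2, y2]]` in the 0-1000 relative range (as described above), they can be extracted and drawn on the original image like this.

```python
import re
from PIL import Image, ImageDraw

def parse_and_draw(response: str, image_path: str, out_path: str = "prediction.png"):
    """Extract 0-1000 normalized coordinate quadruples from the model's reply and
    draw them on the original image. The regex is an assumption about the output
    format; adapt it to what your checkpoint actually produces."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    draw = ImageDraw.Draw(image)
    for x1, y1, x2, y2 in re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", response):
        # rescale from the 0-1000 relative range to pixel coordinates
        box = (int(x1) / 1000 * w, int(y1) / 1000 * h, int(x2) / 1000 * w, int(y2) / 1000 * h)
        draw.rectangle(box, outline=(255, 0, 0), width=3)
    image.save(out_path)
    return out_path

# parse_and_draw(response, image_path)  # run after the inference snippet above
```

For mask-level visualization, see the example scripts in the official repository.
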
## Performance Metrics

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">

## Visualization

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">

## Citation

If our work is useful for your research, please consider citing:

```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
      title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder},
      author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
      year={2025},
      eprint={2508.04107},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04107},
}
```