Support fine-tuning (#7)
- Support fine-tuning (d94acaf1a57be2532adcb39c31836b80a21c043b)
Co-authored-by: tastelikefeet <[email protected]>
README.md
CHANGED
@@ -3762,6 +3762,48 @@ The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) English t

**More detailed experimental results can be found in the [paper](http://arxiv.org/abs/2412.16855)**.

## Community support

### Fine-tuning

GME models can be fine-tuned with SWIFT (the `ms-swift` package):

```shell
pip install ms-swift -U
```

```shell
# Set MAX_PIXELS to reduce GPU memory usage.
# See: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
nproc_per_node=8
MAX_PIXELS=1003520 \
USE_HF=1 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model Alibaba-NLP/gme-Qwen2-VL-7B-Instruct \
    --train_type lora \
    --dataset 'HuggingFaceM4/TextCaps:emb' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps $(expr 64 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --eval_strategy steps \
    --save_total_limit 5 \
    --logging_steps 5 \
    --output_dir output \
    --lazy_tokenize true \
    --warmup_ratio 0.05 \
    --learning_rate 5e-6 \
    --deepspeed zero3 \
    --dataloader_num_workers 4 \
    --task_type embedding \
    --loss_type infonce \
    --dataloader_drop_last true
```

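A couple of notes on the command above. The effective batch size works out to be independent of the GPU count: `per_device_train_batch_size` (2) × `NPROC_PER_NODE` × `gradient_accumulation_steps` (64 / `NPROC_PER_NODE`) = 128 examples per optimizer step. The `--loss_type infonce` option trains the embedding with an InfoNCE-style contrastive objective, where each query is scored against its paired positive and the other in-batch candidates. The snippet below is a minimal PyTorch sketch of that objective for intuition only; it is not SWIFT's implementation, and the embedding dimension and temperature are illustrative assumptions.

```python
# Minimal sketch of an InfoNCE-style contrastive loss for embedding training.
# Illustrative only: not SWIFT's implementation; shapes and temperature are assumptions.
import torch
import torch.nn.functional as F


def infonce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive for query i.
    All other rows in the batch act as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # the positive sits on the diagonal
    return F.cross_entropy(logits, labels)


# Toy usage with random vectors standing in for GME query/document embeddings.
queries = torch.randn(4, 1536)
docs = torch.randn(4, 1536)
print(infonce_loss(queries, docs))
```
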
## Limitations

- **Single Image Input**: In `Qwen2-VL`, an image can be converted into a very large number of visual tokens. We limit the number of visual tokens to 1024 to maintain good training efficiency.
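
For reference, this 1024-token cap relates directly to the `MAX_PIXELS` value in the fine-tuning command above. Assuming the Qwen2-VL convention that one merged visual token covers a 28 × 28 pixel area, a pixel budget can be derived from a token budget; the sketch below is back-of-the-envelope arithmetic under that assumption, not an exact description of the image processor's resizing logic.

```python
# Rough relation between a visual-token budget and a MAX_PIXELS setting.
# Assumption: one merged Qwen2-VL visual token covers a 28 x 28 pixel area.
PIXELS_PER_TOKEN = 28 * 28  # 784


def max_pixels_for(token_budget: int) -> int:
    return token_budget * PIXELS_PER_TOKEN


print(max_pixels_for(1280))  # 1003520 -> the MAX_PIXELS used in the SWIFT command
print(max_pixels_for(1024))  # 802816  -> roughly the 1024-token cap mentioned above
```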