Support fine-tuning (#7)
- Support fine-tuning (d94acaf1a57be2532adcb39c31836b80a21c043b)
Co-authored-by: tastelikefeet <[email protected]>
README.md
CHANGED
@@ -3762,6 +3762,48 @@ The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) English t

**More detailed experimental results can be found in the [paper](http://arxiv.org/abs/2412.16855)**.

## Community support

### Fine-tuning

GME models can be fine-tuned with SWIFT (the `ms-swift` package):

```shell
pip install ms-swift -U
```

```shell
# Set MAX_PIXELS to reduce GPU memory usage.
# See: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
nproc_per_node=8
MAX_PIXELS=1003520 \
USE_HF=1 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model Alibaba-NLP/gme-Qwen2-VL-7B-Instruct \
    --train_type lora \
    --dataset 'HuggingFaceM4/TextCaps:emb' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps $(expr 64 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --eval_strategy steps \
    --save_total_limit 5 \
    --logging_steps 5 \
    --output_dir output \
    --lazy_tokenize true \
    --warmup_ratio 0.05 \
    --learning_rate 5e-6 \
    --deepspeed zero3 \
    --dataloader_num_workers 4 \
    --task_type embedding \
    --loss_type infonce \
    --dataloader_drop_last true
```

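A couple of notes on the command above. The effective batch size works out to be independent of the GPU count: `per_device_train_batch_size` (2) × `NPROC_PER_NODE` × `gradient_accumulation_steps` (64 / `NPROC_PER_NODE`) = 128 examples per optimizer step. The `--loss_type infonce` option trains the embedding with an InfoNCE-style contrastive objective, where each query is scored against its paired positive and the other in-batch candidates. The snippet below is a minimal PyTorch sketch of that objective for intuition only; it is not SWIFT's implementation, and the embedding dimension and temperature are illustrative assumptions.

```python
# Minimal sketch of an InfoNCE-style contrastive loss for embedding training.
# Illustrative only: not SWIFT's implementation; shapes and temperature are assumptions.
import torch
import torch.nn.functional as F


def infonce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive for query i.
    All other rows in the batch act as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # the positive sits on the diagonal
    return F.cross_entropy(logits, labels)


# Toy usage with random vectors standing in for GME query/document embeddings.
queries = torch.randn(4, 1536)
docs = torch.randn(4, 1536)
print(infonce_loss(queries, docs))
```
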
## Limitations

- **Single Image Input**: In `Qwen2-VL`, an image can be converted into a very large number of visual tokens. We limit the number of visual tokens to 1024 to maintain good training efficiency.
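
For reference, this 1024-token cap relates directly to the `MAX_PIXELS` value in the fine-tuning command above. Assuming the Qwen2-VL convention that one merged visual token covers a 28 × 28 pixel area, a pixel budget can be derived from a token budget; the sketch below is back-of-the-envelope arithmetic under that assumption, not an exact description of the image processor's resizing logic.

```python
# Rough relation between a visual-token budget and a MAX_PIXELS setting.
# Assumption: one merged Qwen2-VL visual token covers a 28 x 28 pixel area.
PIXELS_PER_TOKEN = 28 * 28  # 784


def max_pixels_for(token_budget: int) -> int:
    return token_budget * PIXELS_PER_TOKEN


print(max_pixels_for(1280))  # 1003520 -> the MAX_PIXELS used in the SWIFT command
print(max_pixels_for(1024))  # 802816  -> roughly the 1024-token cap mentioned above
```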