Update README.md
README.md
CHANGED
@@ -9,10 +9,9 @@ library_name: transformers
 Recipes for shrinking, optimizing, customizing cutting edge vision and multimodal AI models. Original GH repository is [here](https://github.com/merveenoyan/smol-vision) migrated to Hugging Face since notebooks there aren't rendered 🥲
 
 Latest examples 👇🏻
-- [Fine-tuning SmolVLM2 on Video Captioning](https://huggingface.co/merve/smol-vision/blob/main/Fine_tune_SmolVLM2_on_Video.ipynb)
-- [Multimodal RAG using ColPali and Qwen2-VL](https://huggingface.co/merve/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb)
 - [Fine-tune ColPali for Multimodal RAG](https://huggingface.co/merve/smol-vision/blob/main/Finetune_ColPali.ipynb)
-
+- [Fine-tune Gemma-3n for all modalities (audio-text-image)](https://huggingface.co/merve/smol-vision/blob/main/Gemma3n_Fine_tuning_on_All_Modalities.ipynb)
+- [Any-to-Any (Video) RAG with OmniEmbed and Qwen](https://huggingface.co/merve/smol-vision/blob/main/Any_to_Any_RAG.ipynb)
 **Note**: The script and notebook are updated to fix few issues related to QLoRA!
 
 | | Notebook | Description |
@@ -28,5 +27,7 @@ Latest examples 👇🏻
 | VLM Fine-tuning (Script) | [QLoRA Fine-tune IDEFICS3 on VQAv2](https://huggingface.co/merve/smol-vision/blob/main/smolvlm.py) | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset |
 | Multimodal RAG | [Multimodal RAG using ColPali and Qwen2-VL](https://huggingface.co/merve/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb) | Learn to retrieve documents and pipeline to RAG without hefty document processing using ColPali through Byaldi and do the generation with Qwen2-VL |
 | Multimodal Retriever Fine-tuning | [Fine-tune ColPali for Multimodal RAG](https://huggingface.co/merve/smol-vision/blob/main/Finetune_ColPali.ipynb) | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case |
+| VLM Fine-tuning | [Fine-tune Gemma-3n for all modalities (audio-text-image)](https://huggingface.co/merve/smol-vision/blob/main/Gemma3n_Fine_tuning_on_All_Modalities.ipynb) | Fine-tune Gemma-3n model to handle any modality: audio, text, and image. |
+| Multimodal RAG | [Any-to-Any (Video) RAG with OmniEmbed and Qwen](https://huggingface.co/merve/smol-vision/blob/main/Any_to_Any_RAG.ipynb) | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
 | Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with text-generation inference |
-| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum |
+| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum |