Update README.md
README.md CHANGED
@@ -5,7 +5,7 @@ license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
-- OpenGVLab/InternVL3-
+- OpenGVLab/InternVL3-8B-Instruct
 base_model_relation: finetune
 datasets:
 - OpenGVLab/MMPR-v1.2
@@ -15,7 +15,7 @@ tags:
 - internvl
 ---

-# InternVL3-
+# InternVL3-8B Transformers 🤗 Implementation

 [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)

@@ -27,7 +27,7 @@ tags:


 > [!IMPORTANT]
-> This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-
+> This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) model.
 > It is intended to be functionally equivalent to the original OpenGVLab release.
 > As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.

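For reference, the attention implementation mentioned in that note is selected when the model is loaded. Below is a minimal loading sketch, assuming the standard Transformers `from_pretrained` arguments (the argument values are illustrative and not part of this diff):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)

# attn_implementation can be "eager", "sdpa" (default with recent PyTorch),
# or "flash_attention_2" (requires the flash-attn package).
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
)
```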
@@ -39,7 +39,7 @@ Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose correspondin

 

-You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-
+You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B).

 ## Usage example

@@ -63,7 +63,7 @@ Here is how you can use the `image-text-to-text` pipeline to perform inference w
 ... },
 ... ]

->>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-
+>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
 >>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
 >>> outputs[0]["generated_text"]
 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
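The hunk above shows only the tail of the pipeline example. A self-contained sketch of the same call, with a placeholder image URL and prompt rather than the ones used in the README:

```python
from transformers import pipeline

# Placeholder conversation; swap in your own image URL and question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/flower.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
```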
@@ -80,7 +80,7 @@ This example demonstrates how to perform inference on a single image with the In
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

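The diff only touches the checkpoint line of the single-image example; the loading step is typically followed by building a chat, generating, and decoding. A sketch that reuses the `processor` and `model` from the snippet above (the message content and generation settings are illustrative, not taken from the README):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens.
answer = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```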
@@ -112,7 +112,7 @@ This example shows how to generate text using the InternVL model without providi
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -142,7 +142,7 @@ InternVL models also support batched image and text inputs.
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

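As with the single-image case, the batched example in the README continues past the lines shown here. A sketch of batched generation with padding, again reusing the loaded `processor` and `model` (the conversations themselves are placeholders):

```python
conversations = [
    [{"role": "user", "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]}],
    [{"role": "user", "content": [
        {"type": "text", "text": "Write a haiku about mountains."},
    ]}],
]

# Pad the shorter prompt so both sequences can be generated in one batch.
inputs = processor.apply_chat_template(
    conversations,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
decoded = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```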
@@ -186,7 +186,7 @@ This implementation of the InternVL models supports batched text-images inputs w
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -268,7 +268,7 @@ This example showcases how to handle a batch of chat conversations with interlea
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

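The interleaved example changed here also covers video inputs. A sketch of a single video turn, assuming `apply_chat_template` forwards a `num_frames` sampling argument to the video loader (the URL is a placeholder):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://example.com/clip.mp4"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    num_frames=8,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```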