Update README.md
README.md CHANGED
@@ -5,7 +5,7 @@ license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
-- OpenGVLab/InternVL3-
+- OpenGVLab/InternVL3-8B-Instruct
 base_model_relation: finetune
 datasets:
 - OpenGVLab/MMPR-v1.2
@@ -15,7 +15,7 @@ tags:
 - internvl
 ---

-# InternVL3-
+# InternVL3-8B Transformers 🤗 Implementation

 [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)

@@ -27,7 +27,7 @@ tags:


 > [!IMPORTANT]
-> This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-
+> This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) model.
 > It is intended to be functionally equivalent to the original OpenGVLab release.
 > As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.

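For reference, the attention implementation mentioned in that note is selected when the model is loaded. Below is a minimal loading sketch, assuming the standard Transformers `from_pretrained` arguments (the argument values are illustrative and not part of this diff):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)

# attn_implementation can be "eager", "sdpa" (default with recent PyTorch),
# or "flash_attention_2" (requires the flash-attn package).
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
)
```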
@@ -39,7 +39,7 @@ Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose correspondin

 

-You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-
+You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B).

 ## Usage example

@@ -63,7 +63,7 @@ Here is how you can use the `image-text-to-text` pipeline to perform inference w
 ... },
 ... ]

->>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-
+>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
 >>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
 >>> outputs[0]["generated_text"]
 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
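The hunk above shows only the tail of the pipeline example. A self-contained sketch of the same call, with a placeholder image URL and prompt rather than the ones used in the README:

```python
from transformers import pipeline

# Placeholder conversation; swap in your own image URL and question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/flower.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
```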
@@ -80,7 +80,7 @@ This example demonstrates how to perform inference on a single image with the In
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

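The diff only touches the checkpoint line of the single-image example; the loading step is typically followed by building a chat, generating, and decoding. A sketch that reuses the `processor` and `model` from the snippet above (the message content and generation settings are illustrative, not taken from the README):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens.
answer = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```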
@@ -112,7 +112,7 @@ This example shows how to generate text using the InternVL model without providi
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -142,7 +142,7 @@ InternVL models also support batched image and text inputs.
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

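As with the single-image case, the batched example in the README continues past the lines shown here. A sketch of batched generation with padding, again reusing the loaded `processor` and `model` (the conversations themselves are placeholders):

```python
conversations = [
    [{"role": "user", "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]}],
    [{"role": "user", "content": [
        {"type": "text", "text": "Write a haiku about mountains."},
    ]}],
]

# Pad the shorter prompt so both sequences can be generated in one batch.
inputs = processor.apply_chat_template(
    conversations,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
decoded = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```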
@@ -186,7 +186,7 @@ This implementation of the InternVL models supports batched text-images inputs w
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -268,7 +268,7 @@ This example showcases how to handle a batch of chat conversations with interlea
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

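The interleaved example changed here also covers video inputs. A sketch of a single video turn, assuming `apply_chat_template` forwards a `num_frames` sampling argument to the video loader (the URL is a placeholder):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://example.com/clip.mp4"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    num_frames=8,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```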