# Model Card for InternVL-Chat-V1.2-Plus

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/X8AXMkOlKeUpNcoJIXKna.webp" alt="Image Description" width="300" height="300">
</p>

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]

InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2); the difference lies in the SFT dataset. InternVL-Chat-V1.2 uses an SFT dataset with only 1.2M samples, while **our plus version employs an SFT dataset with 12M samples**.

<p align="center">
<img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
</p>

## Model Details
- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
  - Image size: 448 x 448 (256 tokens)
  - Params: 40B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: MLP
    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
    - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we use a pixel shuffle that merges the ViT's 1024 tokens into 256 tokens (see the sketch after this list).
  - Supervised Finetuning Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: 12 million SFT samples.
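
For illustration, here is a minimal sketch of that pixel-shuffle token reduction. The shapes follow the stats above (a 448 x 448 image gives a 32 x 32 grid of 1024 ViT tokens, merged to 256); the helper is a hypothetical reconstruction of the idea, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def pixel_shuffle_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Fold each 2x2 block of visual tokens into the channel dim:
    (N, 1024, C) on a 32x32 grid -> (N, 256, 4*C)."""
    n, l, c = tokens.shape
    h = w = int(l ** 0.5)                           # 32 x 32 token grid
    x = tokens.transpose(1, 2).reshape(n, c, h, w)  # (N, C, 32, 32)
    x = F.pixel_unshuffle(x, downscale_factor=2)    # (N, 4*C, 16, 16)
    return x.flatten(2).transpose(1, 2)             # (N, 256, 4*C)

vit_tokens = torch.randn(1, 1024, 3200)   # InternViT-6B hidden size is 3200
merged = pixel_shuffle_tokens(vit_tokens)
print(merged.shape)  # torch.Size([1, 256, 12800]); the MLP then projects these into the LLM's embedding space
```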

## Released Models

| Model | Vision Foundation Model | Release Date | Note |
| :---: | :---: | :---: | :--- |
| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; super strong OCR; approaches the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, and MathVista (🔥new) |
| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |

## Performance

\* Proprietary Model † Training Set Observed

- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA results have been corrected.

## Model Usage
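
Pending the full usage section, here is a minimal loading-and-chat sketch. It assumes the standard `transformers` remote-code interface that InternVL checkpoints expose (`AutoModel.from_pretrained(..., trust_remote_code=True)` plus a `model.chat(...)` helper); treat the exact signatures and the image path as assumptions and check the repository for the canonical example.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2-Plus"
# bfloat16 on GPU; the 40B model needs multiple GPUs or aggressive offloading.
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# 448 x 448 matches the image size listed in Model Stats above.
image = Image.open("examples/image1.jpg").convert("RGB").resize((448, 448))  # placeholder path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```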