# Model Card for InternVL-Chat-V1.2-Plus

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/X8AXMkOlKeUpNcoJIXKna.webp" alt="Image Description" width="300" height="300">
</p>

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]

InternVL-Chat-V1.2-Plus uses the same model architecture as [InternVL-Chat-V1.2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2); the difference lies in the SFT dataset. InternVL-Chat-V1.2 uses an SFT dataset with only 1.2M samples, while **our plus version employs an SFT dataset with 12M samples**.

<p align="center">
<img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
</p>

## Model Details
- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
  - Image size: 448 x 448 (256 tokens)
  - Params: 40B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: MLP
    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
    - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we use a pixel shuffle that merges the ViT's 1024 tokens into 256 tokens (see the sketch after this list).
  - Supervised Finetuning Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: 12 million SFT samples.
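
For illustration, here is a minimal sketch of that pixel-shuffle token reduction. The shapes follow the stats above (a 448 x 448 image gives a 32 x 32 grid of 1024 ViT tokens, merged to 256); the helper is a hypothetical reconstruction of the idea, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def pixel_shuffle_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Fold each 2x2 block of visual tokens into the channel dim:
    (N, 1024, C) on a 32x32 grid -> (N, 256, 4*C)."""
    n, l, c = tokens.shape
    h = w = int(l ** 0.5)                           # 32 x 32 token grid
    x = tokens.transpose(1, 2).reshape(n, c, h, w)  # (N, C, 32, 32)
    x = F.pixel_unshuffle(x, downscale_factor=2)    # (N, 4*C, 16, 16)
    return x.flatten(2).transpose(1, 2)             # (N, 256, 4*C)

vit_tokens = torch.randn(1, 1024, 3200)   # InternViT-6B hidden size is 3200
merged = pixel_shuffle_tokens(vit_tokens)
print(merged.shape)  # torch.Size([1, 256, 12800]); the MLP then projects these into the LLM's embedding space
```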

## Released Models

| Model | Vision Foundation Model | Release Date | Note |
| :---: | :---: | :---: | :--- |
| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; super strong OCR; approaches the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, and MathVista (🔥new) |
| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |

## Performance

\* Proprietary Model † Training Set Observed

- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA results have been corrected.

## Model Usage
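
Pending the full usage section, here is a minimal loading-and-chat sketch. It assumes the standard `transformers` remote-code interface that InternVL checkpoints expose (`AutoModel.from_pretrained(..., trust_remote_code=True)` plus a `model.chat(...)` helper); treat the exact signatures and the image path as assumptions and check the repository for the canonical example.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2-Plus"
# bfloat16 on GPU; the 40B model needs multiple GPUs or aggressive offloading.
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# 448 x 448 matches the image size listed in Model Stats above.
image = Image.open("examples/image1.jpg").convert("RGB").resize((448, 448))  # placeholder path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```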