Update README.md
README.md CHANGED

```diff
@@ -31,7 +31,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
 
 - **Training Strategy:**
   - Pretraining Stage
-    - Learnable Component: InternViT-6B
+    - Learnable Component: InternViT-6B + MLP
     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
     - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
   - SFT Stage
```
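The Note in the diff mentions two concrete operations: interpolating the InternViT-6B-224px position embeddings up to a 448 x 448 input, and a pixel shuffle that folds 1024 visual tokens into 256. Below is a minimal PyTorch sketch of both, not the repository's actual code; the function names, the 14-pixel patch size (so 224 px gives a 16 x 16 token grid and 448 px gives a 32 x 32 grid), and the 3200-dim hidden size are assumptions for illustration.

```python
# Illustrative sketch only; names and shapes are assumed, not taken from the repo.
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int = 16, new_grid: int = 32) -> torch.Tensor:
    """Resize patch position embeddings from a 16x16 grid (224 px) to a
    32x32 grid (448 px). pos_embed: (1, old_grid*old_grid, dim); any
    class-token embedding is assumed to be handled separately."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, 16, 16)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)       # (1, 1024, dim)


def pixel_shuffle_tokens(x: torch.Tensor, grid: int = 32) -> torch.Tensor:
    """Merge each 2x2 block of visual tokens into one token with 4x channels:
    (B, grid*grid, dim) -> (B, (grid/2)**2, 4*dim), e.g. 1024 -> 256 tokens."""
    b, n, d = x.shape
    half = grid // 2
    x = x.reshape(b, half, 2, half, 2, d)              # split rows/cols into 2x2 blocks
    x = x.permute(0, 1, 3, 2, 4, 5)                    # (B, half, half, 2, 2, dim)
    return x.reshape(b, half * half, 4 * d)            # stack each block along channels


pos = torch.randn(1, 16 * 16, 3200)                    # 3200 = assumed InternViT-6B hidden size
tokens = torch.randn(1, 32 * 32, 3200)
print(interpolate_pos_embed(pos).shape)                # torch.Size([1, 1024, 3200])
print(pixel_shuffle_tokens(tokens).shape)              # torch.Size([1, 256, 12800])
```

Folding each 2 x 2 block of tokens into one token with four times the channels quarters the sequence length handed to the language model while retaining the spatial detail in the channel dimension, which matches the stated motivation of reducing the number of visual tokens.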