Update README.md
README.md CHANGED

```diff
@@ -31,7 +31,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
 
 - **Training Strategy:**
   - Pretraining Stage
-    - Learnable Component: InternViT-6B
+    - Learnable Component: InternViT-6B + MLP
     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
     - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
   - SFT Stage
```
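The Note in the diff mentions two concrete operations: interpolating the InternViT-6B-224px position embeddings up to a 448 x 448 input, and a pixel shuffle that folds 1024 visual tokens into 256. Below is a minimal PyTorch sketch of both, not the repository's actual code; the function names, the 14-pixel patch size (so 224 px gives a 16 x 16 token grid and 448 px gives a 32 x 32 grid), and the 3200-dim hidden size are assumptions for illustration.

```python
# Illustrative sketch only; names and shapes are assumed, not taken from the repo.
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int = 16, new_grid: int = 32) -> torch.Tensor:
    """Resize patch position embeddings from a 16x16 grid (224 px) to a
    32x32 grid (448 px). pos_embed: (1, old_grid*old_grid, dim); any
    class-token embedding is assumed to be handled separately."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, 16, 16)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)       # (1, 1024, dim)


def pixel_shuffle_tokens(x: torch.Tensor, grid: int = 32) -> torch.Tensor:
    """Merge each 2x2 block of visual tokens into one token with 4x channels:
    (B, grid*grid, dim) -> (B, (grid/2)**2, 4*dim), e.g. 1024 -> 256 tokens."""
    b, n, d = x.shape
    half = grid // 2
    x = x.reshape(b, half, 2, half, 2, d)              # split rows/cols into 2x2 blocks
    x = x.permute(0, 1, 3, 2, 4, 5)                    # (B, half, half, 2, 2, dim)
    return x.reshape(b, half * half, 4 * d)            # stack each block along channels


pos = torch.randn(1, 16 * 16, 3200)                    # 3200 = assumed InternViT-6B hidden size
tokens = torch.randn(1, 32 * 32, 3200)
print(interpolate_pos_embed(pos).shape)                # torch.Size([1, 1024, 3200])
print(pixel_shuffle_tokens(tokens).shape)              # torch.Size([1, 256, 12800])
```

Folding each 2 x 2 block of tokens into one token with four times the channels quarters the sequence length handed to the language model while retaining the spatial detail in the channel dimension, which matches the stated motivation of reducing the number of visual tokens.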