[email protected]
commited on
Commit
·
b988ade
1
Parent(s):
732e074
Update readme
Browse files
README.md
CHANGED
|
@@ -22,7 +22,7 @@ library_name: transformers
|
|
| 22 |
|
| 23 |
Today (September 17th, 2024), we introduce [NVLM 1.0](https://arxiv.org/abs/2409.11402), a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.
In this repo, we are open-sourcing NVLM-1.0-D-72B (decoder-only architecture): the model weights and code for the community.
## Other Resources
[Inference Code (HF)](https://huggingface.co/nvidia/NVLM-D-72B/tree/main)   [Training Code (Coming soon)]()   [Website](https://research.nvidia.com/labs/adlr/NVLM-1/)   [Paper](https://arxiv.org/abs/2409.11402)
## How to use
When converting the Megatron checkpoint to Hugging Face, we adapt the [InternVL codebase](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) to support model loading and multi-GPU inference in HF.
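One concrete piece of that adaptation is computing a `device_map` that spreads the language model's transformer layers across GPUs while pinning the vision encoder to GPU 0. Below is a minimal sketch in the style of the InternVL loading code; the module names (`language_model.model.layers`, `vision_model`, `mlp1`) and the layer/GPU counts are assumptions for illustration, not guaranteed to match this checkpoint exactly.

```python
import math

def split_model(num_layers=80, num_gpus=8):
    """Build an InternVL-style device_map for multi-GPU inference.
    The vision encoder, projector, embeddings, and final norm/head stay
    on GPU 0, so GPU 0 is assigned roughly half as many LLM layers."""
    device_map = {}
    # Treat GPU 0 as half a GPU for LLM layers, since it also hosts the ViT.
    per_gpu = math.ceil(num_layers / (num_gpus - 0.5))
    layers_per_gpu = [per_gpu] * num_gpus
    layers_per_gpu[0] = math.ceil(per_gpu * 0.5)
    layer = 0
    for gpu, count in enumerate(layers_per_gpu):
        for _ in range(count):
            if layer == num_layers:
                break
            device_map[f"language_model.model.layers.{layer}"] = gpu
            layer += 1
    # Everything that interacts with the image features lives on GPU 0.
    for name in ("vision_model", "mlp1",
                 "language_model.model.embed_tokens",
                 "language_model.model.norm",
                 "language_model.lm_head"):
        device_map[name] = 0
    return device_map
```

The resulting dict can then be passed as `device_map=split_model()` to `AutoModel.from_pretrained(..., trust_remote_code=True)`.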
When adapting the tokenizer to Hugging Face, we use the tokenizer from [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/tree/main), as it contains extra special tokens for vision tasks, e.g., `<|vision_pad|>`.
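Such tokens let the prompt reserve space for image embeddings: before tokenization, a single `<image>` placeholder is expanded into a run of vision context tokens, one run per image tile. A minimal sketch of that expansion, in the style of the InternVL pipeline; the placeholder name, wrapper tags, and tokens-per-tile count are illustrative assumptions, not NVLM's exact values.

```python
def expand_image_placeholder(prompt, num_tiles, tokens_per_tile=256,
                             context_token="<|vision_pad|>"):
    """Replace the first <image> placeholder with the per-tile vision
    context tokens that the image embeddings will be scattered into.
    Token names and counts here are illustrative only."""
    image_span = "<img>" + context_token * (num_tiles * tokens_per_tile) + "</img>"
    return prompt.replace("<image>", image_span, 1)
```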
We train NVLM-1.0-D-72B based on the [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct/tree/main) text-only LLM and the [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) vision encoder, using our large-scale, high-quality multimodal dataset.
For training code, please refer to [Megatron-LM (Coming soon)]().
### Prepare the environment
## License
The use of this model is governed by the [cc-by-nc-4.0](https://spdx.org/licenses/CC-BY-NC-4.0) license.