Update README.md

README.md (changed)

Official implementation of ['SPHINX: A Mixer of Tasks, Domains, and Embeddings …']

Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!

<p align="left">
Github link: <a href="https://huggingface.co/Alpha-VLLM/SPHINX" target="_blank">Github</a> • 👋 join our <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/docs/wechat.md" target="_blank">WeChat</a>
</p>

## Introduction

<p align="left">
<img src="figs/pipeline1.png" width="100%"/> <br>
</p>

On top of SPHINX, we further propose to mix visual scales and sub-images to better capture fine-grained semantics in high-resolution images.

<p align="left">
<img src="figs/pipeline2.png" width="100%"/> <br>
</p>
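
To make the figure above concrete: a high-resolution input can be represented as one downsampled global view plus several sub-image crops, with every view fed to the visual encoders at their usual input resolution. The snippet below is only an illustrative sketch of that idea under assumed values (a 2×2 grid and a 224-pixel encoder input; the function name and preprocessing details are ours, not the repository's actual code):

```python
from PIL import Image

def mix_scales_and_subimages(image_path: str,
                             encoder_size: int = 224,
                             grid: int = 2):
    """Illustrative sketch: one low-resolution global view plus
    grid x grid sub-image crops, all at the encoder's input size.
    The grid size and 224-pixel resolution are assumptions."""
    img = Image.open(image_path).convert("RGB")

    # Work on a square canvas so the crops tile evenly.
    side = grid * encoder_size
    img = img.resize((side, side))

    # Global view: the whole image downsampled to the encoder resolution.
    views = [img.resize((encoder_size, encoder_size))]

    # Local views: non-overlapping sub-images kept at the higher scale.
    for row in range(grid):
        for col in range(grid):
            left, top = col * encoder_size, row * encoder_size
            views.append(img.crop((left, top,
                                   left + encoder_size,
                                   top + encoder_size)))
    return views  # 1 global + grid*grid local views

# Example: len(mix_scales_and_subimages("example.jpg")) == 5
```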

### Installation

SPHINX is built upon LLaMA2-Accessory. Please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.

## Inference

This section provides a step-by-step guide for hosting a local SPHINX demo. If you're already familiar with the LLaMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLaMA2-Accessory.

### Weights

We provide the beta-version checkpoints on [HuggingFace🤗](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/sphinx-sft). Please download them to your own machine. The file structure should appear as follows:
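
If you prefer to script the download instead of fetching the files manually from the link above, a minimal sketch with the `huggingface_hub` client could look like the following (using `huggingface_hub` is our assumption, and the local directory name is an arbitrary placeholder):

```python
from huggingface_hub import snapshot_download

# Sketch only: pull the sphinx-sft folder referenced above from the
# Alpha-VLLM/LLaMA2-Accessory repository; adjust local_dir as you like.
snapshot_download(
    repo_id="Alpha-VLLM/LLaMA2-Accessory",
    allow_patterns="finetune/mm/sphinx-sft/*",
    local_dir="./SPHINX-checkpoints",
)
```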

Explanation of each argument (a combined launch sketch follows this list):

+ `--tokenizer_path`: Path to the official LLaMA2 tokenizer. Note that the tokenizer file is the same for both LLaMA and LLaMA2. You may download it from [here](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/blob/main/config/tokenizer.model).
+ `--llama_type`: The model architecture of SPHINX is defined in [accessory/model/LLM/llama_ens.py](../accessory/model/LLM/llama_ens.py), and specifying `--llama_type=llama_ens` tells the demo program to use this architecture.
+ `--pretrained_path`: The path to the pre-trained checkpoint.
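
For reference, a launch sketch that wires these three arguments together might look like the following. Only the three flags explained above come from this README; the entry-point script name and the local paths are hypothetical placeholders to adapt to your own setup, not the repository's documented command:

```python
import subprocess

# Hypothetical launch sketch: the demo script name and paths below are
# placeholders; only the three flags themselves are documented above.
cmd = [
    "python", "demos/multi_turn_mm.py",        # placeholder entry point
    "--tokenizer_path", "./tokenizer.model",   # official LLaMA2 tokenizer
    "--llama_type", "llama_ens",               # SPHINX architecture (llama_ens.py)
    "--pretrained_path", "./SPHINX-checkpoints/finetune/mm/sphinx-sft",  # downloaded weights
]
subprocess.run(cmd, check=True)
```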

## Result

We provide a comprehensive evaluation of SPHINX and showcase results across multiple benchmarks.

Our evaluation encompasses both **quantitative metrics** and **qualitative assessments**, providing a holistic understanding of our VLM's performance.

**Evaluation Prompt Design**

<p align="left">
<img src="figs/table1.png" width="100%"/> <br>
</p>

* In evaluation, we prioritize aligning with each benchmark's desired output format.
* We employ distinct prompts tailored to benchmarks that require long answers, short answers, or multiple-choice responses (sketched schematically after this list).
* For tasks involving visual grounding, we directly utilize the prompts used during training to enhance the model's performance on these particular challenges.
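
The exact prompts are the ones shown in Table 1 above. Purely as a schematic illustration of format-specific prompting, the mapping could be organized as below (the suffix strings and grouping are illustrative placeholders, not the prompts actually used):

```python
# Schematic only: per-answer-format instruction suffixes appended to the
# raw question. The real prompts for each benchmark are listed in Table 1.
EVAL_PROMPT_SUFFIX = {
    "short_answer": "Answer the question using a single word or phrase.",
    "long_answer": "Answer the question in detail.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
}

def build_eval_prompt(question: str, answer_format: str) -> str:
    """Append the format-specific instruction to the raw question."""
    return f"{question}\n{EVAL_PROMPT_SUFFIX[answer_format]}"

# Example: build_eval_prompt("What is the man holding?", "short_answer")
```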

**Benchmarks on Multimodal Large Language Models**

<p align="left">
<img src="figs/table2.png" width="100%"/> <br>
</p>

* We test our model on recently proposed VQA-based MLLM benchmarks that comprehensively evaluate the model's characteristics, including MME, SEED-Bench, POPE, LLaVA-Bench (In-the-Wild), MM-Vet, MathVista, MMBench, and CCBench.
* Long-SPHINX achieves new state-of-the-art results on 5 out of 9 benchmarks.

**Visual Question Answering**

<p align="left">
<img src="figs/table3.png" width="100%"/> <br>
</p>

* We evaluate general VQA benchmarks such as VQAv2, OK-VQA, GQA, VizWiz, ScienceQA, Visual Spatial Reasoning (VSR), and IconQA.
* Additionally, we conduct experiments on text-oriented VQA benchmarks such as TextVQA and OCR-VQA.
* Long-SPHINX achieves competitive results across all benchmarks. We observe that Long-SPHINX outperforms SPHINX on VQA datasets that demand fine-grained visual information, showcasing the effectiveness of our mixing of visual scales and sub-images for achieving high resolution without relying on a visual encoder trained specifically on high-resolution images.

**Visual Grounding**

<p align="left">
<img src="figs/table4.png" width="100%"/> <br>
</p>

* Table 4 reports results for the SPHINX model and baseline models on REC benchmarks.
* SPHINX exhibits robust performance in visual grounding tasks such as RefCOCO, RefCOCO+, and RefCOCOg, **surpassing other vision-language generalist models**.
* Notably, SPHINX outperforms the specialist model G-DINO-L by **more than 1.54%** in accuracy across all tasks within RefCOCO/RefCOCO+/RefCOCOg.
|