Update README.md

README.md (changed)

Official implementation of ['SPHINX: A Mixer of Tasks, Domains, and Embeddings …']

Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!

<p align="left">
Github link: <a href="https://huggingface.co/Alpha-VLLM/SPHINX" target="_blank">Github</a> • 👋 join our <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/docs/wechat.md" target="_blank">WeChat</a>
</p>

## Introduction

<p align="left">
<img src="figs/pipeline1.png" width="100%"/> <br>
</p>

On top of SPHINX, we further propose to mix visual scales and sub-images to better capture fine-grained semantics in high-resolution images.

<p align="left">
<img src="figs/pipeline2.png" width="100%"/> <br>
</p>
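
To make the figure above concrete: a high-resolution input can be represented as one downsampled global view plus several sub-image crops, with every view fed to the visual encoders at their usual input resolution. The snippet below is only an illustrative sketch of that idea under assumed values (a 2×2 grid and a 224-pixel encoder input; the function name and preprocessing details are ours, not the repository's actual code):

```python
from PIL import Image

def mix_scales_and_subimages(image_path: str,
                             encoder_size: int = 224,
                             grid: int = 2):
    """Illustrative sketch: one low-resolution global view plus
    grid x grid sub-image crops, all at the encoder's input size.
    The grid size and 224-pixel resolution are assumptions."""
    img = Image.open(image_path).convert("RGB")

    # Work on a square canvas so the crops tile evenly.
    side = grid * encoder_size
    img = img.resize((side, side))

    # Global view: the whole image downsampled to the encoder resolution.
    views = [img.resize((encoder_size, encoder_size))]

    # Local views: non-overlapping sub-images kept at the higher scale.
    for row in range(grid):
        for col in range(grid):
            left, top = col * encoder_size, row * encoder_size
            views.append(img.crop((left, top,
                                   left + encoder_size,
                                   top + encoder_size)))
    return views  # 1 global + grid*grid local views

# Example: len(mix_scales_and_subimages("example.jpg")) == 5
```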

### Installation

SPHINX is built upon LLaMA2-Accessory. Please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.

## Inference

This section provides a step-by-step guide for hosting a local SPHINX demo. If you're already familiar with the LLaMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLaMA2-Accessory.

### Weights

We provide the beta-version checkpoints on [HuggingFace🤗](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/sphinx-sft). Please download them to your own machine. The file structure should appear as follows:
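
If you prefer to script the download instead of fetching the files manually from the link above, a minimal sketch with the `huggingface_hub` client could look like the following (using `huggingface_hub` is our assumption, and the local directory name is an arbitrary placeholder):

```python
from huggingface_hub import snapshot_download

# Sketch only: pull the sphinx-sft folder referenced above from the
# Alpha-VLLM/LLaMA2-Accessory repository; adjust local_dir as you like.
snapshot_download(
    repo_id="Alpha-VLLM/LLaMA2-Accessory",
    allow_patterns="finetune/mm/sphinx-sft/*",
    local_dir="./SPHINX-checkpoints",
)
```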

Explanation of each argument (a combined launch sketch follows this list):

+ `--tokenizer_path`: Path to the official LLaMA2 tokenizer. Note that the tokenizer file is the same for both LLaMA and LLaMA2. You may download it from [here](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/blob/main/config/tokenizer.model).
+ `--llama_type`: The model architecture of SPHINX is defined in [accessory/model/LLM/llama_ens.py](../accessory/model/LLM/llama_ens.py), and specifying `--llama_type=llama_ens` tells the demo program to use this architecture.
+ `--pretrained_path`: The path to the pre-trained checkpoint.
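
For reference, a launch sketch that wires these three arguments together might look like the following. Only the three flags explained above come from this README; the entry-point script name and the local paths are hypothetical placeholders to adapt to your own setup, not the repository's documented command:

```python
import subprocess

# Hypothetical launch sketch: the demo script name and paths below are
# placeholders; only the three flags themselves are documented above.
cmd = [
    "python", "demos/multi_turn_mm.py",        # placeholder entry point
    "--tokenizer_path", "./tokenizer.model",   # official LLaMA2 tokenizer
    "--llama_type", "llama_ens",               # SPHINX architecture (llama_ens.py)
    "--pretrained_path", "./SPHINX-checkpoints/finetune/mm/sphinx-sft",  # downloaded weights
]
subprocess.run(cmd, check=True)
```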

## Result

We provide a comprehensive evaluation of SPHINX and showcase results across multiple benchmarks.

Our evaluation encompasses both **quantitative metrics** and **qualitative assessments**, providing a holistic understanding of our VLM's performance.

**Evaluation Prompt Design**

<p align="left">
<img src="figs/table1.png" width="100%"/> <br>
</p>

* In evaluation, we prioritize aligning with each benchmark's desired output format.
* We employ distinct prompts tailored to benchmarks that require long answers, short answers, or multiple-choice responses (sketched schematically after this list).
* For tasks involving visual grounding, we directly utilize the prompts used during training to enhance the model's performance on these particular challenges.
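
The exact prompts are the ones shown in Table 1 above. Purely as a schematic illustration of format-specific prompting, the mapping could be organized as below (the suffix strings and grouping are illustrative placeholders, not the prompts actually used):

```python
# Schematic only: per-answer-format instruction suffixes appended to the
# raw question. The real prompts for each benchmark are listed in Table 1.
EVAL_PROMPT_SUFFIX = {
    "short_answer": "Answer the question using a single word or phrase.",
    "long_answer": "Answer the question in detail.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
}

def build_eval_prompt(question: str, answer_format: str) -> str:
    """Append the format-specific instruction to the raw question."""
    return f"{question}\n{EVAL_PROMPT_SUFFIX[answer_format]}"

# Example: build_eval_prompt("What is the man holding?", "short_answer")
```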

**Benchmarks on Multimodal Large Language Models**

<p align="left">
<img src="figs/table2.png" width="100%"/> <br>
</p>

* We test our model on recently proposed VQA-based MLLM benchmarks that comprehensively evaluate the model's characteristics, including MME, SEED-Bench, POPE, LLaVA-Bench (In-the-Wild), MM-Vet, MathVista, MMBench, and CCBench.
* Long-SPHINX achieves new state-of-the-art results on 5 out of 9 benchmarks.

**Visual Question Answering**

<p align="left">
<img src="figs/table3.png" width="100%"/> <br>
</p>

* We evaluate general VQA benchmarks such as VQAv2, OK-VQA, GQA, VizWiz, ScienceQA, Visual Spatial Reasoning (VSR), and IconQA.
* Additionally, we conduct experiments on text-oriented VQA benchmarks such as TextVQA and OCR-VQA.
* Long-SPHINX achieves competitive results across all benchmarks. We observe that Long-SPHINX outperforms SPHINX on VQA datasets that demand fine-grained visual information, showcasing the effectiveness of our mixing of visual scales and sub-images for achieving high resolution without relying on a visual encoder trained specifically on high-resolution images.

**Visual Grounding**

<p align="left">
<img src="figs/table4.png" width="100%"/> <br>
</p>

* Table 4 reports results for the SPHINX model and baseline models on REC benchmarks.
* SPHINX exhibits robust performance in visual grounding tasks such as RefCOCO, RefCOCO+, and RefCOCOg, **surpassing other vision-language generalist models**.
* Notably, SPHINX outperforms the specialist model G-DINO-L by **more than 1.54%** in accuracy across all tasks within RefCOCO/RefCOCO+/RefCOCOg.
|