---
license: other
language:
- ja
base_model:
- tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: visual-question-answering

---

# Llama-3.1-70B-Instruct-multimodal-JP-Graph - Built with Llama

Llama-3.1-70B-Instruct-multimodal-JP-Graph is a Japanese large vision-language model.
It is based on [Llama-3.1-Swallow-70B](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3) as the language model and on the image encoder of [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

# How to use
### 1. Install LLaVA-NeXT

- First, install LLaVA-NeXT by following the instructions in the [LLaVA-NeXT repository](https://github.com/LLaVA-VL/LLaVA-NeXT).

```sh
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```

### 2. Install dependencies
```sh
pip install flash-attn==2.6.3
pip install transformers==4.45.2
```
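
To confirm that the active environment actually has the pinned versions, a small check like the one below may help. It is a minimal sketch, not part of the model's tooling: the `check_pins` helper and the `PINNED` table are hypothetical names introduced here, and it assumes the installed distributions report exactly the version strings used in the `pip install` commands above.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {"flash-attn": "2.6.3", "transformers": "4.45.2"}


def check_pins(pins=PINNED, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `found` is None when the package is not installed.
    `get_version` is injectable so the check can be exercised without
    touching the real environment.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = check_pins()
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned versions match.")
```

Running it inside the `llava` conda environment should list any package whose installed version differs from the pin.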

### 3. Modify LLaVA-NeXT
- Modify the LLaVA-NeXT code as follows.
  - Create the `LLaVA-NeXT/llava/model/multimodal_encoder/qwen2_vl` directory and copy the contents of the attached `qwen2_vl` directory into it.
  - Overwrite `LLaVA-NeXT/llava/model/multimodal_encoder/builder.py` with the attached `builder.py`.
  - Copy the attached `qwen2vl_encoder.py` into `LLaVA-NeXT/llava/model/multimodal_encoder/`.
  - Overwrite `LLaVA-NeXT/llava/model/language_model/llava_llama.py` with the attached `llava_llama.py`.
  - Overwrite `LLaVA-NeXT/llava/model/llava_arch.py` with the attached `llava_arch.py`.
  - Overwrite `LLaVA-NeXT/llava/conversation.py` with the attached `conversation.py`.
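
The copy and overwrite steps above can also be scripted. The following is a minimal sketch under stated assumptions: the attached files are assumed to have been downloaded into a single local directory that contains the `qwen2_vl` folder next to the five `.py` files, and `apply_modifications`, `FILE_MAP`, and `DIR_MAP` are hypothetical names introduced for illustration. The destination paths are the ones listed in the steps above.

```python
import shutil
from pathlib import Path

# (source relative to the attached files, destination relative to the LLaVA-NeXT checkout)
FILE_MAP = [
    ("builder.py", "llava/model/multimodal_encoder/builder.py"),
    ("qwen2vl_encoder.py", "llava/model/multimodal_encoder/qwen2vl_encoder.py"),
    ("llava_llama.py", "llava/model/language_model/llava_llama.py"),
    ("llava_arch.py", "llava/model/llava_arch.py"),
    ("conversation.py", "llava/conversation.py"),
]
DIR_MAP = [("qwen2_vl", "llava/model/multimodal_encoder/qwen2_vl")]


def apply_modifications(attached_dir, llava_root):
    """Copy the attached files into a LLaVA-NeXT checkout.

    attached_dir: assumed local folder holding the attached files.
    llava_root:   root of the cloned LLaVA-NeXT repository.
    Returns the list of destination paths that were written.
    """
    attached, root = Path(attached_dir), Path(llava_root)
    written = []
    for src, dst in DIR_MAP:
        target = root / dst
        # Creates the directory if needed and merges into it if it exists.
        shutil.copytree(attached / src, target, dirs_exist_ok=True)
        written.append(target)
    for src, dst in FILE_MAP:
        target = root / dst
        target.parent.mkdir(parents=True, exist_ok=True)
        # Overwrites the existing file, or creates it if absent.
        shutil.copy2(attached / src, target)
        written.append(target)
    return written
```

For example, `apply_modifications("attached_files", "LLaVA-NeXT")` would apply all six changes in one call, where `attached_files` is a placeholder for wherever the attached files were saved. Because the script overwrites files in place, it is worth running it on a clean checkout so the changes can be reviewed with `git diff`.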

### 4. Inference
The following script loads the model and runs inference.