---
license: mit
datasets:
- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
- liuhaotian/LLaVA-Pretrain
language:
- en
metrics:
- accuracy
base_model:
- microsoft/bitnet-b1.58-2B-4T
pipeline_tag: image-text-to-text
tags:
- 1-bit
- VLA
- VLM
---
# BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

[[paper]]() [[model]](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e) [[code]](https://github.com/ustcwhy/BitVLA)

- June 2025: [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation]()

## Open Source Plan

- ✅ Paper, pre-trained VLM, and evaluation code.
- 🧭 Fine-tuned VLA models, pre-training and fine-tuning code.
- 🧭 Pre-trained VLA.

## Evaluation on VQA

We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to run evaluations on VQA tasks. We provide a [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support W1.58-A8 quantization.
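
For reference, W1.58-A8 means ternary ({-1, 0, +1}) weights with an absmean scale combined with 8-bit, per-token absmax activation quantization, in the style of BitNet b1.58. Below is a minimal PyTorch sketch of that scheme; the function names are illustrative and are not the actual API of the modified modeling files.

```python
import torch
import torch.nn.functional as F

def weight_quant_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization: snap weights to {-1, 0, +1} with one per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

def act_quant_int8(x: torch.Tensor, eps: float = 1e-5):
    """Per-token absmax quantization of activations to the 8-bit range [-128, 127]."""
    scale = 127.0 / x.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale

def bitlinear_forward(x: torch.Tensor, w_master: torch.Tensor) -> torch.Tensor:
    """Online quantization: quantize the full-precision master weights on the fly,
    run the matmul, then undo both quantization scales."""
    w_q, w_scale = weight_quant_ternary(w_master)
    x_q, x_scale = act_quant_int8(x)
    y = F.linear(x_q, w_q)      # integer-valued matmul, emulated here in floating point
    return y * w_scale / x_scale

# toy check
x = torch.randn(2, 16)
w = torch.randn(32, 16)
print(bitlinear_forward(x, w).shape)  # torch.Size([2, 32])
```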

The evaluation should be run inside an nvidia_24_07 Docker container. Set up the environment and install the packages:

```bash
docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only used for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only used for multimodal evaluation
```

First, download the BitVLA models from Hugging Face:

```bash
git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L
```
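
If you want a quick sanity check of a downloaded checkpoint outside of LMM-Eval, a minimal sketch is shown below. It assumes the modified transformers fork from this repo is installed (the stock library does not contain the W1.58-A8 changes); the class names and prompt format follow the usual llava convention in transformers and may differ from what the checkpoint actually expects.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "./bitvla-bitsiglipL-224px-bf16"  # directory created by the git clone above

processor = AutoProcessor.from_pretrained(model_path)
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")  # any local test image
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"  # assumed prompt template

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```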

Then run the following scripts to conduct the evaluations:

```bash
cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16
```

Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.
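
As a rough illustration of the offline path, the sketch below (our own toy layout, not the bitnet.cpp storage format) snaps a full-precision master weight matrix to {-1, 0, +1}, keeps one scale per tensor, and packs four ternary codes into each byte, cutting weight memory by roughly 8x compared with bf16.

```python
import torch

def pack_ternary(w: torch.Tensor):
    """Offline 1.58-bit quantization: ternarize with an absmean scale, then pack
    four 2-bit codes (0, 1, 2 standing for -1, 0, +1) into every byte."""
    scale = w.abs().mean()
    codes = ((w / scale).round().clamp(-1, 1) + 1).to(torch.uint8)  # values in {0, 1, 2}
    flat = codes.flatten()
    pad = (-flat.numel()) % 4                      # pad so the length is a multiple of 4
    flat = torch.cat([flat, flat.new_zeros(pad)])
    groups = flat.view(-1, 4)
    packed = groups[:, 0] | (groups[:, 1] << 2) | (groups[:, 2] << 4) | (groups[:, 3] << 6)
    return packed, scale, w.shape

w = torch.randn(2048, 2048)  # stand-in for one bf16 master weight matrix
packed, scale, shape = pack_ternary(w)
print(f"{w.numel() * 2} bytes as bf16 -> {packed.numel()} bytes packed")
```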

## Acknowledgement

This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) and [Hugging Face Transformers](https://github.com/huggingface/transformers).

## License

This project is licensed under the MIT License.

### Contact Information

For help or issues using the models, please submit a GitHub issue.