Update README.md
README.md
CHANGED

---

# BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

[[paper]](https://arxiv.org/abs/2506.07530) [[model]](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e) [[code]](https://github.com/ustcwhy/BitVLA)

- June 2025: [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](https://arxiv.org/abs/2506.07530)

## Open Source Plan

- ✅ Paper, pre-trained VLM and evaluation code.
- ✅ Fine-tuned VLA code and models.
- 🧭 Pre-training code and VLA.

## Contents

- [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](#bitvla-1-bit-vision-language-action-models-for-robotics-manipulation)
  - [Contents](#contents)
  - [Checkpoints](#checkpoints)
  - [Vision-Language](#vision-language)
    - [Evaluation on VQA](#evaluation-on-vqa)
  - [Vision-Language-Action](#vision-language-action)
    - [OFT Training](#oft-training)
      - [1. Preparing OFT](#1-preparing-oft)
      - [2. OFT fine-tuning](#2-oft-fine-tuning)
    - [Evaluation on LIBERO](#evaluation-on-libero)
  - [Acknowledgement](#acknowledgement)
  - [Citation](#citation)
  - [License](#license)
  - [Contact Information](#contact-information)

## Checkpoints

| Model | Path |
| -------------- | ----- |
| BitVLA | [hongyuw/bitvla-bitsiglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) |
| BitVLA fine-tuned on LIBERO-Spatial | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16) |
| BitVLA fine-tuned on LIBERO-Object | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16) |
| BitVLA fine-tuned on LIBERO-Goal | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
| BitVLA fine-tuned on LIBERO-Long | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
| BitVLA w/ BF16 SigLIP | [hongyuw/bitvla-siglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16) |

*Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.*

*Due to limited resources, we have not yet pre-trained BitVLA on a large-scale robotics dataset. We are actively working to secure additional compute resources to conduct this pre-training.*
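
As a quick way to try these checkpoints, the snippet below sketches how the master weights might be loaded through the modified transformers fork described in the next section. The Llava class, prompt template, and processor usage here are assumptions for illustration; the repository's own evaluation scripts are the authoritative entry point.

```
# Hypothetical loading sketch: assumes the checkpoint follows the Llava layout used by the
# modified transformers fork. Verify class names and the prompt format against the BitVLA repo.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "hongyuw/bitvla-bitsiglipL-224px-bf16"  # master weights; quantization happens online
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")  # any local test image
prompt = "USER: <image>\nWhat object is on the table? ASSISTANT:"  # assumed Llava-style template
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```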

## Vision-Language

### Evaluation on VQA

We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to conduct evaluations on VQA tasks. We provide the [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support W1.58-A8 quantization.
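
For intuition about what W1.58-A8 means here, the sketch below follows the general BitNet b1.58 recipe: weights are mapped to {-1, 0, +1} with a per-tensor absmean scale, and activations are quantized per token to the signed 8-bit range. It is a simplified illustration under those assumptions, not the exact code in the modified modeling files.

```
# Simplified W1.58-A8 sketch in the BitNet b1.58 style; illustrative only, not the repo's exact code.
import torch

def weight_quant_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization: one scale per tensor, quantized values in {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale  # dequantize as w_q * scale

def act_quant_int8(x: torch.Tensor, eps: float = 1e-5):
    """Per-token absmax quantization of activations to the signed 8-bit range."""
    scale = 127.0 / x.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale  # dequantize as x_q / scale

# A W1.58-A8 linear layer then computes approximately:
#   y ≈ (x_q @ w_q.t()) * w_scale / x_scale
```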

Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.

## Vision-Language-Action

### OFT Training

#### 1. Preparing OFT

We fine-tune BitVLA with the OFT recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/tree/main). First set up the environment as required by that project; refer to [SETUP.md](https://github.com/moojink/openvla-oft/blob/main/SETUP.md) and [LIBERO.md](https://github.com/moojink/openvla-oft/blob/main/LIBERO.md) for detailed instructions.

```
conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124

# or use the provided Docker image
# docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity

cd BitVLA
pip install -e openvla-oft/
pip install -e transformers

cd openvla-oft/

# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
# in BitVLA
pip install -r experiments/robot/libero/libero_requirements.txt

# install bitvla
pip install -e bitvla/
```

We adopt the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download the dataset from [HuggingFace](https://huggingface.co/datasets/openvla/modified_libero_rlds).

```
git clone [email protected]:datasets/openvla/modified_libero_rlds
```

#### 2. OFT fine-tuning

First, convert the [BitVLA checkpoint](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) to a format compatible with the VLA codebase.

```
python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
```

After that, you can fine-tune BitVLA with the following command. Here we take LIBERO-Spatial as an example:

```
torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_bitnet.py \
  --vla_path /path/to/bitvla-bitsiglipL-224px-bf16 \
  --data_root_dir /path/to/modified_libero_rlds/ \
  --dataset_name libero_spatial_no_noops \
  --run_root_dir /path/to/save/your/ckpt \
  --use_l1_regression True \
  --warmup_steps 375 \
  --use_lora False \
  --num_images_in_input 2 \
  --use_proprio True \
  --batch_size 2 \
  --grad_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --max_steps 10001 \
  --save_freq 10000 \
  --save_latest_checkpoint_only False \
  --image_aug True \
  --run_id_note your_id
```

### Evaluation on LIBERO

You can download our fine-tuned BitVLA models from [HuggingFace](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e). As an example, for the LIBERO-Spatial suite, run the following script for evaluation:

```
python experiments/robot/libero/run_libero_eval_bitnet.py \
  --pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
  --task_suite_name libero_spatial \
  --info_in_path "information you want to show in path" \
  --model_family "bitnet"
```

## Acknowledgement

This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [Hugging Face's transformers](https://github.com/huggingface/transformers), and [OpenVLA-OFT](https://github.com/moojink/openvla-oft).

## Citation

If you find this repository useful, please consider citing our work:

## License

This project is licensed under the MIT License.

### Contact Information

For help or issues using the models, please submit a GitHub issue.