Image-Text-to-Text · Transformers · PyTorch · English · llava · image-to-text · 1-bit · VLA · VLM · conversational
hongyuw committed · Commit 813257e · verified · 1 Parent(s): 2996f4d

Update README.md

Files changed (1)
  1. README.md +123 -11
README.md CHANGED
@@ -18,7 +18,6 @@ library_name: transformers
  ---

  # BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
-
  [[paper]](https://arxiv.org/abs/2506.07530) [[model]](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e) [[code]](https://github.com/ustcwhy/BitVLA)

  - June 2025: [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](https://arxiv.org/abs/2506.07530)
@@ -27,11 +26,44 @@ library_name: transformers
  ## Open Source Plan

  - ✅ Paper, Pre-trained VLM and evaluation code.
- - 🧭 Fine-tuned VLA models, pre-training and fine-tuning code.
- - 🧭 Pre-trained VLA.
-
-
- ## Evaluation on VQA

  We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to conduct evaluations on VQA tasks. We provide the [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify the [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support the W1.58-A8 quantization.

@@ -62,12 +94,89 @@ bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16

  Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.

- ## Acknowledgement

- This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) and [the HuggingFace's transformers](https://github.com/huggingface/transformers).

- ## License
- This project is licensed under the MIT License.

  ## Citation

@@ -83,6 +192,9 @@ If you find this repository useful, please consider citing our work:
  }
  ```

  ### Contact Information

- For help or issues using models, please submit a GitHub issue.
 
  ---

  # BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

  [[paper]](https://arxiv.org/abs/2506.07530) [[model]](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e) [[code]](https://github.com/ustcwhy/BitVLA)

  - June 2025: [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](https://arxiv.org/abs/2506.07530)
 
  ## Open Source Plan

  - ✅ Paper, Pre-trained VLM and evaluation code.
+ - ✅ Fine-tuned VLA code and models.
+ - 🧭 Pre-training code and VLA.
+
+ ## Contents
+
+ - [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](#bitvla-1-bit-vision-language-action-models-for-robotics-manipulation)
+ - [Contents](#contents)
+ - [Checkpoints](#checkpoints)
+ - [Vision-Language](#vision-language)
+ - [Evaluation on VQA](#evaluation-on-vqa)
+ - [Vision-Language-Action](#vision-language-action)
+ - [OFT Training](#oft-training)
+ - [1. Preparing OFT](#1-preparing-oft)
+ - [2. OFT fine-tuning](#2-oft-fine-tuning)
+ - [Evaluation on LIBERO](#evaluation-on-libero)
+ - [Acknowledgement](#acknowledgement)
+ - [Citation](#citation)
+ - [License](#license)
+ - [Contact Information](#contact-information)
+
+ ## Checkpoints
+
+ | Model | Path |
+ | -------------- | ----- |
+ | BitVLA | [hongyuw/bitvla-bitsiglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) |
+ | BitVLA finetuned on LIBERO-Spatial | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16) |
+ | BitVLA finetuned on LIBERO-Object | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16) |
+ | BitVLA finetuned on LIBERO-Goal | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16) |
+ | BitVLA finetuned on LIBERO-Long | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
+ | BitVLA w/ BF16 SigLIP | [hongyuw/bitvla-siglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16) |
+
+ *Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.*
+
+ *Due to limited resources, we have not yet pre-trained BitVLA on a large-scale robotics dataset. We are actively working to secure additional compute resources to conduct this pre-training.*
+
+ ## Vision-Language
+
+ ### Evaluation on VQA

  We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to conduct evaluations on VQA tasks. We provide the [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify the [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support the W1.58-A8 quantization.

  Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.
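
As context for the W1.58-A8 scheme mentioned above, here is a minimal, illustrative PyTorch sketch of BitNet-b1.58-style quantization: per-tensor absmean ternary weights and per-token absmax 8-bit activations. The function names are ours, and details (scaling granularity, straight-through estimation during training) may differ from the modified modeling files in this repository.

```python
import torch
import torch.nn.functional as F

def weight_quant_w158(w):
    # Per-tensor absmean scale, then round to the ternary set {-1, 0, +1} (the "1.58-bit" weights).
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1), scale

def act_quant_a8(x):
    # Per-token absmax scale, then round into the signed 8-bit range.
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-5)
    return (x / scale).round().clamp(-128, 127), scale

def bitlinear_forward(x, w, b=None):
    # Simulated W1.58-A8 linear layer: quantize on the fly, matmul, then undo both scales.
    w_q, w_scale = weight_quant_w158(w)
    x_q, x_scale = act_quant_a8(x)
    y = F.linear(x_q, w_q) * x_scale * w_scale
    return y if b is None else y + b
```

Storing the ternary `w_q` and its scale offline, instead of the bf16 master weights, is what yields the memory savings referred to in the note above.
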
+ ## Vision-Language-Action
+
+ ### OFT Training
+
+ #### 1. Preparing OFT
+
+ We fine-tune BitVLA following the OFT training recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/tree/main). First, set up the environment as required by that project; see [SETUP.md](https://github.com/moojink/openvla-oft/blob/main/SETUP.md) and [LIBERO.md](https://github.com/moojink/openvla-oft/blob/main/LIBERO.md) for detailed instructions.
+
+ ```
+ conda create -n bitvla python=3.10 -y
+ conda activate bitvla
+ pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
+
+ # or use the provided docker
+ # docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity
+
+ cd BitVLA
+ pip install -e openvla-oft/
+ pip install -e transformers
+
+ cd openvla-oft/
+
+ # install LIBERO
+ git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
+ pip install -e LIBERO/
+ # in BitVLA
+ pip install -r experiments/robot/libero/libero_requirements.txt
+
+ # install bitvla
+ pip install -e bitvla/
+ ```
+
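Optionally (this check is ours, not part of the original instructions), you can verify that the pinned PyTorch build sees your GPUs before proceeding:

```python
import torch

# Expect a 2.5.0+cu124 build, CUDA available, and the expected GPU count.
print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())
```
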
+ We adopt the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download the dataset from [HuggingFace](https://huggingface.co/datasets/openvla/modified_libero_rlds).
+
+ ```
+ git clone git@hf.co:datasets/openvla/modified_libero_rlds
+ ```
+
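If cloning over SSH is inconvenient, the same dataset can also be fetched with the `huggingface_hub` Python API; this alternative is our addition, and the `local_dir` below is an arbitrary choice:

```python
from huggingface_hub import snapshot_download

# Download the RLDS-formatted LIBERO dataset used by OpenVLA-OFT.
snapshot_download(
    repo_id="openvla/modified_libero_rlds",
    repo_type="dataset",
    local_dir="modified_libero_rlds",
)
```
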
+ #### 2. OFT fine-tuning
+
+ First, convert the [BitVLA](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) checkpoint to a format compatible with the VLA codebase:
+
+ ```
+ python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
+ ```
+
+ After that, you can fine-tune BitVLA using the following command. Here we take LIBERO-Spatial as an example:
+
+ ```
+ torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_bitnet.py \
+   --vla_path /path/to/bitvla-bitsiglipL-224px-bf16 \
+   --data_root_dir /path/to/modified_libero_rlds/ \
+   --dataset_name libero_spatial_no_noops \
+   --run_root_dir /path/to/save/your/ckpt \
+   --use_l1_regression True \
+   --warmup_steps 375 \
+   --use_lora False \
+   --num_images_in_input 2 \
+   --use_proprio True \
+   --batch_size 2 \
+   --grad_accumulation_steps 8 \
+   --learning_rate 1e-4 \
+   --max_steps 10001 \
+   --save_freq 10000 \
+   --save_latest_checkpoint_only False \
+   --image_aug True \
+   --run_id_note your_id
+ ```
+
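With the settings above, the effective global batch size is 2 × 8 × 4 = 64 (per-GPU batch size × gradient-accumulation steps × GPUs). If you change `--nproc-per-node` or `--batch_size`, you may want to adjust `--grad_accumulation_steps` to keep the effective batch size comparable.
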
+ ### Evaluation on LIBERO
+
+ You can download our fine-tuned BitVLA models from [HuggingFace](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e). As an example, for the LIBERO-Spatial suite, run the following script for evaluation:
+
+ ```
+ python experiments/robot/libero/run_libero_eval_bitnet.py \
+   --pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
+   --task_suite_name libero_spatial \
+   --info_in_path "information you want to show in path" \
+   --model_family "bitnet"
+ ```
+
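The other LIBERO suites can be evaluated the same way: point `--pretrained_checkpoint` at the corresponding fine-tuned checkpoint from the table above and set `--task_suite_name` to the matching suite.
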
+ ## Acknowledgement
+
+ This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), Hugging Face's [transformers](https://github.com/huggingface/transformers), and [OpenVLA-OFT](https://github.com/moojink/openvla-oft).

  ## Citation

  }
  ```

+ ## License
+ This project is licensed under the MIT License.
+
  ### Contact Information

+ For help or issues using the models, please submit a GitHub issue.