DreamVLA: A Vision-Language-Action Model
Dreamed with Comprehensive World Knowledge
Table of Contents:
- Installation
- Data Processing
- Training
- Evaluation
- Acknowledgement
- Citation
Installation
Create an anaconda environment
conda create -n dreamvla python=3.10
conda activate dreamvla
Clone this repo
git clone https://github.com/Zhangwenyao1/DreamVLA
This repository's code is based on Seer.
Install dependencies for CALVIN
Data Processing
Note: during data processing you may hit a PyTorch runtime error ending with "Use .reshape(...) instead."; if so, just change the offending .view(...) call to .reshape(...).
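For instance (a minimal illustration with made-up tensor shapes, not the repo's actual code), the error comes from calling .view() on a non-contiguous tensor:

```python
import torch

x = torch.randn(4, 8, 16).permute(0, 2, 1)  # permute() makes x non-contiguous

# y = x.view(4, -1)   # RuntimeError: ... Use .reshape(...) instead.
y = x.reshape(4, -1)  # .reshape() copies when necessary, so it succeeds
```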
Dynamic Region:
Install co-tracker. Note: download the co-tracker checkpoint and put it in ./co-tracker/checkpoints.
mv ./data_process/cotrack_extractor.py ./co-tracker/
cd co-tracker
python cotrack_extractor.py
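For reference, here is a minimal sketch of pulling point tracks from a clip with CoTracker's torch.hub API; the grid size and dummy video below are illustrative, not the settings used by cotrack_extractor.py:

```python
import torch

# Load the offline CoTracker2 predictor from torch.hub
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")

# Dummy clip: (batch, frames, channels, height, width), float values in [0, 255]
video = torch.rand(1, 16, 3, 256, 256) * 255

# Track a regular 30x30 grid of points across all frames
pred_tracks, pred_visibility = cotracker(video, grid_size=30)
print(pred_tracks.shape)  # (1, 16, 900, 2): per-frame (x, y) of each tracked point
```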
SAM Feature:
Install SAM. Note: download the SAM checkpoint and put it in ./segment-anything/ckpts.
cp dist_utils.py ./segment-anything/
mv ./data_info/ep_start_end_ids.npy <your_data_path>
mv ./data_process/sam_extractor.py ./segment-anything/
cd segment-anything
python sam_extractor.py
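For reference, a minimal sketch of embedding one frame with the segment-anything API (the checkpoint path, model variant, and dummy image are assumptions):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM with a downloaded checkpoint (variant/path are illustrative)
sam = sam_model_registry["vit_b"](checkpoint="./ckpts/sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Stand-in for an RGB frame; SamPredictor resizes and normalizes internally
image = np.zeros((256, 256, 3), dtype=np.uint8)
predictor.set_image(image)

features = predictor.features  # (1, 256, 64, 64) image-encoder embedding
print(features.shape)
```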
DINOv2 Feature:
Install DINOv2. Note: download the DINOv2 checkpoint and put it in ./dinov2/ckpts.
cp dist_utils.py ./dinov2/
mv ./data_process/dino_extractor.py ./dinov2/
cd dinov2
python dino_extractor.py
If you want to finetune our model, running python dino_extractor.py is required.
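For reference, a minimal sketch of extracting DINOv2 patch features via torch.hub (the ViT-B/14 variant and input size are illustrative):

```python
import torch

# Load DINOv2 ViT-B/14 from torch.hub (variant is an assumption)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
dinov2.eval()

# Input sides must be a multiple of the 14-pixel patch size
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = dinov2.forward_features(img)

patch_tokens = out["x_norm_patchtokens"]  # (1, 256, 768) per-patch features
cls_token = out["x_norm_clstoken"]        # (1, 768) global feature
print(patch_tokens.shape, cls_token.shape)
```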
Merge all extracted features with the raw CALVIN dataset to produce the new dataset:
python ./data_process/merge_sam_dino.py # merge the SAM and DINOv2 features into the new dataset
python ./data_process/merge_track.py # merge the optical flow into the new dataset
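Conceptually, the merge scripts attach the pre-extracted features to each CALVIN episode file; the sketch below uses hypothetical file names and keys, not the repo's actual schema:

```python
import numpy as np

# Load one raw CALVIN episode (keys like 'rgb_static', 'actions', ...)
ep = dict(np.load("episode_0000000.npz", allow_pickle=True))

# Attach the pre-extracted features for the same episode (hypothetical paths/keys)
ep["sam_feat"] = np.load("sam_feats/episode_0000000.npy")
ep["dino_feat"] = np.load("dino_feats/episode_0000000.npy")

np.savez("merged/episode_0000000.npz", **ep)
```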
Training
Note: you need to update the settings in the *.sh scripts under ./scripts/CALVIN_ABC_D/DreamVLA/. Moreover, if you use fewer than 8 GPUs, please change the node_num in the *.sh scripts.
Pretrain:
bash ./scripts/CALVIN_ABC_D/DreamVLA/pretrain.sh
Finetune:
bash ./scripts/CALVIN_ABC_D/DreamVLA/finetune.sh
Evaluation
Download our checkpoint, create a checkpoints/ directory, and put the checkpoint inside it.
bash ./scripts/CALVIN_ABC_D/DreamVLA/eval.sh
Acknowledgement
We would like to express our deepest gratitude to Yang Tian for the technical support!
Citation
If you find our ideas / environments helpful, please cite our work:
@article{dreamvla25,
author = {Wenyao Zhang and
Hongsi Liu and
Zekun Qi and
Yunan Wang and
Xinqiang Yu and
Jiazhao Zhang and
Runpei Dong and
Jiawei He and
He Wang and
Zhizheng Zhang and
Li Yi and
Wenjun Zeng and
Xin Jin},
title = {DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge},
journal = {CoRR},
volume = {abs/2507.04447},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2507.04447},
doi = {10.48550/ARXIV.2507.04447},
eprinttype = {arXiv},
eprint = {2507.04447}
}