DreamVLA: A Vision-Language-Action Model
Dreamed with Comprehensive World Knowledge
Table of Contents:
- Installation
- Data Processing
- Training
- Evaluation
- Acknowledgement
- Citation
Installation
Create an anaconda environment
conda create -n dreamvla python=3.10
conda activate dreamvla
Clone this repo
git clone https://github.com/Zhangwenyao1/DreamVLA
This repository's code is based on Seer.
Install dependencies for CALVIN
Data Processing
Note: during data processing you may hit a PyTorch runtime error ending with "Use .reshape(...) instead."; if so, just change the offending .view(...) call to .reshape(...).
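For instance (a minimal illustration with made-up tensor shapes, not the repo's actual code), the error comes from calling .view() on a non-contiguous tensor:

```python
import torch

x = torch.randn(4, 8, 16).permute(0, 2, 1)  # permute() makes x non-contiguous

# y = x.view(4, -1)   # RuntimeError: ... Use .reshape(...) instead.
y = x.reshape(4, -1)  # .reshape() copies when necessary, so it succeeds
```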
Dynamic Region:
Install co-tracker. Note: download the co-tracker checkpoint and put it in ./co-tracker/checkpoints.
mv ./data_process/cotrack_extractor.py ./co-tracker/
cd co-tracker
python cotrack_extractor.py
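For reference, here is a minimal sketch of pulling point tracks from a clip with CoTracker's torch.hub API; the grid size and dummy video below are illustrative, not the settings used by cotrack_extractor.py:

```python
import torch

# Load the offline CoTracker2 predictor from torch.hub
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")

# Dummy clip: (batch, frames, channels, height, width), float values in [0, 255]
video = torch.rand(1, 16, 3, 256, 256) * 255

# Track a regular 30x30 grid of points across all frames
pred_tracks, pred_visibility = cotracker(video, grid_size=30)
print(pred_tracks.shape)  # (1, 16, 900, 2): per-frame (x, y) of each tracked point
```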
SAM Feature:
Install SAM. Note: download the SAM checkpoint and put it in ./segment-anything/ckpts.
cp dist_utils.py ./segment-anything/
mv ./data_info/ep_start_end_ids.npy <your_data_path>
mv ./data_process/sam_extractor.py ./segment-anything/
cd segment-anything
python sam_extractor.py
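For reference, a minimal sketch of embedding one frame with the segment-anything API (the checkpoint path, model variant, and dummy image are assumptions):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM with a downloaded checkpoint (variant/path are illustrative)
sam = sam_model_registry["vit_b"](checkpoint="./ckpts/sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Stand-in for an RGB frame; SamPredictor resizes and normalizes internally
image = np.zeros((256, 256, 3), dtype=np.uint8)
predictor.set_image(image)

features = predictor.features  # (1, 256, 64, 64) image-encoder embedding
print(features.shape)
```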
DINOv2 Feature:
Install DINOv2. Note: download the DINOv2 checkpoint and put it in ./dinov2/ckpts.
cp dist_utils.py ./dinov2/
mv ./data_process/dino_extractor.py ./dinov2/
cd dinov2
python dino_extractor.py
If you want to finetune our model, running python dino_extractor.py is required.
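For reference, a minimal sketch of extracting DINOv2 patch features via torch.hub (the ViT-B/14 variant and input size are illustrative):

```python
import torch

# Load DINOv2 ViT-B/14 from torch.hub (variant is an assumption)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
dinov2.eval()

# Input sides must be a multiple of the 14-pixel patch size
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = dinov2.forward_features(img)

patch_tokens = out["x_norm_patchtokens"]  # (1, 256, 768) per-patch features
cls_token = out["x_norm_clstoken"]        # (1, 768) global feature
print(patch_tokens.shape, cls_token.shape)
```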
Merge all extracted features with the raw CALVIN dataset to produce the new dataset:
python ./data_process/merge_sam_dino.py # merge the SAM and DINOv2 features into the new dataset
python ./data_process/merge_track.py # merge the optical flow into the new dataset
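Conceptually, the merge scripts attach the pre-extracted features to each CALVIN episode file; the sketch below uses hypothetical file names and keys, not the repo's actual schema:

```python
import numpy as np

# Load one raw CALVIN episode (keys like 'rgb_static', 'actions', ...)
ep = dict(np.load("episode_0000000.npz", allow_pickle=True))

# Attach the pre-extracted features for the same episode (hypothetical paths/keys)
ep["sam_feat"] = np.load("sam_feats/episode_0000000.npy")
ep["dino_feat"] = np.load("dino_feats/episode_0000000.npy")

np.savez("merged/episode_0000000.npz", **ep)
```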
Training
Note: you need to update the settings in the *.sh scripts under ./scripts/CALVIN_ABC_D/DreamVLA/. Moreover, if you use fewer than 8 GPUs, please change the node_num in the *.sh scripts.
Pretrain:
bash ./scripts/CALVIN_ABC_D/DreamVLA/pretrain.sh
Finetune:
bash ./scripts/CALVIN_ABC_D/DreamVLA/finetune.sh
Evaluation
Download our checkpoint, create a checkpoints/ directory, and put the checkpoint inside it.
bash ./scripts/CALVIN_ABC_D/DreamVLA/eval.sh
Acknowledgement
We would like to express our deepest gratitude to Yang Tian for the technical support!
Citation
If you find our ideas / environments helpful, please cite our work:
@article{dreamvla25,
author = {Wenyao Zhang and
Hongsi Liu and
Zekun Qi and
Yunan Wang and
Xinqiang Yu and
Jiazhao Zhang and
Runpei Dong and
Jiawei He and
He Wang and
Zhizheng Zhang and
Li Yi and
Wenjun Zeng and
Xin Jin},
title = {DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge},
journal = {CoRR},
volume = {abs/2507.04447},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2507.04447},
doi = {10.48550/ARXIV.2507.04447},
eprinttype = {arXiv},
eprint = {2507.04447}
}