RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

Community Article Published August 11, 2025


TL;DR

We introduce RynnVLA-001, a vision-language-action model built upon large-scale video generative pre-training.

  • RynnVLA-001 is pretrained on ~12M ego-centric manipulation videos.
  • We unify next-frame prediction and next-action prediction into a single transformer.
  • We train a lightweight VAE to accurately compress action chunks into action embeddings.
  • Our RynnVLA-001 outperforms Pi-0 and GR00T-N1.5 in terms of both real-world task success rate and instruction-following capability.

Open-source links:

Introduction

Thanks to the availability of large-scale datasets, the past few years have witnessed rapid progress in language models, multimodal models, vision-based perception models, and generative models. In contrast, advancements in robotics models remain limited, hindered by the labor-intensive collection of large-scale robot manipulation data.

In this work, we attempt to alleviate this challenge by leveraging generative priors. We propose RynnVLA-001, a simple yet effective Vision-Language-Action (VLA) model built upon a pretrained video generation model. The key insight of RynnVLA-001 is to implicitly transfer manipulation skills learned from human demonstrations in ego-centric videos to robotic arms. The overview of RynnVLA-001 is shown in the figure below. We first train a video generation model on ego-centric manipulation videos. Then, building on this base model, we unify next-frame prediction and next-action prediction into a single transformer.

[Figure: Overview of RynnVLA-001]

Our proposed RynnVLA-001 enables a robot arm to successfully execute complex pick-and-place and long-horizon tasks by accurately following high-level language instructions.

Method

[Figure: The three-stage training pipeline of RynnVLA-001]

Stage 1: Ego-centric Video Generation Model

The main challenge in scaling up VLA models is that paired data for VLA training is limited. In this work, we transfer priors learned by video generation models to VLA models. In the VLA setting, actions are predicted from current observations and language instructions. To mimic this inference scenario, the video generation model should be an Image-to-Video (I2V) model, which predicts future frames from a given image. We adopt an autoregressive Transformer-based architecture for video generation. Furthermore, action prediction relies on observations from ego-centric views. To this end, we curate 11.93M ego-centric human manipulation videos for training. These videos capture first-person-view human operations and focus on hand manipulation. In addition, we filter 244K robotic manipulation videos from open-source datasets. In this stage, we use only visual observations and language instructions and deliberately omit any corresponding action labels (such as joint states or end-effector positions), compelling the model to learn an implicit understanding of physical dynamics directly from pixels.
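
To make the Stage-1 objective concrete, the sketch below trains a toy decoder-only transformer over a sequence of [instruction tokens | frame tokens] with a next-token cross-entropy loss, ignoring the conditioning span so that only future-frame tokens are supervised. The tokenizer, layer sizes, and function names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EgoVideoGPT(nn.Module):
    """Toy decoder-only transformer over [text tokens | frame tokens]."""
    def __init__(self, vocab_size=8192, d_model=512, n_layers=4, n_heads=8, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (B, T) int64
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # causal mask so each position only attends to the past
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                             # (B, T, vocab)

def stage1_loss(model, tokens, n_cond):
    """I2V-style pre-training step: condition on instruction + first frame
    (the first n_cond tokens), supervise only future-frame tokens."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:].clone()
    targets[:, :n_cond - 1] = -100                      # ignore conditioning span
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)
```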

Stage 2: VAE for Compressing Robot Action Chunks

In VLA models, predicting action chunks (short sequences of actions) rather than single-step actions has proven to be beneficial. This design choice is motivated by two key factors: 1) Avoiding repetitive predictions: Single-action prediction may yield negligible visual change per step, causing the model to repeatedly output the same action and get stuck. 2) Efficiency: Predicting multiple actions at a time reduces computational overhead. To enable chunk-level prediction while maintaining action smoothness, we train a lightweight VAE to encode each robot action chunk into a compact and continuous embedding. The VLA model then only needs to predict a single embedding vector, which can be decoded into a coherent sequence of actions.
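
A minimal sketch of such an action-chunk VAE is shown below. The chunk length, action dimensionality, and layer sizes are assumptions for illustration; the released model may differ.

```python
import torch
import torch.nn as nn

class ActionChunkVAE(nn.Module):
    """Toy VAE that compresses an action chunk (T steps x D dims) into a
    single continuous latent embedding."""
    def __init__(self, chunk_len=20, action_dim=7, latent_dim=64, hidden=256):
        super().__init__()
        in_dim = chunk_len * action_dim
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, in_dim))
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def encode(self, chunk):                        # chunk: (B, T, D)
        h = self.enc(chunk.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):                            # z: (B, latent_dim)
        return self.dec(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, chunk):
        mu, logvar = self.encode(chunk)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.decode(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return recon, kl
```

At Stage 3, the VLA model only needs to regress a single latent vector of this kind, which `decode` then expands back into a smooth sequence of low-level actions.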

Stage 3: Vision-Language-Action Model

At the final stage, we fine-tune the pretrained ego-centric video generation model into a VLA model by integrating the VAE-based action representations. In this stage, we unify next-frame prediction and next-action prediction into a single transformer, which is trained to predict both action embeddings and visual tokens. Since the action embeddings are continuous, we use a separate head to predict them: a lightweight action prediction head consisting of a single linear layer, supervised with an L1 loss. In addition, the model is optimized to predict future visual observations, supervised with a cross-entropy loss between predicted vision tokens and ground-truth vision tokens.
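
As a rough illustration of this combined objective, the hypothetical snippet below applies an L1 loss to the output of a single-linear-layer action head and a cross-entropy loss to the predicted vision tokens. The tensor layout, readout position, and the `ce_weight` balancing term are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stage3_loss(hidden_states, action_query_idx, action_head,
                vision_logits, vision_targets, target_embedding, ce_weight=1.0):
    """Combined Stage-3 objective (sketch): L1 on the predicted action
    embedding + cross-entropy on future vision tokens.

    hidden_states:    (B, T, d) transformer outputs
    action_query_idx: position from which the action embedding is read out
    action_head:      a single nn.Linear(d, latent_dim)
    vision_logits:    (B, N, vocab) logits for predicted vision tokens
    vision_targets:   (B, N) ground-truth vision token ids
    target_embedding: (B, latent_dim) VAE embedding of the ground-truth chunk
    """
    action_pred = action_head(hidden_states[:, action_query_idx])   # (B, latent_dim)
    l1 = F.l1_loss(action_pred, target_embedding)
    ce = F.cross_entropy(vision_logits.reshape(-1, vision_logits.size(-1)),
                         vision_targets.reshape(-1))
    return l1 + ce_weight * ce
```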

Inference

At inference time, the model takes an RGB observation and a language instruction as inputs, and generates an action embedding. This embedding is passed through the VAE decoder to reconstruct a sequence of low-level robot actions. These actions are then executed by the robot. After the execution of the predicted action chunk, the updated observation is fed back into the model, and the process repeats until the task is completed. Notably, during inference we predict only the action embedding and discard future vision token prediction to improve efficiency, as predicting large numbers of vision tokens is computationally expensive.
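
A closed-loop inference sketch under these assumptions might look like the following; `vla_model`, `action_vae`, and `robot` are placeholder interfaces rather than the released API.

```python
import torch

@torch.no_grad()
def run_episode(vla_model, action_vae, robot, instruction, max_chunks=50):
    """Closed-loop control: predict an action embedding, decode it into a
    chunk of low-level actions, execute, then re-observe and repeat."""
    for _ in range(max_chunks):
        obs = robot.get_rgb_observation()                    # current camera frame
        embedding = vla_model.predict_action_embedding(obs, instruction)
        chunk = action_vae.decode(embedding)                 # (1, T, action_dim)
        for action in chunk[0]:
            robot.execute(action)                            # step one low-level action
        if robot.task_completed():
            break
```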
