RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
Model Description
The RLinf-openvlaoft-libero series is trained from RLinf/RLinf-OpenVLAOFT-LIBERO-xxx-Base-Lora (covering libero90 and libero130) and Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 (covering libero10, libero-object, libero-goal, and libero-spatial), using the same base models and training datasets as verl. Training with RLinf yields SOTA performance.
We use a mask to restrict the loss to valid action tokens and compute a token-level loss based on the Group Relative Policy Optimization (GRPO) advantage function, in order to enhance the model’s performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.
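As a rough illustration of this objective, the sketch below shows a GRPO-style masked token-level loss. It is a minimal sketch, not RLinf's actual implementation: the tensor shapes, the assumption that rollouts from the same group are contiguous in the batch, and the PPO-style clipping constant are all assumptions made for the example.

```python
import torch

def grpo_masked_loss(logprobs, old_logprobs, rewards, action_mask,
                     group_size, clip_eps=0.2):
    """Sketch of a GRPO-style masked token-level loss.

    logprobs, old_logprobs: (batch, seq_len) log-probs of the chosen tokens
    rewards:                (batch,) scalar reward per trajectory
    action_mask:            (batch, seq_len) 1 for valid action tokens, else 0
    group_size:             rollouts per task; trajectories of one group are
                            assumed contiguous in the batch (group_size > 1)
    """
    # Group-relative advantage: normalize each trajectory's reward against
    # the other rollouts in its group.
    grouped = rewards.view(-1, group_size)
    adv = (grouped - grouped.mean(dim=1, keepdim=True)) / \
          (grouped.std(dim=1, keepdim=True) + 1e-8)
    adv = adv.view(-1, 1)  # broadcast the same advantage to every token

    # PPO-style clipped ratio, computed per token.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)

    # Token-level loss: average only over the masked-in action tokens.
    return (per_token * action_mask).sum() / action_mask.sum().clamp(min=1)
```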
Evaluation and Results
We trained the following models using RLinf, all with the same recommended sampling settings (a sampling sketch follows the list):
RLinf-OpenVLAOFT-GRPO-LIBERO-90 Model (based on RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora)
- Recommended sampling settings: temperature = 1.6, top_p = 1.0
RLinf-OpenVLAOFT-LIBERO-130 Model (based on RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora)
- Recommended sampling settings: temperature = 1.6, top_p = 1.0
RLinf-OpenVLAOFT-GRPO-LIBERO-object Model (based on Haozhan72/Openvla-oft-SFT-libero-object-traj1)
- Recommended sampling settings: temperature = 1.6, top_p = 1.0
RLinf-OpenVLAOFT-GRPO-LIBERO-spatial Model (based on Haozhan72/Openvla-oft-SFT-libero-spatial-traj1)
- Recommended sampling settings: temperature = 1.6, top_p = 1.0
RLinf-OpenVLAOFT-GRPO-LIBERO-goal Model (based on Haozhan72/Openvla-oft-SFT-libero-goal-traj1)
- Recommended sampling settings: temperature = 1.6, top_p = 1.0
RLinf-OpenVLAOFT-GRPO-LIBERO-long Model (based on Haozhan72/Openvla-oft-SFT-libero10-traj1)
- Recommended sampling settings: temperature = 1.6, top_p = 1.0
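For reference, the sketch below (plain PyTorch, not RLinf code) shows what these two knobs do when decoding a single action token: temperature = 1.6 flattens the logits, and top_p = 1.0 leaves nucleus filtering effectively disabled.

```python
import torch

def sample_action_token(logits, temperature=1.6, top_p=1.0):
    """Temperature + nucleus (top-p) sampling over a 1-D logit vector."""
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p (a no-op when top_p = 1.0).
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p  # the top token is always kept
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()

    # Sample from the filtered distribution and map back to vocabulary ids.
    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_idx[choice]

# Example: draw one action token from random logits over a 256-token vocabulary.
token_id = sample_action_token(torch.randn(256))
```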
Benchmark Results
The SFT models for LIBERO-90 and LIBERO-130 were trained by ourselves following the training recipe from OpenVLA-OFT; the other SFT models are from SimpleVLA-RL.
We evaluate each model according to its training configuration, using libero_seed = 0 and evaluating 500 episodes for the Object, Spatial, Goal, and Long suites, 4,500 episodes for LIBERO-90, and 6,500 episodes for LIBERO-130. For the SFT-trained (LoRA-based) models, we set do_sample = False. For the RL-trained models, we set do_sample = True, temperature = 1.6, and rollout_epoch = 2, and report the final results as the average across the two runs.
| Model | Object | Spatial | Goal | Long | LIBERO-90 | Average |
|---|---|---|---|---|---|---|
| SFT models | 28.83 | 52.22 | 49.40 | 14.92 | 79.28 | 66.07 |
| Trained with RLinf | 97.68 | 94.76 | 93.96 | 90.93 | 96.44 | 95.79 |
In addition, we trained a single model (the libero-130 model) on all tasks in LIBERO.
| libero-130 model | Object | Spatial | Goal | Long | LIBERO-90 | LIBERO-130 (all) |
|---|---|---|---|---|---|---|
| SFT models | 50.20 | 51.61 | 49.40 | 11.90 | 42.67 | 42.09 |
| Trained with RLinf | 99.60 | 98.69 | 98.09 | 93.45 | 98.02 | 97.85 |
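The Average and LIBERO-130 (all) columns are consistent with an episode-count-weighted mean of the per-suite success rates (500 evaluation episodes each for Object, Spatial, Goal, and Long, and 4,500 for LIBERO-90), which can be checked directly:

```python
# Episode counts used in the evaluation protocol described above.
EPISODES = {"Object": 500, "Spatial": 500, "Goal": 500, "Long": 500, "90": 4500}

def weighted_avg(success_rates, episodes=EPISODES):
    """Episode-count-weighted mean of per-suite success rates."""
    total = sum(episodes.values())
    return sum(success_rates[k] * episodes[k] for k in episodes) / total

# RLinf-trained rows from the two tables above.
print(weighted_avg({"Object": 97.68, "Spatial": 94.76, "Goal": 93.96,
                    "Long": 90.93, "90": 96.44}))  # ~95.79
print(weighted_avg({"Object": 99.60, "Spatial": 98.69, "Goal": 98.09,
                    "Long": 93.45, "90": 98.02}))  # ~97.85
```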
How to Use
Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/libero_10_grpo_openvlaoft.yaml`:
- Set `rollout.model.model_path`, `actor.model.model_path`, and `actor.tokenizer.tokenizer_model` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to false.
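If you prefer to apply these edits programmatically, the sketch below uses plain PyYAML and assumes the dotted parameter names map one-to-one onto nested keys in the YAML file; the checkpoint path is a placeholder.

```python
import yaml  # PyYAML

CFG = "examples/embodiment/config/libero_10_grpo_openvlaoft.yaml"
CKPT = "/path/to/model/checkpoint"  # placeholder: local path of the downloaded checkpoint

with open(CFG) as f:
    cfg = yaml.safe_load(f)

# Point the rollout model, actor model, and tokenizer at the checkpoint.
cfg["rollout"]["model"]["model_path"] = CKPT
cfg["actor"]["model"]["model_path"] = CKPT
cfg["actor"]["tokenizer"]["tokenizer_model"] = CKPT

# For direct evaluation of the released weights, disable LoRA loading (see note above).
cfg["actor"]["model"]["is_lora"] = False

with open(CFG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```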
License
This code repository and the model weights are licensed under the MIT License.