arxiv:2603.17051

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Published on Mar 17
· Submitted by Franklin on Mar 23
#1 Paper of the day
Abstract

Astrolabe is an efficient online reinforcement learning framework for distilled autoregressive video models that improves generation quality through forward-process RL formulation and streaming training with multi-reward objectives.

AI-generated summary

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.

Community

Astrolabe is an efficient online Reinforcement Learning (RL) framework designed to align distilled autoregressive (AR) streaming video models with human visual preferences. Without sacrificing real-time inference speed, Astrolabe consistently and robustly improves visual aesthetics across various baseline models.

awesome work!

Some suggestions:
1. Positive-Negative Sample Contrast Quality: The effectiveness of the forward-process RL heavily depends on the quality of negative samples. If negatives are too easy or too hard, the learning signal is suboptimal. A curriculum learning strategy — starting from easy-to-distinguish pairs and gradually increasing difficulty — could improve fine-grained preference learning.
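A minimal sketch of what such a curriculum could look like: pairs are selected by reward gap, and the minimum gap is annealed from large (easy) to small (hard) over training. The reward-gap thresholds, the linear schedule, and the `(clip_id, reward)` representation are all illustrative assumptions, not details from the paper.

```python
def curriculum_pairs(samples, step, total_steps, easy_gap=0.5, hard_gap=0.05):
    """Select positive/negative pairs whose required reward gap shrinks over training.

    `samples` is a list of (clip_id, reward) tuples; gap thresholds are
    illustrative assumptions, not the paper's actual values.
    """
    # Linearly anneal the minimum reward gap from easy (large) to hard (small).
    frac = min(step / max(total_steps, 1), 1.0)
    min_gap = easy_gap + frac * (hard_gap - easy_gap)

    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    pairs = []
    for pos in ranked:
        # Scan from the lowest-reward candidate upward for a valid negative.
        for neg in reversed(ranked):
            gap = pos[1] - neg[1]
            if gap >= min_gap:
                pairs.append((pos[0], neg[0], gap))
                break
    return pairs
```

Early in training only widely separated pairs qualify; as `step` grows, finer-grained distinctions enter the batch.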

2. Sliding Window Boundary Artifacts: Applying RL updates only to local clip windows can cause distribution drift at window boundaries, leading to incoherence between adjacent segments. Adding overlapping regions between windows with consistency constraints, or applying temporal blending at boundaries, could mitigate this.
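The temporal-blending idea could be as simple as a linear cross-fade over a few overlapping frames between adjacent clip windows. The `(T, H, W, C)` array layout and the linear ramp are illustrative choices; the paper does not specify a blending scheme.

```python
import numpy as np

def blend_windows(prev_clip, next_clip, overlap):
    """Linearly cross-fade `overlap` frames between adjacent clip windows.

    Clips are (T, H, W, C) arrays; the linear ramp is an illustrative
    assumption, not the paper's method.
    """
    w = np.linspace(0.0, 1.0, overlap)[:, None, None, None]  # fade-in weights
    tail = prev_clip[-overlap:]   # last frames of the previous window
    head = next_clip[:overlap]    # first frames of the next window
    blended = (1.0 - w) * tail + w * head
    return np.concatenate([prev_clip[:-overlap], blended, next_clip[overlap:]])
```

Blending trades a few frames of sharpness for continuity; a consistency loss on the overlap region would be the learned alternative.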

3. Multi-Reward Fusion Limitations: Weighted summation of multiple rewards cannot handle conflicts between inherently contradictory objectives (e.g., motion intensity vs. visual stability). Replacing weighted summation with multi-objective RL (MORL) to find Pareto-optimal trade-offs would be a more principled approach.
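A generic MORL-style building block is Pareto filtering: keep only samples that no other sample dominates on every reward dimension, instead of collapsing rewards into one scalar. This is a standard multi-objective filter, not Astrolabe's actual objective.

```python
def pareto_front(candidates):
    """Return the candidates not dominated on any reward dimension.

    `candidates` maps sample ids to reward vectors; a generic MORL-style
    filter, not the paper's objective.
    """
    def dominates(u, v):
        # u dominates v if it is >= everywhere and strictly > somewhere.
        return (all(a >= b for a, b in zip(u, v))
                and any(a > b for a, b in zip(u, v)))

    return {cid: r for cid, r in candidates.items()
            if not any(dominates(other, r)
                       for oid, other in candidates.items() if oid != cid)}
```

With conflicting objectives (e.g., motion vs. stability rewards), the front retains samples that excel on either axis while dropping those worse on both.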

4. Dynamic Reference Policy Instability: Updating the reference policy too fast removes the constraint entirely; too slow makes it outdated. Using an exponential moving average (EMA) ensemble of historical checkpoints as the reference could provide more stable regularization.
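The EMA reference is a one-liner per parameter: the reference tracks the live policy with a decay factor, so it lags smoothly rather than jumping on each update. Plain float dicts stand in for model state dicts here, and the decay value is an assumption.

```python
def ema_update(ref_params, policy_params, decay=0.99):
    """Update reference-policy weights as an EMA of the live policy.

    Plain dicts of floats stand in for model state dicts; `decay` is an
    illustrative hyperparameter, not the paper's value.
    """
    return {k: decay * ref_params[k] + (1.0 - decay) * policy_params[k]
            for k in ref_params}
```

A decay near 1.0 yields a slowly moving, stable reference; a smaller decay approaches the "update too fast" failure mode the comment warns about.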

5. Lack of Temporal Reward Hierarchy: Rewards appear to operate at the clip level, which may miss long-range temporal coherence issues. A hierarchical reward structure — with frame-level, clip-level, and global-level signals — could guide optimization at multiple temporal scales.
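One way to sketch such a hierarchy: pool per-frame scores at three scales and combine them with fixed weights, using a worst-frame term as the global signal so a single bad frame cannot hide behind a good average. The weights and the pooling choices are illustrative assumptions.

```python
def hierarchical_reward(frame_rewards, clip_size,
                        w_frame=0.2, w_clip=0.3, w_global=0.5):
    """Combine frame-, clip-, and global-level reward signals.

    `frame_rewards` is a flat list of per-frame scores; the weights and
    pooling at each level are illustrative assumptions.
    """
    n = len(frame_rewards)
    clips = [frame_rewards[i:i + clip_size] for i in range(0, n, clip_size)]
    frame_term = sum(frame_rewards) / n                           # finest scale
    clip_term = sum(sum(c) / len(c) for c in clips) / len(clips)  # mid scale
    global_term = min(frame_rewards)                              # worst frame
    return w_frame * frame_term + w_clip * clip_term + w_global * global_term
```

Under this weighting, a video with one zero-reward frame is penalized far more than mean pooling alone would suggest, which is the point of the global term.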

6. Post-hoc Alignment vs. Alignment-Aware Distillation: The framework aligns models after distillation, essentially patching an already degraded model. Integrating human preference signals directly into the distillation process could fundamentally reduce the alignment burden downstream.


Thank you for your valuable suggestions! These insights are very helpful to us. We will carefully consider each point you've raised and incorporate them into future design improvements as we continue to enhance the overall quality.


