Unicorn: Text-Only Data Synthesis for Vision Language Model Training Paper • 2503.22655 • Published Mar 28 • 39
OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation Paper • 2505.03912 • Published May 6 • 9
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning Paper • 2505.12448 • Published May 18 • 10
VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL Paper • 2505.15791 • Published May 21 • 6
Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors Paper • 2508.08896 • Published Aug 12 • 10
QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning Paper • 2412.15576 • Published Dec 20, 2024
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model Paper • 2509.09372 • Published Sep 11 • 242
Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation Paper • 2508.19958 • Published Aug 27
High-Fidelity Simulated Data Generation for Real-World Zero-Shot Robotic Manipulation Learning with Gaussian Splatting Paper • 2510.10637 • Published Oct 12 • 12
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models Paper • 2512.09928 • Published 14 days ago • 11
RynnVLA-002: A Unified Vision-Language-Action and World Model Paper • 2511.17502 • Published Nov 21 • 25
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation Paper • 2509.15212 • Published Sep 18 • 21
Exploring the Evolution of Physics Cognition in Video Generation: A Survey Paper • 2503.21765 • Published Mar 27 • 11
Accelerating Diffusion Transformers with Token-wise Feature Caching Paper • 2410.05317 • Published Oct 5, 2024
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration Paper • 2411.17686 • Published Nov 26, 2024 • 19
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction Paper • 2412.06782 • Published Dec 9, 2024 • 7
PiTe: Pixel-Temporal Alignment for Large Video-Language Model Paper • 2409.07239 • Published Sep 11, 2024 • 15
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference Paper • 2403.14520 • Published Mar 21, 2024 • 35