readings
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper
• 2408.11796
• Published
• 58
TableBench: A Comprehensive and Complex Benchmark for Table Question
Answering
Paper
• 2408.09174
• Published
• 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper
• 2408.10914
• Published
• 45
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
• 2408.11878
• Published
• 63
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting
Mitigation
Paper
• 2408.14572
• Published
• 8
SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Paper
• 2408.15545
• Published
• 38
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
• 2409.02889
• Published
• 54
LongCite: Enabling LLMs to Generate Fine-grained Citations in
Long-context QA
Paper
• 2409.02897
• Published
• 48
Attention Heads of Large Language Models: A Survey
Paper
• 2409.03752
• Published
• 92
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free
Real Image Editing
Paper
• 2409.01322
• Published
• 96
Towards a Unified View of Preference Learning for Large Language Models:
A Survey
Paper
• 2409.02795
• Published
• 72
Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized
Academic Assistance
Paper
• 2409.04593
• Published
• 26
ProteinBench: A Holistic Evaluation of Protein Foundation Models
Paper
• 2409.06744
• Published
• 8
Qwen2.5-Coder Technical Report
Paper
• 2409.12186
• Published
• 153
Training Language Models to Self-Correct via Reinforcement Learning
Paper
• 2409.12917
• Published
• 140
HelloBench: Evaluating Long Text Generation Capabilities of Large
Language Models
Paper
• 2409.16191
• Published
• 41
Making Text Embedders Few-Shot Learners
Paper
• 2409.15700
• Published
• 29
Instruction Following without Instruction Tuning
Paper
• 2409.14254
• Published
• 29
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Paper
• 2410.00531
• Published
• 33
From Code to Correctness: Closing the Last Mile of Code Generation with
Hierarchical Debugging
Paper
• 2410.01215
• Published
• 39
Not All LLM Reasoners Are Created Equal
Paper
• 2410.01748
• Published
• 29
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
Paper
• 2410.01044
• Published
• 35
Training Language Models on Synthetic Edit Sequences Improves Code
Synthesis
Paper
• 2410.02749
• Published
• 13
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference
Acceleration
Paper
• 2410.02367
• Published
• 50
Addition is All You Need for Energy-efficient Language Models
Paper
• 2410.00907
• Published
• 151
Selective Attention Improves Transformer
Paper
• 2410.02703
• Published
• 25
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper
• 2410.08164
• Published
• 26
Toward General Instruction-Following Alignment for Retrieval-Augmented
Generation
Paper
• 2410.09584
• Published
• 48
A Unified View of Delta Parameter Editing in Post-Trained Large-Scale
Models
Paper
• 2410.13841
• Published
• 16
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
• 2410.12381
• Published
• 43
Revealing the Barriers of Language Agents in Planning
Paper
• 2410.12409
• Published
• 27
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for
Contrastive Loss
Paper
• 2410.17243
• Published
• 92
Why Does the Effective Context Length of LLMs Fall Short?
Paper
• 2410.18745
• Published
• 17
Robots Pre-train Robots: Manipulation-Centric Robotic Representation
from Large-Scale Robot Dataset
Paper
• 2410.22325
• Published
• 10
A Large Recurrent Action Model: xLSTM enables Fast Inference for
Robotics Tasks
Paper
• 2410.22391
• Published
• 22
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM
Data Contamination
Paper
• 2411.03823
• Published
• 49
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle
Grandmaster Level
Paper
• 2411.03562
• Published
• 69
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge
in RAG Systems
Paper
• 2411.02959
• Published
• 71
Let the Flows Tell: Solving Graph Combinatorial Optimization Problems
with GFlowNets
Paper
• 2305.17010
• Published
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Paper
• 2411.04905
• Published
• 127
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test
Generation: An Empirical Study
Paper
• 2411.02462
• Published
• 10
Large Language Models Can Self-Improve in Long-context Reasoning
Paper
• 2411.08147
• Published
• 65
Cut Your Losses in Large-Vocabulary Language Models
Paper
• 2411.09009
• Published
• 49
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical
Prediction?
Paper
• 2411.06469
• Published
• 17
SlimLM: An Efficient Small Language Model for On-Device Document
Assistance
Paper
• 2411.09944
• Published
• 12
SageAttention2 Technical Report: Accurate 4 Bit Attention for
Plug-and-play Inference Acceleration
Paper
• 2411.10958
• Published
• 57
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published
• 87
Hymba: A Hybrid-head Architecture for Small Language Models
Paper
• 2411.13676
• Published
• 47
Natural Language Reinforcement Learning
Paper
• 2411.14251
• Published
• 31
Cautious Optimizers: Improving Training with One Line of Code
Paper
• 2411.16085
• Published
• 19
Predicting Emergent Capabilities by Finetuning
Paper
• 2411.16035
• Published
• 7
Star Attention: Efficient LLM Inference over Long Sequences
Paper
• 2411.17116
• Published
• 53
o1-Coder: an o1 Replication for Coding
Paper
• 2412.00154
• Published
• 44
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's
Reasoning Capability
Paper
• 2411.19943
• Published
• 62
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
• 2412.04467
• Published
• 117
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and
Proactive Robotic Failure Detection
Paper
• 2412.04455
• Published
• 38
Personalized Multimodal Large Language Models: A Survey
Paper
• 2412.02142
• Published
• 13
Evaluating Language Models as Synthetic Data Generators
Paper
• 2412.03679
• Published
• 47
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
• 2412.05271
• Published
• 160
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at
Scale
Paper
• 2412.05237
• Published
• 46
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
Paper
• 2412.04862
• Published
• 50
Moto: Latent Motion Token as the Bridging Language for Robot
Manipulation
Paper
• 2412.04445
• Published
• 22
Evaluating and Aligning CodeLLMs on Human Preference
Paper
• 2412.05210
• Published
• 50
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published
• 38
Paper
• 2412.08905
• Published
• 122
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions
Paper
• 2412.09596
• Published
• 97
GenEx: Generating an Explorable World
Paper
• 2412.09624
• Published
• 98
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World
Tasks
Paper
• 2412.14161
• Published
• 51
Paper
• 2412.15115
• Published
• 377
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic
Long-context Multitasks
Paper
• 2412.15204
• Published
• 38
How to Synthesize Text Data without Model Collapse?
Paper
• 2412.14689
• Published
• 53
Offline Reinforcement Learning for LLM Multi-Step Reasoning
Paper
• 2412.16145
• Published
• 38
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
Paper
• 2412.13649
• Published
• 21
B-STaR: Monitoring and Balancing Exploration and Exploitation in
Self-Taught Reasoners
Paper
• 2412.17256
• Published
• 47
RobustFT: Robust Supervised Fine-tuning for Large Language Models under
Noisy Response
Paper
• 2412.14922
• Published
• 88
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published
• 42
Revisiting In-Context Learning with Long Context Language Models
Paper
• 2412.16926
• Published
• 32
Outcome-Refining Process Supervision for Code Generation
Paper
• 2412.15118
• Published
• 19
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
Paper
• 2412.17498
• Published
• 22
NILE: Internal Consistency Alignment in Large Language Models
Paper
• 2412.16686
• Published
• 8
LearnLM: Improving Gemini for Learning
Paper
• 2412.16429
• Published
• 22
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital
World
Paper
• 2412.17589
• Published
• 14
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D
Scene Understanding
Paper
• 2412.18450
• Published
• 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization
Paper
• 2412.17739
• Published
• 41
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
Paper
• 2412.14711
• Published
• 16
Ensembling Large Language Models with Process Reward-Guided Tree Search
for Better Complex Reasoning
Paper
• 2412.15797
• Published
• 18
YuLan-Mini: An Open Data-efficient Language Model
Paper
• 2412.17743
• Published
• 66
Molar: Multimodal LLMs with Collaborative Filtering Alignment for
Enhanced Sequential Recommendation
Paper
• 2412.18176
• Published
• 16
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published
• 18
Explanatory Instructions: Towards Unified Vision Tasks Understanding and
Zero-shot Generalization
Paper
• 2412.18525
• Published
• 74
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper
• 2412.20993
• Published
• 36
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on
Self-invoking Code Generation
Paper
• 2412.21199
• Published
• 13
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse
Task Synthesis
Paper
• 2412.19723
• Published
• 87
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining
Paper
• 2501.00958
• Published
• 109
CodeElo: Benchmarking Competition-level Code Generation of LLMs with
Human-comparable Elo Ratings
Paper
• 2501.01257
• Published
• 51
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent
Diffusion Models
Paper
• 2501.01423
• Published
• 44
ProgCo: Program Helps Self-Correction of Large Language Models
Paper
• 2501.01264
• Published
• 26
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for
Real-World Video Super-Resolution
Paper
• 2501.02976
• Published
• 56
BoostStep: Boosting mathematical capability of Large Language Models via
improved single-step reasoning
Paper
• 2501.03226
• Published
• 43
Test-time Computing: from System-1 Thinking to System-2 Thinking
Paper
• 2501.02497
• Published
• 45
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
Models
Paper
• 2501.03262
• Published
• 104
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published
• 44
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
• 2501.03895
• Published
• 52
Cosmos World Foundation Model Platform for Physical AI
Paper
• 2501.03575
• Published
• 82
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Paper
• 2501.03936
• Published
• 23
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published
• 41
Enhancing Human-Like Responses in Large Language Models
Paper
• 2501.05032
• Published
• 61
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub
Issue Resolution
Paper
• 2501.05040
• Published
• 15
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper
• 2501.05874
• Published
• 75
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
The Lessons of Developing Process Reward Models in Mathematical
Reasoning
Paper
• 2501.07301
• Published
• 100
Tensor Product Attention Is All You Need
Paper
• 2501.06425
• Published
• 90
Transformer^2: Self-adaptive LLMs
Paper
• 2501.06252
• Published
• 55
WebWalker: Benchmarking LLMs in Web Traversal
Paper
• 2501.07572
• Published
• 23
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical
Reasoning
Paper
• 2501.06458
• Published
• 31
Towards Best Practices for Open Datasets for LLM Training
Paper
• 2501.08365
• Published
• 62
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published
• 30
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising
Steps
Paper
• 2501.09732
• Published
• 72
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with
Large Language Models
Paper
• 2501.09686
• Published
• 41
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published
• 28
Evolving Deeper LLM Thinking
Paper
• 2501.09891
• Published
• 115
PaSa: An LLM Agent for Comprehensive Academic Paper Search
Paper
• 2501.10120
• Published
• 54
Agent-R: Training Language Model Agents to Reflect via Iterative
Self-Training
Paper
• 2501.11425
• Published
• 109
Demons in the Detail: On Implementing Load Balancing Loss for Training
Specialized Mixture-of-Expert Models
Paper
• 2501.11873
• Published
• 67
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
Paper
• 2501.12948
• Published
• 440
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative
Textual Feedback
Paper
• 2501.12895
• Published
• 61
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
• 2501.13106
• Published
• 90
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Paper
• 2501.12599
• Published
• 126
Autonomy-of-Experts Models
Paper
• 2501.13074
• Published
• 44
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Paper
• 2501.13200
• Published
• 69
Sigma: Differential Rescaling of Query, Key and Value for Efficient
Language Models
Paper
• 2501.13629
• Published
• 48
Baichuan-Omni-1.5 Technical Report
Paper
• 2501.15368
• Published
• 60
Qwen2.5-1M Technical Report
Paper
• 2501.15383
• Published
• 72
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Paper
• 2501.16975
• Published
• 32
Optimizing Large Language Model Training Using FP4 Quantization
Paper
• 2501.17116
• Published
• 36
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training
Paper
• 2501.17161
• Published
• 124
Atla Selene Mini: A General Purpose Evaluation Model
Paper
• 2501.17195
• Published
• 35
Critique Fine-Tuning: Learning to Critique is More Effective than
Learning to Imitate
Paper
• 2501.17703
• Published
• 59
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Paper
• 2501.18585
• Published
• 61
s1: Simple test-time scaling
Paper
• 2501.19393
• Published
• 124
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Paper
• 2501.19324
• Published
• 39
GuardReasoner: Towards Reasoning-based LLM Safeguards
Paper
• 2501.18492
• Published
• 88
The Differences Between Direct Alignment Algorithms are a Blur
Paper
• 2502.01237
• Published
• 113
Process Reinforcement through Implicit Rewards
Paper
• 2502.01456
• Published
• 62
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning
Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Paper
• 2502.01081
• Published
• 13
Scaling Embedding Layers in Language Models
Paper
• 2502.01637
• Published
• 24
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper
• 2502.02339
• Published
• 23
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion
Transformer
Paper
• 2502.01105
• Published
• 21
Large Language Model Guided Self-Debugging Code Generation
Paper
• 2502.02928
• Published
• 13
TwinMarket: A Scalable Behavioral and Social Simulation for Financial
Markets
Paper
• 2502.01506
• Published
• 38
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Paper
• 2502.02737
• Published
• 255
Demystifying Long Chain-of-Thought Reasoning in LLMs
Paper
• 2502.03373
• Published
• 58
LIMO: Less is More for Reasoning
Paper
• 2502.03387
• Published
• 62
ConceptAttention: Diffusion Transformers Learn Highly Interpretable
Features
Paper
• 2502.04320
• Published
• 36
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet
Paper
• 2501.19085
• Published
• 5
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time
Scaling
Paper
• 2502.06703
• Published
• 152
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data
Annotators
Paper
• 2502.06394
• Published
• 89
Exploring the Limit of Outcome Reward for Learning Mathematical
Reasoning
Paper
• 2502.06781
• Published
• 58
Lossless Acceleration of Large Language Models with Hierarchical
Drafting based on Temporal Locality in Speculative Decoding
Paper
• 2502.05609
• Published
• 18
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and
Generation
Paper
• 2502.05415
• Published
• 20
Paper
• 2502.06049
• Published
• 31
The Hidden Life of Tokens: Reducing Hallucination of Large
Vision-Language Models via Visual Information Steering
Paper
• 2502.03628
• Published
• 12
Paper
• 2502.06786
• Published
• 32
History-Guided Video Diffusion
Paper
• 2502.06764
• Published
• 12
CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for
Zero-Shot Customized Video Diffusion Transformers
Paper
• 2502.06527
• Published
• 11
The Curse of Depth in Large Language Models
Paper
• 2502.05795
• Published
• 40
MetaChain: A Fully-Automated and Zero-Code Framework for LLM Agents
Paper
• 2502.05957
• Published
• 15
Competitive Programming with Large Reasoning Models
Paper
• 2502.06807
• Published
• 69
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Paper
• 2502.07316
• Published
• 50
Teaching Language Models to Critique via Reinforcement Learning
Paper
• 2502.03492
• Published
• 24
Expect the Unexpected: FailSafe Long Context QA for Finance
Paper
• 2502.06329
• Published
• 133
Scaling Pre-training to One Hundred Billion Data for Vision Language
Models
Paper
• 2502.07617
• Published
• 29
LLMs Can Easily Learn to Reason from Demonstrations Structure, not
content, is what matters!
Paper
• 2502.07374
• Published
• 40
Retrieval-augmented Large Language Models for Financial Time Series
Forecasting
Paper
• 2502.05878
• Published
• 40
Hephaestus: Improving Fundamental Agent Capabilities of Large Language
Models through Continual Pre-Training
Paper
• 2502.06589
• Published
• 21
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Paper
• 2502.07445
• Published
• 11
TransMLA: Multi-head Latent Attention Is All You Need
Paper
• 2502.07864
• Published
• 57
Distillation Scaling Laws
Paper
• 2502.08606
• Published
• 47
LLM Pretraining with Continuous Concepts
Paper
• 2502.08524
• Published
• 30
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on
a Single GPU
Paper
• 2502.08910
• Published
• 148
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient
Text-to-Image Generation
Paper
• 2502.08690
• Published
• 43
SelfCite: Self-Supervised Alignment for Context Attribution in Large
Language Models
Paper
• 2502.09604
• Published
• 37
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Paper
• 2502.09620
• Published
• 26
Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging -- An Open Recipe
Paper
• 2502.09056
• Published
• 31
Logical Reasoning in Large Language Models: A Survey
Paper
• 2502.09100
• Published
• 24
DexTrack: Towards Generalizable Neural Tracking Control for Dexterous
Manipulation from Human References
Paper
• 2502.09614
• Published
• 9
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights
Paper
• 2502.09619
• Published
• 36
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of
Physical Concept Understanding
Paper
• 2502.08946
• Published
• 191
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published
• 43
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
• 2502.10391
• Published
• 34
Diverse Inference and Verification for Advanced Reasoning
Paper
• 2502.09955
• Published
• 18
AdaPTS: Adapting Univariate Foundation Models to Probabilistic
Multivariate Time Series Forecasting
Paper
• 2502.10235
• Published
• 9
We Can't Understand AI Using our Existing Vocabulary
Paper
• 2502.07586
• Published
• 11
FoNE: Precise Single-Token Number Embeddings via Fourier Features
Paper
• 2502.09741
• Published
• 15
Region-Adaptive Sampling for Diffusion Transformers
Paper
• 2502.10389
• Published
• 53
Large Language Diffusion Models
Paper
• 2502.09992
• Published
• 126
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention
Paper
• 2502.11089
• Published
• 167
Learning Getting-Up Policies for Real-World Humanoid Robots
Paper
• 2502.12152
• Published
• 42
ReLearn: Unlearning via Learning for Large Language Models
Paper
• 2502.11190
• Published
• 30
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
• 2502.12115
• Published
• 46
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and
Generation
Paper
• 2502.12148
• Published
• 17
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on
Continual Pre-Training
Paper
• 2502.11196
• Published
• 23
SURGE: On the Potential of Large Language Models as General-Purpose
Surrogate Code Executors
Paper
• 2502.11167
• Published
• 10
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising
Trajectory Sharpening
Paper
• 2502.12146
• Published
• 16
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning
in Diffusion Models
Paper
• 2502.10458
• Published
• 38
Intuitive physics understanding emerges from self-supervised pretraining
on natural videos
Paper
• 2502.11831
• Published
• 20
CRANE: Reasoning with constrained LLM generation
Paper
• 2502.09061
• Published
• 21
Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with
Reinforcement Learning
Paper
• 2502.10550
• Published
• 8
MagicArticulate: Make Your 3D Models Articulation-Ready
Paper
• 2502.12135
• Published
• 8
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
• 2502.12900
• Published
• 86
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and
Object Manipulation
Paper
• 2502.13143
• Published
• 31
Multimodal Mamba: Decoder-only Multimodal State Space Model via
Quadratic to Linear Distillation
Paper
• 2502.13145
• Published
• 38
FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning
for Financial Trading
Paper
• 2502.11433
• Published
• 36
You Do Not Fully Utilize Transformer's Representation Capacity
Paper
• 2502.09245
• Published
• 37
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly
Possess Test-Time Scaling Capabilities?
Paper
• 2502.12215
• Published
• 16
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published
• 214
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song
Generation
Paper
• 2502.13128
• Published
• 41
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Paper
• 2502.13347
• Published
• 30
Small Models Struggle to Learn from Strong Reasoners
Paper
• 2502.12143
• Published
• 39
Is That Your Final Answer? Test-Time Scaling Improves Selective Question
Answering
Paper
• 2502.13962
• Published
• 28
AdaptiveStep: Automatically Dividing Reasoning Step through Model
Confidence
Paper
• 2502.13943
• Published
• 8
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic
Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published
• 158
S*: Test Time Scaling for Code Generation
Paper
• 2502.14382
• Published
• 63
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Paper
• 2502.14502
• Published
• 91
Does Time Have Its Place? Temporal Heads: Where Language Models Recall
Time-specific Information
Paper
• 2502.14258
• Published
• 26
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
• 2502.12084
• Published
• 35
SurveyX: Academic Survey Automation via Large Language Models
Paper
• 2502.14776
• Published
• 100
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and
Mixture-of-Experts Optimization Alignment
Paper
• 2502.16894
• Published
• 32
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
• 2502.17157
• Published
• 52
VideoGrain: Modulating Space-Time Attention for Multi-grained Video
Editing
Paper
• 2502.17258
• Published
• 79
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper
• 2502.18137
• Published
• 60
Kanana: Compute-efficient Bilingual Language Models
Paper
• 2502.18934
• Published
• 65
Self-rewarding correction for mathematical reasoning
Paper
• 2502.19613
• Published
• 82
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published
• 89
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
• 2503.01785
• Published
• 86
CodeArena: A Collective Evaluation Platform for LLM Code Generation
Paper
• 2503.01295
• Published
• 8
Babel: Open Multilingual Large Language Models Serving Over 90% of
Global Speakers
Paper
• 2503.00865
• Published
• 64
ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Paper
• 2503.00329
• Published
• 20
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published
• 96
EgoLife: Towards Egocentric Life Assistant
Paper
• 2503.03803
• Published
• 46
START: Self-taught Reasoner with Tools
Paper
• 2503.04625
• Published
• 113
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Paper
• 2503.04724
• Published
• 72
Unified Reward Model for Multimodal Understanding and Generation
Paper
• 2503.05236
• Published
• 123
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive
Cognitive-Inspired Sketching
Paper
• 2503.05179
• Published
• 46
Forgetting Transformer: Softmax Attention with a Forget Gate
Paper
• 2503.02130
• Published
• 32
BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation
for Everyday Household Activities
Paper
• 2503.05652
• Published
• 11
LoRACode: LoRA Adapters for Code Embeddings
Paper
• 2503.05315
• Published
• 13
Feature-Level Insights into Artificial Text Detection with Sparse
Autoencoders
Paper
• 2503.03601
• Published
• 232
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation
for Feature Implementation
Paper
• 2503.06680
• Published
• 20
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published
• 88
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
• 2503.08625
• Published
• 27
Gemini Embedding: Generalizable Embeddings from Gemini
Paper
• 2503.07891
• Published
• 45
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Paper
• 2503.07604
• Published
• 23
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
Paper
• 2503.07572
• Published
• 48
AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion
Models
Paper
• 2503.08417
• Published
• 8
Mixture of Experts Made Intrinsically Interpretable
Paper
• 2503.07639
• Published
• 10
AI-native Memory 2.0: Second Me
Paper
• 2503.08102
• Published
• 13
TPDiff: Temporal Pyramid Video Diffusion Model
Paper
• 2503.09566
• Published
• 45
Block Diffusion: Interpolating Between Autoregressive and Diffusion
Language Models
Paper
• 2503.09573
• Published
• 76
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based
VLM Agent Training
Paper
• 2503.08525
• Published
• 17
More Documents, Same Length: Isolating the Challenge of Multiple
Documents in RAG
Paper
• 2503.04388
• Published
• 17
Quantizing Large Language Models for Code Generation: A Differentiated
Replication
Paper
• 2503.07103
• Published
• 8
Adversarial Data Collection: Human-Collaborative Perturbations for
Efficient and Robust Robotic Imitation Learning
Paper
• 2503.11646
• Published
• 34
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal
Control
Paper
• 2503.14492
• Published
• 20
φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time
Exploration and Exploitation
Paper
• 2503.13288
• Published
• 51
Stop Overthinking: A Survey on Efficient Reasoning for Large Language
Models
Paper
• 2503.16419
• Published
• 77
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Paper
• 2503.14487
• Published
• 28
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published
• 50
Expert Race: A Flexible Routing Strategy for Scaling Diffusion
Transformer with Mixture of Experts
Paper
• 2503.16057
• Published
• 14
Paper
• 2503.16425
• Published
• 16
BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space
Complexity?
Paper
• 2503.15242
• Published
• 10
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
• 2503.16365
• Published
• 41
Reinforcement Learning for Reasoning in Small LLMs: What Works and What
Doesn't
Paper
• 2503.16219
• Published
• 52
I Have Covered All the Bases Here: Interpreting Reasoning Features in
Large Language Models via Sparse Autoencoders
Paper
• 2503.18878
• Published
• 119
Video-T1: Test-Time Scaling for Video Generation
Paper
• 2503.18942
• Published
• 90
AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and
Symbolic Reasoning
Paper
• 2503.18769
• Published
• 11
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
• 2503.18931
• Published
• 30
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Paper
• 2503.19325
• Published
• 73
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published
• 31
Scaling Vision Pre-Training to 4K Resolution
Paper
• 2503.19903
• Published
• 41
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published
• 62
Large Language Model Agent: A Survey on Methodology, Applications and
Challenges
Paper
• 2503.21460
• Published
• 83
ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large
Reasoning Models with Iterative Retrieval Augmented Generation
Paper
• 2503.21729
• Published
• 29
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
• 2503.21696
• Published
• 23
Think Before Recommend: Unleashing the Latent Reasoning Power for
Sequential Recommendation
Paper
• 2503.22675
• Published
• 36
Segment Any Motion in Videos
Paper
• 2503.22268
• Published
• 19
Your ViT is Secretly an Image Segmentation Model
Paper
• 2503.19108
• Published
• 25
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large
Language Models
Paper
• 2503.24235
• Published
• 54
Any2Caption: Interpreting Any Condition to Caption for Controllable Video
Generation
Paper
• 2503.24379
• Published
• 76
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published
• 38
Z1: Efficient Test-time Scaling with Code
Paper
• 2504.00810
• Published
• 26
Paper
• 2504.00927
• Published
• 56
Scaling Language-Free Visual Representation Learning
Paper
• 2504.01017
• Published
• 32
Command A: An Enterprise-Ready Large Language Model
Paper
• 2504.00698
• Published
• 29
MergeVQ: A Unified Framework for Visual Generation and Representation
with Disentangled Token Merging and Quantization
Paper
• 2504.00999
• Published
• 95
ScholarCopilot: Training Large Language Models for Academic Writing with
Accurate Citations
Paper
• 2504.00824
• Published
• 43
PaperBench: Evaluating AI's Ability to Replicate AI Research
Paper
• 2504.01848
• Published
• 37
Articulated Kinematics Distillation from Video Diffusion Models
Paper
• 2504.01204
• Published
• 23
Advances and Challenges in Foundation Agents: From Brain-Inspired
Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper
• 2504.01990
• Published
• 303
ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Paper
• 2504.02507
• Published
• 88
SmolVLM: Redefining small and efficient multimodal models
Paper
• 2504.05299
• Published
• 205
URECA: Unique Region Caption Anything
Paper
• 2504.05305
• Published
• 35
DDT: Decoupled Diffusion Transformer
Paper
• 2504.05741
• Published
• 77
RobustDexGrasp: Robust Dexterous Grasping of General Objects from
Single-view Perception
Paper
• 2504.05287
• Published
• 6
Paper
• 2504.07491
• Published
• 137
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Paper
• 2504.07128
• Published
• 87
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Paper
• 2504.08685
• Published
• 130
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for
Autoregressive Image Generation
Paper
• 2504.08736
• Published
• 46
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
• 2504.08388
• Published
• 42
PixelFlow: Pixel-Space Generative Models with Flow
Paper
• 2504.07963
• Published
• 18
Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning
vs. Memorization in Large Language Models
Paper
• 2504.05262
• Published
• 11
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Paper
• 2504.05303
• Published
• 5
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday
Home Clusters
Paper
• 2504.08791
• Published
• 139
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
• 2504.09925
• Published
• 39
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published
• 48
Does Reinforcement Learning Really Incentivize Reasoning Capacity in
LLMs Beyond the Base Model?
Paper
• 2504.13837
• Published
• 139
Learning to Reason under Off-Policy Guidance
Paper
• 2504.14945
• Published
• 88
Eagle 2.5: Boosting Long-Context Post-Training for Frontier
Vision-Language Models
Paper
• 2504.15271
• Published
• 67
ToolRL: Reward is All Tool Learning Needs
Paper
• 2504.13958
• Published
• 49
Kuwain 1.5B: An Arabic SLM via Language Injection
Paper
• 2504.15120
• Published
• 121
Describe Anything: Detailed Localized Image and Video Captioning
Paper
• 2504.16072
• Published
• 64
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
• 2504.17432
• Published
• 40
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for
Scalable and Generalizable Robot Learning
Paper
• 2504.18904
• Published
• 9
Voila: Voice-Language Foundation Models for Real-Time Autonomous
Interaction and Voice Role-Play
Paper
• 2505.02707
• Published
• 85
Vision-Language-Action Models: Concepts, Progress, Applications and
Challenges
Paper
• 2505.04769
• Published
• 10
Bielik v3 Small: Technical Report
Paper
• 2505.02550
• Published
• 68
Bielik 11B v2 Technical Report
Paper
• 2505.02410
• Published
• 54
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Paper
• 2505.06111
• Published
• 25
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published
• 155
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Paper
• 2505.04410
• Published
• 44
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
• 2505.09568
• Published
• 99
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large
Reasoning Models
Paper
• 2505.10554
• Published
• 120
EnerVerse-AC: Envisioning Embodied Environments with Action Condition
Paper
• 2505.09723
• Published
• 23
EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied
World Models
Paper
• 2505.09694
• Published
• 20
Paper
• 2505.09388
• Published
• 335
Chain-of-Model Learning for Language Model
Paper
• 2505.11820
• Published
• 121
AdaptThink: Reasoning Models Can Learn When to Think
Paper
• 2505.13417
• Published
• 83
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta
Correction
Paper
• 2505.11254
• Published
• 48
Faster Video Diffusion with Trainable Sparse Attention
Paper
• 2505.13389
• Published
• 38
Model Merging in Pre-training of Large Language Models
Paper
• 2505.12082
• Published
• 40
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published
• 133
SageAttention3: Microscaling FP4 Attention for Inference and An
Exploration of 8-Bit Training
Paper
• 2505.11594
• Published
• 75
Scaling Law for Quantization-Aware Training
Paper
• 2505.14302
• Published
• 76
MMaDA: Multimodal Large Diffusion Language Models
Paper
• 2505.15809
• Published
• 98
Diffusion vs. Autoregressive Language Models: A Text Embedding
Perspective
Paper
• 2505.15045
• Published
• 55
This Time is Different: An Observability Perspective on Time Series
Foundation Models
Paper
• 2505.14766
• Published
• 40
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop
System from Hypothesis to Verification
Paper
• 2505.16938
• Published
• 121
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement
Learning
Paper
• 2505.16410
• Published
• 58
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Paper
• 2505.16933
• Published
• 34
How Do Large Vision-Language Models See Text in Image? Unveiling the
Distinctive Role of OCR Heads
Paper
• 2505.15865
• Published
• 5
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper
• 2505.18129
• Published
• 62
Model Already Knows the Best Noise: Bayesian Active Noise Selection via
Attention in Video Diffusion Model
Paper
• 2505.17561
• Published
• 31
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Paper
• 2505.19147
• Published
• 145
Mutarjim: Advancing Bidirectional Arabic-English Translation with a
Small Language Model
Paper
• 2505.17894
• Published
• 220
Embodied Agents Meet Personalization: Exploring Memory Utilization for
Personalized Assistance
Paper
• 2505.16348
• Published
• 52
ARM: Adaptive Reasoning Model
Paper
• 2505.20258
• Published
• 45
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications
of Agentic AI
Paper
• 2505.19443
• Published
• 15
Paper2Poster: Towards Multimodal Poster Automation from Scientific
Papers
Paper
• 2505.21497
• Published
• 109
Exploring the Latent Capacity of LLMs for One-Step Text Generation
Paper
• 2505.21189
• Published
• 61
ATLAS: Learning to Optimally Memorize the Context at Test Time
Paper
• 2505.23735
• Published
• 23
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in
Robotics
Paper
• 2506.00070
• Published
• 29
Paper
• 2506.03569
• Published
• 80
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Paper
• 2505.16968
• Published
• 40
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
Paper
• 2506.04089
• Published
• 47
VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code
Generation
Paper
• 2506.03930
• Published
• 26
Qwen3 Embedding: Advancing Text Embedding and Reranking Through
Foundation Models
Paper
• 2506.05176
• Published
• 79
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly
Licensed Text
Paper
• 2506.05209
• Published
• 60
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language
Models for Robotics
Paper
• 2506.04308
• Published
• 43
Reinforcement Pre-Training
Paper
• 2506.08007
• Published
• 263
MiniCPM4: Ultra-Efficient LLMs on End Devices
Paper
• 2506.07900
• Published
• 95
SpatialLM: Training Large Language Models for Structured Indoor Modeling
Paper
• 2506.07491
• Published
• 50
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal
Learning
Paper
• 2506.06205
• Published
• 30
PlayerOne: Egocentric World Simulator
Paper
• 2506.09995
• Published
• 34
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling
Paradigms for Text-to-Music Generation
Paper
• 2506.08570
• Published
• 33
Paper
• 2506.10892
• Published
• 37
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning
Attention
Paper
• 2506.13585
• Published
• 273
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published
• 27
WorldVLA: Towards Autoregressive Action World Model
Paper
• 2506.21539
• Published
• 40
Where to find Grokking in LLM Pretraining? Monitor
Memorization-to-Generalization without Test
Paper
• 2506.21551
• Published
• 28
Ark: An Open-source Python-based Framework for Robot Learning
Paper
• 2506.21628
• Published
• 16
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
Depth Anything at Any Condition
Paper
• 2507.01634
• Published
• 49
A Survey on Vision-Language-Action Models: An Action Tokenization
Perspective
Paper
• 2507.01925
• Published
• 39
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
Paper
• 2507.05964
• Published
• 120
PhysX: Physical-Grounded 3D Asset Generation
Paper
• 2507.12465
• Published
• 44
A Survey of Context Engineering for Large Language Models
Paper
• 2507.13334
• Published
• 261
VisionThink: Smart and Efficient Vision Language Model via Reinforcement
Learning
Paper
• 2507.13348
• Published
• 79
Set Block Decoding is a Language Model Inference Accelerator
Paper
• 2509.04185
• Published
• 54
Why Language Models Hallucinate
Paper
• 2509.04664
• Published
• 196
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action
Model
Paper
• 2509.09372
• Published
• 246
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Paper
• 2509.09674
• Published
• 80
FLOWER: Democratizing Generalist Robot Policies with Efficient
Vision-Language-Action Flow Policies
Paper
• 2509.04996
• Published
• 15
QuantAgent: Price-Driven Multi-Agent LLMs for High-Frequency Trading
Paper
• 2509.09995
• Published
• 16
ByteWrist: A Parallel Robotic Wrist Enabling Flexible and
Anthropomorphic Motion for Confined Spaces
Paper
• 2509.18084
• Published
• 13
Residual Off-Policy RL for Finetuning Behavior Cloning Policies
Paper
• 2509.19301
• Published
• 19
DA^2: Depth Anything in Any Direction
Paper
• 2509.26618
• Published
• 26
LongCodeZip: Compress Long Context for Code Language Models
Paper
• 2510.00446
• Published
• 107
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Paper
• 2510.01623
• Published
• 12
ExGRPO: Learning to Reason from Experience
Paper
• 2510.02245
• Published
• 80
Paper2Video: Automatic Video Generation from Scientific Papers
Paper
• 2510.05096
• Published
• 119
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with
Holistic Platform and Adaptive Hybrid Policy Optimization
Paper
• 2510.08540
• Published
• 109
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Paper
• 2510.07242
• Published
• 30
Reinforcing Diffusion Models by Direct Group Preference Optimization
Paper
• 2510.08425
• Published
• 12
DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via
Joint-Wise Neural Dynamics Model
Paper
• 2510.08556
• Published
• 7
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized
Manipulation
Paper
• 2510.08547
• Published
• 5
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to
Embodied AI
Paper
• 2510.05684
• Published
• 143
Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence
Reweighting
Paper
• 2510.08696
• Published
• 15
Robot Learning: A Tutorial
Paper
• 2510.12403
• Published
• 123
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action
Models
Paper
• 2510.13626
• Published
• 46
ParallelBench: Understanding the Trade-offs of Parallel Decoding in
Diffusion LLMs
Paper
• 2510.04767
• Published
• 28
The Art of Scaling Reinforcement Learning Compute for LLMs
Paper
• 2510.13786
• Published
• 32
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
Paper
• 2510.14943
• Published
• 40
VLA^2: Empowering Vision-Language-Action Models with an Agentic
Framework for Unseen Concept Manipulation
Paper
• 2510.14902
• Published
• 17
VLA-0: Building State-of-the-Art VLAs with Zero Modification
Paper
• 2510.13054
• Published
• 16
SimKO: Simple Pass@K Policy Optimization
Paper
• 2510.14807
• Published
• 11
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Paper
• 2510.14974
• Published
• 10
AnyUp: Universal Feature Upsampling
Paper
• 2510.12764
• Published
• 12
Scaling Instruction-Based Video Editing with a High-Quality Synthetic
Dataset
Paper
• 2510.15742
• Published
• 51
LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Paper
• 2510.15868
• Published
• 27
RL makes MLLMs see better than SFT
Paper
• 2510.16333
• Published
• 49
Chronos-2: From Univariate to Universal Forecasting
Paper
• 2510.15821
• Published
• 22
Visual Autoregressive Models Beat Diffusion Models on Inference Time
Scaling
Paper
• 2510.16751
• Published
• 21
RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Paper
• 2510.23763
• Published
• 56
Exploring Conditions for Diffusion models in Robotic Control
Paper
• 2510.15510
• Published
• 40
π_RL: Online RL Fine-tuning for Flow-based
Vision-Language-Action Models
Paper
• 2510.25889
• Published
• 66
World Simulation with Video Foundation Models for Physical AI
Paper
• 2511.00062
• Published
• 44
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete
Denoising Diffusion Process
Paper
• 2511.01718
• Published
• 7
Don't Blind Your VLA: Aligning Visual Representations for OOD
Generalization
Paper
• 2510.25616
• Published
• 105
Robot Learning from a Physical World Model
Paper
• 2511.07416
• Published
• 32
Depth Anything 3: Recovering the Visual Space from Any Views
Paper
• 2511.10647
• Published
• 99
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Paper
• 2511.15605
• Published
• 24
RynnVLA-002: A Unified Vision-Language-Action and World Model
Paper
• 2511.17502
• Published
• 28
MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
Paper
• 2511.17889
• Published
• 5
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Paper
• 2511.20937
• Published
• 16
Qwen3-VL Technical Report
Paper
• 2511.21631
• Published
• 158
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Paper
• 2512.02834
• Published
• 41
Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
Paper
• 2511.22345
• Published
• 13
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
Paper
• 2512.06963
• Published
• 4
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
Paper
• 2512.09928
• Published
• 14
LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator
Paper
• 2512.10605
• Published
• 7
Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
Paper
• 2512.06951
• Published
• 4
Memory in the Age of AI Agents
Paper
• 2512.13564
• Published
• 151
Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge
Paper
• 2512.10071
• Published
• 18
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Paper
• 2512.11891
• Published
• 10
Learning Robot Manipulation from Audio World Models
Paper
• 2512.08405
• Published
• 2
LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry
Paper
• 2512.19629
• Published
• 26
SOP: A Scalable Online Post-Training System for Vision-Language-Action Models
Paper
• 2601.03044
• Published
• 28
E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
Paper
• 2601.00423
• Published
• 11
AT^2PO: Agentic Turn-based Policy Optimization via Tree Search
Paper
• 2601.04767
• Published
• 28
Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes
Paper
• 2601.04300
• Published
• 3
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Paper
• 2601.05242
• Published
• 228
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Paper
• 2601.05241
• Published
• 24
Solar Open Technical Report
Paper
• 2601.07022
• Published
• 65
Paper
• 2601.08584
• Published
• 54
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
Paper
• 2512.24965
• Published
• 42
FlowAct-R1: Towards Interactive Humanoid Video Generation
Paper
• 2601.10103
• Published
• 74
Action100M: A Large-scale Video Action Dataset
Paper
• 2601.10592
• Published
• 29
ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Paper
• 2601.11404
• Published
• 26
FrankenMotion: Part-level Human Motion Generation and Composition
Paper
• 2601.10909
• Published
• 18
Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
Paper
• 2601.12993
• Published
• 75
BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Paper
• 2601.15197
• Published
• 54
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Paper
• 2601.16163
• Published
• 14
TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
Paper
• 2601.14133
• Published
• 61
A Pragmatic VLA Foundation Model
Paper
• 2601.18692
• Published
• 47
Advancing Open-source World Models
Paper
• 2601.20540
• Published
• 128
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Paper
• 2601.22153
• Published
• 71
Beyond Imitation: Reinforcement Learning for Active Latent Planning
Paper
• 2601.21598
• Published
• 9
DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
Paper
• 2601.20218
• Published
• 15
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
Paper
• 2602.00919
• Published
• 305
SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation
Paper
• 2602.02402
• Published
• 32
VLS: Steering Pretrained Robot Policies via Vision-Language Models
Paper
• 2602.03973
• Published
• 22
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Paper
• 2602.10098
• Published
• 18
PhyCritic: Multimodal Critic Models for Physical AI
Paper
• 2602.11124
• Published
• 52
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
Paper
• 2602.12099
• Published
• 57
χ₀: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
Paper
• 2602.09021
• Published
• 25
RISE: Self-Improving Robot Policy with Compositional World Model
Paper
• 2602.11075
• Published
• 30
EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration
Paper
• 2602.10106
• Published
• 21
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Paper
• 2602.16705
• Published
• 26
RynnBrain: Open Embodied Foundation Models
Paper
• 2602.14979
• Published
• 42
World Action Models are Zero-shot Policies
Paper
• 2602.15922
• Published
• 11
TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment
Paper
• 2602.13579
• Published
• 10
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Paper
• 2602.20309
• Published
• 10