readings - a GeonmoGu Collection

CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation

Paper • 2408.14572 • Published Aug 26, 2024 • 8

SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

Paper • 2408.15545 • Published Aug 28, 2024 • 38

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Paper • 2409.02889 • Published Sep 4, 2024 • 54

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Paper • 2409.02897 • Published Sep 4, 2024 • 48

Attention Heads of Large Language Models: A Survey

Paper • 2409.03752 • Published Sep 5, 2024 • 92

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Paper • 2409.01322 • Published Sep 2, 2024 • 96

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Paper • 2409.02795 • Published Sep 4, 2024 • 72

Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance

Paper • 2409.04593 • Published Sep 6, 2024 • 26

ProteinBench: A Holistic Evaluation of Protein Foundation Models

Paper • 2409.06744 • Published Sep 10, 2024 • 8

Qwen2.5-Coder Technical Report

Paper • 2409.12186 • Published Sep 18, 2024 • 153

Training Language Models to Self-Correct via Reinforcement Learning

Paper • 2409.12917 • Published Sep 19, 2024 • 140

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Paper • 2409.16191 • Published Sep 24, 2024 • 41

Making Text Embedders Few-Shot Learners

Paper • 2409.15700 • Published Sep 24, 2024 • 29

Instruction Following without Instruction Tuning

Paper • 2409.14254 • Published Sep 21, 2024 • 29

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Paper • 2410.00531 • Published Oct 1, 2024 • 33

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Paper • 2410.01215 • Published Oct 2, 2024 • 39

Not All LLM Reasoners Are Created Equal

Paper • 2410.01748 • Published Oct 2, 2024 • 29

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Paper • 2410.01044 • Published Oct 1, 2024 • 35

Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Paper • 2410.02749 • Published Oct 3, 2024 • 13

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Paper • 2410.02367 • Published Oct 3, 2024 • 50

Addition is All You Need for Energy-efficient Language Models

Paper • 2410.00907 • Published Oct 1, 2024 • 151

Selective Attention Improves Transformer

Paper • 2410.02703 • Published Oct 3, 2024 • 25

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Paper • 2410.08164 • Published Oct 10, 2024 • 26

Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Paper • 2410.09584 • Published Oct 12, 2024 • 48

A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models

Paper • 2410.13841 • Published Oct 17, 2024 • 16

HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks

Paper • 2410.12381 • Published Oct 16, 2024 • 43

Revealing the Barriers of Language Agents in Planning

Paper • 2410.12409 • Published Oct 16, 2024 • 27

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Paper • 2410.17243 • Published Oct 22, 2024 • 92

Why Does the Effective Context Length of LLMs Fall Short?

Paper • 2410.18745 • Published Oct 24, 2024 • 17

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

Paper • 2410.22325 • Published Oct 29, 2024 • 10

A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks

Paper • 2410.22391 • Published Oct 29, 2024 • 22

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

Paper • 2411.03823 • Published Nov 6, 2024 • 49

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Paper • 2411.03562 • Published Nov 5, 2024 • 69

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Paper • 2411.02959 • Published Nov 5, 2024 • 71

Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets

Paper • 2305.17010 • Published May 26, 2023

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Paper • 2411.04905 • Published Nov 7, 2024 • 127

Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

Paper • 2411.02462 • Published Nov 4, 2024 • 10

Large Language Models Can Self-Improve in Long-context Reasoning

Paper • 2411.08147 • Published Nov 12, 2024 • 65

Cut Your Losses in Large-Vocabulary Language Models

Paper • 2411.09009 • Published Nov 13, 2024 • 49

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Paper • 2411.06469 • Published Nov 10, 2024 • 17

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Paper • 2411.09944 • Published Nov 15, 2024 • 12

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

Paper • 2411.10958 • Published Nov 17, 2024 • 57

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15, 2024 • 87

Hymba: A Hybrid-head Architecture for Small Language Models

Paper • 2411.13676 • Published Nov 20, 2024 • 47

Natural Language Reinforcement Learning

Paper • 2411.14251 • Published Nov 21, 2024 • 31

Cautious Optimizers: Improving Training with One Line of Code

Paper • 2411.16085 • Published Nov 25, 2024 • 19

Predicting Emergent Capabilities by Finetuning

Paper • 2411.16035 • Published Nov 25, 2024 • 7

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 53

o1-Coder: an o1 Replication for Coding

Paper • 2412.00154 • Published Nov 29, 2024 • 44

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Paper • 2411.19943 • Published Nov 29, 2024 • 62

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Paper • 2412.04467 • Published Dec 5, 2024 • 117

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Paper • 2412.04455 • Published Dec 5, 2024 • 38

Personalized Multimodal Large Language Models: A Survey

Paper • 2412.02142 • Published Dec 3, 2024 • 13

Evaluating Language Models as Synthetic Data Generators

Paper • 2412.03679 • Published Dec 4, 2024 • 47

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 160

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Paper • 2412.05237 • Published Dec 6, 2024 • 46

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

Paper • 2412.04862 • Published Dec 6, 2024 • 50

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Paper • 2412.04445 • Published Dec 5, 2024 • 22

Evaluating and Aligning CodeLLMs on Human Preference

Paper • 2412.05210 • Published Dec 6, 2024 • 50

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Paper • 2412.08443 • Published Dec 11, 2024 • 38

Phi-4 Technical Report

Paper • 2412.08905 • Published Dec 12, 2024 • 122

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 97

GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 98

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published Dec 18, 2024 • 51

Qwen2.5 Technical Report

Paper • 2412.15115 • Published Dec 19, 2024 • 377

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Paper • 2412.15204 • Published Dec 19, 2024 • 38

How to Synthesize Text Data without Model Collapse?

Paper • 2412.14689 • Published Dec 19, 2024 • 53

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Paper • 2412.16145 • Published Dec 20, 2024 • 38

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Paper • 2412.13649 • Published Dec 18, 2024 • 21

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

Paper • 2412.17256 • Published Dec 23, 2024 • 47

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Paper • 2412.14922 • Published Dec 19, 2024 • 88

Diving into Self-Evolving Training for Multimodal Reasoning

Paper • 2412.17451 • Published Dec 23, 2024 • 42

Revisiting In-Context Learning with Long Context Language Models

Paper • 2412.16926 • Published Dec 22, 2024 • 32

Outcome-Refining Process Supervision for Code Generation

Paper • 2412.15118 • Published Dec 19, 2024 • 19

DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought

Paper • 2412.17498 • Published Dec 23, 2024 • 22

NILE: Internal Consistency Alignment in Large Language Models

Paper • 2412.16686 • Published Dec 21, 2024 • 8

LearnLM: Improving Gemini for Learning

Paper • 2412.16429 • Published Dec 21, 2024 • 22

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Paper • 2412.17589 • Published Dec 23, 2024 • 14

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Paper • 2412.18450 • Published Dec 24, 2024 • 36

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Paper • 2412.17739 • Published Dec 23, 2024 • 41

ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Paper • 2412.14711 • Published Dec 19, 2024 • 16

Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Paper • 2412.15797 • Published Dec 20, 2024 • 18

YuLan-Mini: An Open Data-efficient Language Model

Paper • 2412.17743 • Published Dec 23, 2024 • 66

Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

Paper • 2412.18176 • Published Dec 24, 2024 • 16

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Paper • 2412.18072 • Published Dec 24, 2024 • 18

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Paper • 2412.18525 • Published Dec 24, 2024 • 74

Efficiently Serving LLM Reasoning Programs with Certaindex

Paper • 2412.20993 • Published Dec 30, 2024 • 36

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Paper • 2412.21199 • Published Dec 30, 2024 • 13

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Paper • 2412.19723 • Published Dec 27, 2024 • 87

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published Jan 1, 2025 • 109

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Paper • 2501.01257 • Published Jan 2, 2025 • 51

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Paper • 2501.01423 • Published Jan 2, 2025 • 44

ProgCo: Program Helps Self-Correction of Large Language Models

Paper • 2501.01264 • Published Jan 2, 2025 • 26

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Paper • 2501.02976 • Published Jan 6, 2025 • 56

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Paper • 2501.03226 • Published Jan 6, 2025 • 43

Test-time Computing: from System-1 Thinking to System-2 Thinking

Paper • 2501.02497 • Published Jan 5, 2025 • 45

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Paper • 2501.03262 • Published Jan 4, 2025 • 104

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Paper • 2501.02955 • Published Jan 6, 2025 • 44

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52

Cosmos World Foundation Model Platform for Physical AI

Paper • 2501.03575 • Published Jan 7, 2025 • 82

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

Paper • 2501.03936 • Published Jan 7, 2025 • 23

An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41

Enhancing Human-Like Responses in Large Language Models

Paper • 2501.05032 • Published Jan 9, 2025 • 61

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

Paper • 2501.05040 • Published Jan 9, 2025 • 15

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10, 2025 • 65

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10, 2025 • 75

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Paper • 2501.05510 • Published Jan 9, 2025 • 44

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Paper • 2501.07301 • Published Jan 13, 2025 • 100

Tensor Product Attention Is All You Need

Paper • 2501.06425 • Published Jan 11, 2025 • 90

Transformer^2: Self-adaptive LLMs

Paper • 2501.06252 • Published Jan 9, 2025 • 55

WebWalker: Benchmarking LLMs in Web Traversal

Paper • 2501.07572 • Published Jan 13, 2025 • 23

O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

Paper • 2501.06458 • Published Jan 11, 2025 • 31

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14, 2025 • 62

MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Paper • 2501.08828 • Published Jan 15, 2025 • 30

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Paper • 2501.09732 • Published Jan 16, 2025 • 72

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Paper • 2501.09686 • Published Jan 16, 2025 • 41

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Paper • 2501.09747 • Published Jan 16, 2025 • 28

Evolving Deeper LLM Thinking

Paper • 2501.09891 • Published Jan 17, 2025 • 115

PaSa: An LLM Agent for Comprehensive Academic Paper Search

Paper • 2501.10120 • Published Jan 17, 2025 • 54

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Paper • 2501.11425 • Published Jan 20, 2025 • 109

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Paper • 2501.11873 • Published Jan 21, 2025 • 67

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper • 2501.12948 • Published Jan 22, 2025 • 440

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Paper • 2501.12895 • Published Jan 22, 2025 • 61

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Paper • 2501.13106 • Published Jan 22, 2025 • 90

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Paper • 2501.12599 • Published Jan 22, 2025 • 126

Autonomy-of-Experts Models

Paper • 2501.13074 • Published Jan 22, 2025 • 44

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Paper • 2501.13200 • Published Jan 22, 2025 • 69

Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models

Paper • 2501.13629 • Published Jan 23, 2025 • 48

Baichuan-Omni-1.5 Technical Report

Paper • 2501.15368 • Published Jan 26, 2025 • 60

Qwen2.5-1M Technical Report

Paper • 2501.15383 • Published Jan 26, 2025 • 72

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Paper • 2501.16975 • Published Jan 28, 2025 • 32

Optimizing Large Language Model Training Using FP4 Quantization

Paper • 2501.17116 • Published Jan 28, 2025 • 36

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Paper • 2501.17161 • Published Jan 28, 2025 • 124

Atla Selene Mini: A General Purpose Evaluation Model

Paper • 2501.17195 • Published Jan 27, 2025 • 35

Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

Paper • 2501.17703 • Published Jan 29, 2025 • 59

Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Paper • 2501.18585 • Published Jan 30, 2025 • 61

s1: Simple test-time scaling

Paper • 2501.19393 • Published Jan 31, 2025 • 124

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Paper • 2501.19324 • Published Jan 31, 2025 • 39

GuardReasoner: Towards Reasoning-based LLM Safeguards

Paper • 2501.18492 • Published Jan 30, 2025 • 88

The Differences Between Direct Alignment Algorithms are a Blur

Paper • 2502.01237 • Published Feb 3, 2025 • 113

Process Reinforcement through Implicit Rewards

Paper • 2502.01456 • Published Feb 3, 2025 • 62

The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles

Paper • 2502.01081 • Published Feb 3, 2025 • 13

Scaling Embedding Layers in Language Models

Paper • 2502.01637 • Published Feb 3, 2025 • 24

Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

Paper • 2502.02339 • Published Feb 4, 2025 • 23

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

Paper • 2502.01105 • Published Feb 3, 2025 • 21

Large Language Model Guided Self-Debugging Code Generation

Paper • 2502.02928 • Published Feb 5, 2025 • 13

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Paper • 2502.01506 • Published Feb 3, 2025 • 38

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4, 2025 • 255

Demystifying Long Chain-of-Thought Reasoning in LLMs

Paper • 2502.03373 • Published Feb 5, 2025 • 58

LIMO: Less is More for Reasoning

Paper • 2502.03387 • Published Feb 5, 2025 • 62

ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Paper • 2502.04320 • Published Feb 6, 2025 • 36

Enhancing Code Generation for Low-Resource Languages: No Silver Bullet

Paper • 2501.19085 • Published Jan 31, 2025 • 5

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Paper • 2502.06703 • Published Feb 10, 2025 • 152

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

Paper • 2502.06394 • Published Feb 10, 2025 • 89

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

Paper • 2502.06781 • Published Feb 10, 2025 • 58

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Paper • 2502.05609 • Published Feb 8, 2025 • 18

Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

Paper • 2502.05415 • Published Feb 8, 2025 • 20

LM2: Large Memory Models

Paper • 2502.06049 • Published Feb 9, 2025 • 31

The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

Paper • 2502.03628 • Published Feb 5, 2025 • 12

Matryoshka Quantization

Paper • 2502.06786 • Published Feb 10, 2025 • 32

History-Guided Video Diffusion

Paper • 2502.06764 • Published Feb 10, 2025 • 12

CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

Paper • 2502.06527 • Published Feb 10, 2025 • 11

The Curse of Depth in Large Language Models

Paper • 2502.05795 • Published Feb 9, 2025 • 40

MetaChain: A Fully-Automated and Zero-Code Framework for LLM Agents

Paper • 2502.05957 • Published Feb 9, 2025 • 15

Competitive Programming with Large Reasoning Models

Paper • 2502.06807 • Published Feb 3, 2025 • 69

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Paper • 2502.07316 • Published Feb 11, 2025 • 50

Teaching Language Models to Critique via Reinforcement Learning

Paper • 2502.03492 • Published Feb 5, 2025 • 24

Expect the Unexpected: FailSafe Long Context QA for Finance

Paper • 2502.06329 • Published Feb 10, 2025 • 133

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Paper • 2502.07617 • Published Feb 11, 2025 • 29

LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!

Paper • 2502.07374 • Published Feb 11, 2025 • 40

Retrieval-augmented Large Language Models for Financial Time Series Forecasting

Paper • 2502.05878 • Published Feb 9, 2025 • 40

Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Paper • 2502.06589 • Published Feb 10, 2025 • 21

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

Paper • 2502.07445 • Published Feb 11, 2025 • 11

TransMLA: Multi-head Latent Attention Is All You Need

Paper • 2502.07864 • Published Feb 11, 2025 • 57

Distillation Scaling Laws

Paper • 2502.08606 • Published Feb 12, 2025 • 47

LLM Pretraining with Continuous Concepts

Paper • 2502.08524 • Published Feb 12, 2025 • 30

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Paper • 2502.08910 • Published Feb 13, 2025 • 148

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Paper • 2502.08690 • Published Feb 12, 2025 • 43

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Paper • 2502.09604 • Published Feb 13, 2025 • 37

Exploring the Potential of Encoder-free Architectures in 3D LMMs

Paper • 2502.09620 • Published Feb 13, 2025 • 26

Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging -- An Open Recipe

Paper • 2502.09056 • Published Feb 13, 2025 • 31

Logical Reasoning in Large Language Models: A Survey

Paper • 2502.09100 • Published Feb 13, 2025 • 24

DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References

Paper • 2502.09614 • Published Feb 13, 2025 • 9

Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights

Paper • 2502.09619 • Published Feb 13, 2025 • 36

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Paper • 2502.09560 • Published Feb 13, 2025 • 35

The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Paper • 2502.08946 • Published Feb 13, 2025 • 191

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Paper • 2502.09696 • Published Feb 13, 2025 • 43

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Paper • 2502.10391 • Published Feb 14, 2025 • 34

Diverse Inference and Verification for Advanced Reasoning

Paper • 2502.09955 • Published Feb 14, 2025 • 18

AdaPTS: Adapting Univariate Foundation Models to Probabilistic Multivariate Time Series Forecasting

Paper • 2502.10235 • Published Feb 14, 2025 • 9

We Can't Understand AI Using our Existing Vocabulary

Paper • 2502.07586 • Published Feb 11, 2025 • 11

FoNE: Precise Single-Token Number Embeddings via Fourier Features

Paper • 2502.09741 • Published Feb 13, 2025 • 15

Region-Adaptive Sampling for Diffusion Transformers

Paper • 2502.10389 • Published Feb 14, 2025 • 53

Large Language Diffusion Models

Paper • 2502.09992 • Published Feb 14, 2025 • 126

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Paper • 2502.11089 • Published Feb 16, 2025 • 167

Learning Getting-Up Policies for Real-World Humanoid Robots

Paper • 2502.12152 • Published Feb 17, 2025 • 42

ReLearn: Unlearning via Learning for Large Language Models

Paper • 2502.11190 • Published Feb 16, 2025 • 30

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Paper • 2502.12115 • Published Feb 17, 2025 • 46

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Paper • 2502.12148 • Published Feb 17, 2025 • 17

How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training

Paper • 2502.11196 • Published Feb 16, 2025 • 23

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Paper • 2502.11167 • Published Feb 16, 2025 • 10

Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening

Paper • 2502.12146 • Published Feb 17, 2025 • 16

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Paper • 2502.10458 • Published Feb 12, 2025 • 38

Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Paper • 2502.11831 • Published Feb 17, 2025 • 20

CRANE: Reasoning with constrained LLM generation

Paper • 2502.09061 • Published Feb 13, 2025 • 21

Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning

Paper • 2502.10550 • Published Feb 14, 2025 • 8

MagicArticulate: Make Your 3D Models Articulation-Ready

Paper • 2502.12135 • Published Feb 17, 2025 • 8

Soundwave: Less is More for Speech-Text Alignment in LLMs

Paper • 2502.12900 • Published Feb 18, 2025 • 86

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Paper • 2502.13143 • Published Feb 18, 2025 • 31

Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Paper • 2502.13145 • Published Feb 18, 2025 • 38

FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading

Paper • 2502.11433 • Published Feb 17, 2025 • 36

You Do Not Fully Utilize Transformer's Representation Capacity

Paper • 2502.09245 • Published Feb 13, 2025 • 37

Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18, 2025 • 58

Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?

Paper • 2502.12215 • Published Feb 17, 2025 • 16

Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19, 2025 • 214

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Paper • 2502.13128 • Published Feb 18, 2025 • 41

Craw4LLM: Efficient Web Crawling for LLM Pretraining

Paper • 2502.13347 • Published Feb 19, 2025 • 30

Small Models Struggle to Learn from Strong Reasoners

Paper • 2502.12143 • Published Feb 17, 2025 • 39

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Paper • 2502.13962 • Published Feb 19, 2025 • 28

AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

Paper • 2502.13943 • Published Feb 19, 2025 • 8

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper • 2502.14786 • Published Feb 20, 2025 • 158

S*: Test Time Scaling for Code Generation

Paper • 2502.14382 • Published Feb 20, 2025 • 63

How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

Paper • 2502.14502 • Published Feb 20, 2025 • 91

Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information

Paper • 2502.14258 • Published Feb 20, 2025 • 26

VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

Paper • 2502.12084 • Published Feb 17, 2025 • 35

SurveyX: Academic Survey Automation via Large Language Models

Paper • 2502.14776 • Published Feb 20, 2025 • 100

Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

Paper • 2502.16894 • Published Feb 24, 2025 • 32

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

Paper • 2502.17157 • Published Feb 24, 2025 • 52

VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Paper • 2502.17258 • Published Feb 24, 2025 • 79

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Paper • 2502.18137 • Published Feb 25, 2025 • 60

Kanana: Compute-efficient Bilingual Language Models

Paper • 2502.18934 • Published Feb 26, 2025 • 65

Self-rewarding correction for mathematical reasoning

Paper • 2502.19613 • Published Feb 26, 2025 • 82

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Paper • 2503.01743 • Published Mar 3, 2025 • 89

Visual-RFT: Visual Reinforcement Fine-Tuning

Paper • 2503.01785 • Published Mar 3, 2025 • 86

CodeArena: A Collective Evaluation Platform for LLM Code Generation

Paper • 2503.01295 • Published Mar 3, 2025 • 8

Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

Paper • 2503.00865 • Published Mar 2, 2025 • 64

ABC: Achieving Better Control of Multimodal Embeddings using VLMs

Paper • 2503.00329 • Published Mar 1, 2025 • 20

Token-Efficient Long Video Understanding for Multimodal LLMs

Paper • 2503.04130 • Published Mar 6, 2025 • 96

EgoLife: Towards Egocentric Life Assistant

Paper • 2503.03803 • Published Mar 5, 2025 • 46

START: Self-taught Reasoner with Tools

Paper • 2503.04625 • Published Mar 6, 2025 • 113

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Paper • 2503.04724 • Published Mar 6, 2025 • 72

Unified Reward Model for Multimodal Understanding and Generation

Paper • 2503.05236 • Published Mar 7, 2025 • 123

Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Paper • 2503.05179 • Published Mar 7, 2025 • 46

Forgetting Transformer: Softmax Attention with a Forget Gate

Paper • 2503.02130 • Published Mar 3, 2025 • 32

BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities

Paper • 2503.05652 • Published Mar 7, 2025 • 11

LoRACode: LoRA Adapters for Code Embeddings

Paper • 2503.05315 • Published Mar 7, 2025 • 13

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Paper • 2503.03601 • Published Mar 5, 2025 • 232

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Paper • 2503.06680 • Published Mar 9, 2025 • 20

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Paper • 2503.07536 • Published Mar 10, 2025 • 88

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Paper • 2503.08625 • Published Mar 11, 2025 • 27

Gemini Embedding: Generalizable Embeddings from Gemini

Paper • 2503.07891 • Published Mar 10, 2025 • 45

Implicit Reasoning in Transformers is Reasoning through Shortcuts

Paper • 2503.07604 • Published Mar 10, 2025 • 23

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published Mar 10, 2025 • 48

AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models

Paper • 2503.08417 • Published Mar 11, 2025 • 8

Mixture of Experts Made Intrinsically Interpretable

Paper • 2503.07639 • Published Mar 5, 2025 • 10

AI-native Memory 2.0: Second Me

Paper • 2503.08102 • Published Mar 11, 2025 • 13

TPDiff: Temporal Pyramid Video Diffusion Model

Paper • 2503.09566 • Published Mar 12, 2025 • 45

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Paper • 2503.09573 • Published Mar 12, 2025 • 76

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Paper • 2503.08525 • Published Mar 11, 2025 • 17

Quantizing Large Language Models for Code Generation: A Differentiated Replication

Paper • 2503.07103 • Published Mar 10, 2025 • 8

Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning

Paper • 2503.11646 • Published Mar 14, 2025 • 34

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Paper • 2503.14492 • Published Mar 18, 2025 • 20

φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

Paper • 2503.13288 • Published Mar 17, 2025 • 51

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Paper • 2503.16419 • Published Mar 20, 2025 • 77

DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Paper • 2503.14487 • Published Mar 18, 2025 • 28

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Paper • 2503.15558 • Published Mar 18, 2025 • 50

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Paper • 2503.16057 • Published Mar 20, 2025 • 14

Tokenize Image as a Set

Paper • 2503.16425 • Published Mar 20, 2025 • 16

BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Paper • 2503.15242 • Published Mar 19, 2025 • 10

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Paper • 2503.16365 • Published Mar 20, 2025 • 41

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

Paper • 2503.16219 • Published Mar 20, 2025 • 52

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Paper • 2503.18878 • Published Mar 24, 2025 • 119

Video-T1: Test-Time Scaling for Video Generation

Paper • 2503.18942 • Published Mar 24, 2025 • 90

AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

Paper • 2503.18769 • Published Mar 24, 2025 • 11

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Paper • 2503.18931 • Published Mar 24, 2025 • 30

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Paper • 2503.19325 • Published Mar 25, 2025 • 73

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

Paper • 2503.19622 • Published Mar 25, 2025 • 31

Scaling Vision Pre-Training to 4K Resolution

Paper • 2503.19903 • Published Mar 25, 2025 • 41

UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Paper • 2503.21620 • Published Mar 27, 2025 • 62

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Paper • 2503.21460 • Published Mar 27, 2025 • 83

ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation

Paper • 2503.21729 • Published Mar 27, 2025 • 29

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Paper • 2503.21696 • Published Mar 27, 2025 • 23

Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation

Paper • 2503.22675 • Published Mar 28, 2025 • 36

Segment Any Motion in Videos

Paper • 2503.22268 • Published Mar 28, 2025 • 19

Your ViT is Secretly an Image Segmentation Model

Paper • 2503.19108 • Published Mar 24, 2025 • 25

What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Paper • 2503.24235 • Published Mar 31, 2025 • 54

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Paper • 2503.24379 • Published Mar 31, 2025 • 76

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Paper • 2503.24376 • Published Mar 31, 2025 • 38

Z1: Efficient Test-time Scaling with Code

Paper • 2504.00810 • Published Apr 1, 2025 • 26

Multi-Token Attention

Paper • 2504.00927 • Published Apr 1, 2025 • 56

Scaling Language-Free Visual Representation Learning

Paper • 2504.01017 • Published Apr 1, 2025 • 32

Command A: An Enterprise-Ready Large Language Model

Paper • 2504.00698 • Published Apr 1, 2025 • 29

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Paper • 2504.00999 • Published Apr 1, 2025 • 95

ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Paper • 2504.00824 • Published Apr 1, 2025 • 43

PaperBench: Evaluating AI's Ability to Replicate AI Research

Paper • 2504.01848 • Published Apr 2, 2025 • 37

Articulated Kinematics Distillation from Video Diffusion Models

Paper • 2504.01204 • Published Apr 1, 2025 • 23

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 303

ZClip: Adaptive Spike Mitigation for LLM Pre-Training

Paper • 2504.02507 • Published Apr 3, 2025 • 88

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7, 2025 • 205

URECA: Unique Region Caption Anything

Paper • 2504.05305 • Published Apr 7, 2025 • 35

DDT: Decoupled Diffusion Transformer

Paper • 2504.05741 • Published Apr 8, 2025 • 77

RobustDexGrasp: Robust Dexterous Grasping of General Objects from Single-view Perception

Paper • 2504.05287 • Published Apr 7, 2025 • 6

Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 137

DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning

Paper • 2504.07128 • Published Apr 2, 2025 • 87

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Paper • 2504.08685 • Published Apr 11, 2025 • 130

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Paper • 2504.08736 • Published Apr 11, 2025 • 46

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Paper • 2504.08388 • Published Apr 11, 2025 • 42

PixelFlow: Pixel-Space Generative Models with Flow

Paper • 2504.07963 • Published Apr 10, 2025 • 18

Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Paper • 2504.05262 • Published Apr 7, 2025 • 11

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

Paper • 2504.05303 • Published Apr 7, 2025 • 5

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 306

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7, 2025 • 139

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Paper • 2504.09925 • Published Apr 14, 2025 • 39

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Paper • 2504.10514 • Published Apr 10, 2025 • 48

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Paper • 2504.13837 • Published Apr 18, 2025 • 139

Learning to Reason under Off-Policy Guidance

Paper • 2504.14945 • Published Apr 21, 2025 • 88

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Paper • 2504.15271 • Published Apr 21, 2025 • 67

ToolRL: Reward is All Tool Learning Needs

Paper • 2504.13958 • Published Apr 16, 2025 • 49

Kuwain 1.5B: An Arabic SLM via Language Injection

Paper • 2504.15120 • Published Apr 21, 2025 • 121

Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 64

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Paper • 2504.17432 • Published Apr 24, 2025 • 40

RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning

Paper • 2504.18904 • Published Apr 26, 2025 • 9

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Paper • 2505.02707 • Published May 5, 2025 • 85

Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

Paper • 2505.04769 • Published May 7, 2025 • 10

Bielik v3 Small: Technical Report

Paper • 2505.02550 • Published May 5, 2025 • 68

Bielik 11B v2 Technical Report

Paper • 2505.02410 • Published May 5, 2025 • 54

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Paper • 2505.06111 • Published May 9, 2025 • 25

Seed1.5-VL Technical Report

Paper • 2505.07062 • Published May 11, 2025 • 155

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Paper • 2505.04410 • Published May 7, 2025 • 44

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Paper • 2505.09568 • Published May 14, 2025 • 99

Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Paper • 2505.10554 • Published May 15, 2025 • 120

EnerVerse-AC: Envisioning Embodied Environments with Action Condition

Paper • 2505.09723 • Published May 14, 2025 • 23

EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

Paper • 2505.09694 • Published May 14, 2025 • 20

Qwen3 Technical Report

Paper • 2505.09388 • Published May 14, 2025 • 335

Chain-of-Model Learning for Language Model

Paper • 2505.11820 • Published May 17, 2025 • 121

AdaptThink: Reasoning Models Can Learn When to Think

Paper • 2505.13417 • Published May 19, 2025 • 83

Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

Paper • 2505.11254 • Published May 16, 2025 • 48

Faster Video Diffusion with Trainable Sparse Attention

Paper • 2505.13389 • Published May 19, 2025 • 38

Model Merging in Pre-training of Large Language Models

Paper • 2505.12082 • Published May 17, 2025 • 40

Emerging Properties in Unified Multimodal Pretraining

Paper • 2505.14683 • Published May 20, 2025 • 133

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Paper • 2505.11594 • Published May 16, 2025 • 75

Scaling Law for Quantization-Aware Training

Paper • 2505.14302 • Published May 20, 2025 • 76

MMaDA: Multimodal Large Diffusion Language Models

Paper • 2505.15809 • Published May 21, 2025 • 98

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Paper • 2505.15045 • Published May 21, 2025 • 55

This Time is Different: An Observability Perspective on Time Series Foundation Models

Paper • 2505.14766 • Published May 20, 2025 • 40

NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

Paper • 2505.16938 • Published May 22, 2025 • 121

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

Paper • 2505.16410 • Published May 22, 2025 • 58

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Paper • 2505.16933 • Published May 22, 2025 • 34

How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

Paper • 2505.15865 • Published May 21, 2025 • 5

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Paper • 2505.18129 • Published May 23, 2025 • 62

Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

Paper • 2505.17561 • Published May 23, 2025 • 31

Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Paper • 2505.19147 • Published May 25, 2025 • 145

Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

Paper • 2505.17894 • Published May 23, 2025 • 220

Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Paper • 2505.16348 • Published May 22, 2025 • 52

ARM: Adaptive Reasoning Model

Paper • 2505.20258 • Published May 26, 2025 • 45

Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Paper • 2505.19443 • Published May 26, 2025 • 15

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Paper • 2505.21497 • Published May 27, 2025 • 109

Exploring the Latent Capacity of LLMs for One-Step Text Generation

Paper • 2505.21189 • Published May 27, 2025 • 61

ATLAS: Learning to Optimally Memorize the Context at Test Time

Paper • 2505.23735 • Published May 29, 2025 • 23

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Paper • 2506.00070 • Published May 29, 2025 • 29

MiMo-VL Technical Report

Paper • 2506.03569 • Published Jun 4, 2025 • 80

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Paper • 2505.16968 • Published May 22, 2025 • 40

AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

Paper • 2506.04089 • Published Jun 4, 2025 • 47

VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation

Paper • 2506.03930 • Published Jun 4, 2025 • 26

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Paper • 2506.05176 • Published Jun 5, 2025 • 79

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 60

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Paper • 2506.04308 • Published Jun 4, 2025 • 43

Reinforcement Pre-Training

Paper • 2506.08007 • Published Jun 9, 2025 • 263

MiniCPM4: Ultra-Efficient LLMs on End Devices

Paper • 2506.07900 • Published Jun 9, 2025 • 95

SpatialLM: Training Large Language Models for Structured Indoor Modeling

Paper • 2506.07491 • Published Jun 9, 2025 • 50

Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning

Paper • 2506.06205 • Published Jun 6, 2025 • 30

PlayerOne: Egocentric World Simulator

Paper • 2506.09995 • Published Jun 11, 2025 • 34

Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

Paper • 2506.08570 • Published Jun 10, 2025 • 33

The Diffusion Duality

Paper • 2506.10892 • Published Jun 12, 2025 • 37

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Paper • 2506.13585 • Published Jun 16, 2025 • 273

Unified Vision-Language-Action Model

Paper • 2506.19850 • Published Jun 24, 2025 • 27

WorldVLA: Towards Autoregressive Action World Model

Paper • 2506.21539 • Published Jun 26, 2025 • 40

Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Paper • 2506.21551 • Published Jun 26, 2025 • 28

Ark: An Open-source Python-based Framework for Robot Learning

Paper • 2506.21628 • Published Jun 24, 2025 • 16

Kwai Keye-VL Technical Report

Paper • 2507.01949 • Published Jul 2, 2025 • 131

Depth Anything at Any Condition

Paper • 2507.01634 • Published Jul 2, 2025 • 49

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Paper • 2507.01925 • Published Jul 2, 2025 • 39

T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Paper • 2507.05964 • Published Jul 8, 2025 • 120

PhysX: Physical-Grounded 3D Asset Generation

Paper • 2507.12465 • Published Jul 16, 2025 • 44

A Survey of Context Engineering for Large Language Models

Paper • 2507.13334 • Published Jul 17, 2025 • 261

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Paper • 2507.13348 • Published Jul 17, 2025 • 79

Set Block Decoding is a Language Model Inference Accelerator

Paper • 2509.04185 • Published Sep 4, 2025 • 54

Why Language Models Hallucinate

Paper • 2509.04664 • Published Sep 4, 2025 • 196

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Paper • 2509.09372 • Published Sep 11, 2025 • 246

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Paper • 2509.09674 • Published Sep 11, 2025 • 80

FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Paper • 2509.04996 • Published Sep 5, 2025 • 15

QuantAgent: Price-Driven Multi-Agent LLMs for High-Frequency Trading

Paper • 2509.09995 • Published Sep 12, 2025 • 16

ByteWrist: A Parallel Robotic Wrist Enabling Flexible and Anthropomorphic Motion for Confined Spaces

Paper • 2509.18084 • Published Sep 22, 2025 • 13

Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Paper • 2509.19301 • Published Sep 23, 2025 • 19

DA^2: Depth Anything in Any Direction

Paper • 2509.26618 • Published Sep 30, 2025 • 26

LongCodeZip: Compress Long Context for Code Language Models

Paper • 2510.00446 • Published Oct 1, 2025 • 107

VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Paper • 2510.01623 • Published Oct 2, 2025 • 12

ExGRPO: Learning to Reason from Experience

Paper • 2510.02245 • Published Oct 2, 2025 • 80

Paper2Video: Automatic Video Generation from Scientific Papers

Paper • 2510.05096 • Published Oct 6, 2025 • 119

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Paper • 2510.08540 • Published Oct 9, 2025 • 109

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Paper • 2510.07242 • Published Oct 8, 2025 • 30

Reinforcing Diffusion Models by Direct Group Preference Optimization

Paper • 2510.08425 • Published Oct 9, 2025 • 12

DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model

Paper • 2510.08556 • Published Oct 9, 2025 • 7

R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation

Paper • 2510.08547 • Published Oct 9, 2025 • 5

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Paper • 2510.05684 • Published Oct 7, 2025 • 143

Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

Paper • 2510.08696 • Published Oct 9, 2025 • 15

Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14, 2025 • 123

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Paper • 2510.13626 • Published Oct 15, 2025 • 46

ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

Paper • 2510.04767 • Published Oct 6, 2025 • 28

The Art of Scaling Reinforcement Learning Compute for LLMs

Paper • 2510.13786 • Published Oct 15, 2025 • 32

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Paper • 2510.14943 • Published Oct 16, 2025 • 40

VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation

Paper • 2510.14902 • Published Oct 16, 2025 • 17

VLA-0: Building State-of-the-Art VLAs with Zero Modification

Paper • 2510.13054 • Published Oct 15, 2025 • 16

SimKO: Simple Pass@K Policy Optimization

Paper • 2510.14807 • Published Oct 16, 2025 • 11

pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Paper • 2510.14974 • Published Oct 16, 2025 • 10

AnyUp: Universal Feature Upsampling

Paper • 2510.12764 • Published Oct 14, 2025 • 12

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Paper • 2510.15742 • Published Oct 17, 2025 • 51

LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal

Paper • 2510.15868 • Published Oct 17, 2025 • 27

RL makes MLLMs see better than SFT

Paper • 2510.16333 • Published Oct 18, 2025 • 49

Chronos-2: From Univariate to Universal Forecasting

Paper • 2510.15821 • Published Oct 17, 2025 • 22

Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Paper • 2510.16751 • Published Oct 19, 2025 • 21

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

Paper • 2510.23763 • Published Oct 27, 2025 • 56

Exploring Conditions for Diffusion models in Robotic Control

Paper • 2510.15510 • Published Oct 17, 2025 • 40

π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Paper • 2510.25889 • Published Oct 29, 2025 • 66

World Simulation with Video Foundation Models for Physical AI

Paper • 2511.00062 • Published Oct 28, 2025 • 44

Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Paper • 2511.01718 • Published Nov 3, 2025 • 7

Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

Paper • 2510.25616 • Published Oct 29, 2025 • 105

Robot Learning from a Physical World Model

Paper • 2511.07416 • Published Nov 10, 2025 • 32

Depth Anything 3: Recovering the Visual Space from Any Views

Paper • 2511.10647 • Published Nov 13, 2025 • 99

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

Paper • 2511.15605 • Published Nov 19, 2025 • 24

RynnVLA-002: A Unified Vision-Language-Action and World Model

Paper • 2511.17502 • Published Nov 21, 2025 • 28

MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots

Paper • 2511.17889 • Published Nov 22, 2025 • 5

ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Paper • 2511.20937 • Published Nov 26, 2025 • 16

Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 158

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

Paper • 2512.02834 • Published Dec 2, 2025 • 41

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Paper • 2511.22345 • Published Nov 27, 2025 • 13

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Paper • 2512.06963 • Published Dec 7, 2025 • 4

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

Paper • 2512.09928 • Published Dec 10, 2025 • 14

LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator

Paper • 2512.10605 • Published Dec 11, 2025 • 7

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

Paper • 2512.06951 • Published Dec 7, 2025 • 4

Memory in the Age of AI Agents

Paper • 2512.13564 • Published Dec 15, 2025 • 151

Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge

Paper • 2512.10071 • Published Dec 10, 2025 • 18

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Paper • 2512.11891 • Published Dec 9, 2025 • 10

Learning Robot Manipulation from Audio World Models

Paper • 2512.08405 • Published Dec 9, 2025 • 2

LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry

Paper • 2512.19629 • Published Dec 22, 2025 • 26

SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

Paper • 2601.03044 • Published Jan 6 • 28

E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Paper • 2601.00423 • Published Jan 1 • 11

AT^2PO: Agentic Turn-based Policy Optimization via Tree Search

Paper • 2601.04767 • Published Jan 8 • 28

Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes

Paper • 2601.04300 • Published Jan 7 • 3

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Paper • 2601.05242 • Published Jan 8 • 228

RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Paper • 2601.05241 • Published Jan 8 • 24

Solar Open Technical Report

Paper • 2601.07022 • Published Jan 11 • 65

Ministral 3

Paper • 2601.08584 • Published Jan 13 • 54

ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

Paper • 2512.24965 • Published Dec 31, 2025 • 42

FlowAct-R1: Towards Interactive Humanoid Video Generation

Paper • 2601.10103 • Published Jan 15 • 74

Action100M: A Large-scale Video Action Dataset

Paper • 2601.10592 • Published Jan 15 • 29

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Paper • 2601.11404 • Published Jan 16 • 26

FrankenMotion: Part-level Human Motion Generation and Composition

Paper • 2601.10909 • Published Jan 15 • 18

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Paper • 2601.12993 • Published Jan 19 • 75

BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Paper • 2601.15197 • Published Jan 21 • 54

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Paper • 2601.16163 • Published Jan 22 • 14

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Paper • 2601.14133 • Published Jan 20 • 61

A Pragmatic VLA Foundation Model

Paper • 2601.18692 • Published about 1 month ago • 47

Advancing Open-source World Models

Paper • 2601.20540 • Published 29 days ago • 128

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Paper • 2601.22153 • Published 28 days ago • 71

Beyond Imitation: Reinforcement Learning for Active Latent Planning

Paper • 2601.21598 • Published 28 days ago • 9

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Paper • 2601.20218 • Published 29 days ago • 15

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Paper • 2602.00919 • Published 26 days ago • 305

SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation

Paper • 2602.02402 • Published 24 days ago • 32

VLS: Steering Pretrained Robot Policies via Vision-Language Models

Paper • 2602.03973 • Published 23 days ago • 22

VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Paper • 2602.10098 • Published 16 days ago • 18

PhyCritic: Multimodal Critic Models for Physical AI

Paper • 2602.11124 • Published 15 days ago • 52

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

Paper • 2602.12099 • Published 14 days ago • 57

χ_{0}: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies

Paper • 2602.09021 • Published 17 days ago • 25

RISE: Self-Improving Robot Policy with Compositional World Model

Paper • 2602.11075 • Published 15 days ago • 30

EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

Paper • 2602.10106 • Published 16 days ago • 21

Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

Paper • 2602.16705 • Published 8 days ago • 26

RynnBrain: Open Embodied Foundation Models

Paper • 2602.14979 • Published 13 days ago • 42

World Action Models are Zero-shot Policies

Paper • 2602.15922 • Published 9 days ago • 11

TactAlign: Human-to-Robot Policy Transfer via Tactile Alignment

Paper • 2602.13579 • Published 12 days ago • 10

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Paper • 2602.20309 • Published 3 days ago • 10