Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Abstract
Loc3R-VLM enhances 2D Vision-Language Models with 3D understanding capabilities through spatial supervision from monocular video input, achieving superior performance in language-based localization and 3D question-answering.
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction, which builds a holistic representation of the scene structure, and explicit situation modeling, which anchors the egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
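The abstract describes training with two spatial objectives alongside the usual language-modeling loss. A minimal sketch of how such a joint objective might be combined is shown below; the function and weight names are assumptions for illustration, not taken from the paper:

```python
# Hypothetical sketch of a joint training objective with spatial supervision
# (names and weights are assumptions, not from the paper): a language-modeling
# loss plus two spatial terms -- global layout reconstruction and egocentric
# situation (pose) modeling.

def joint_loss(lm_loss: float,
               layout_loss: float,
               situation_loss: float,
               w_layout: float = 0.5,
               w_situation: float = 0.5) -> float:
    """Weighted sum of the three objectives described in the abstract."""
    return lm_loss + w_layout * layout_loss + w_situation * situation_loss

# Example: per-batch losses from the three heads
total = joint_loss(lm_loss=2.1, layout_loss=0.8, situation_loss=0.4)
print(round(total, 2))  # 2.7
```

In practice the two spatial heads would be supervised by the camera pose priors from the pre-trained 3D foundation model mentioned in the abstract; the scalar weights here stand in for whatever balancing the authors use.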
Community
Loc3R-VLM enables 3D understanding from video input. Inspired by human cognition, it builds an internal cognitive map of the global environment while explicitly modeling an agent's position and orientation. By jointly capturing global layout and egocentric state, the model excels at two core tasks: language-driven localization and viewpoint-aware 3D reasoning.
🚀 Project Page: https://kevinqu7.github.io/loc3r-vlm/
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning (2026)
- Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding (2026)
- Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding (2026)
- SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning (2026)
- DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding (2026)
- VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models (2026)
- Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning (2026)