ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents Paper • 2507.22827 • Published Jul 2025 • 94
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations Paper • 2506.18898 • Published Jun 23, 2025 • 33
Multimodal Long Video Modeling Based on Temporal Dynamic Context Paper • 2504.10443 • Published Apr 14, 2025 • 4
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention Paper • 2303.16199 • Published Mar 28, 2023 • 4
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model Paper • 2304.15010 • Published Apr 28, 2023 • 4
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation Paper • 2502.16707 • Published Feb 23, 2025 • 13
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 24
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant Paper • 2410.13360 • Published Oct 17, 2024 • 9
OneLLM: One Framework to Align All Modalities with Language Paper • 2312.03700 • Published Dec 6, 2023 • 24
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models Paper • 2311.07575 • Published Nov 13, 2023 • 15