stereoplegic's Collections: Multimodal
Woodpecker: Hallucination Correction for Multimodal Large Language
Models
Paper
• 2310.16045
• Published
• 17
HallusionBench: You See What You Think? Or You Think What You See? An
Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5,
and Other Multi-modality Models
Paper
• 2310.14566
• Published
• 27
SILC: Improving Vision Language Pretraining with Self-Distillation
Paper
• 2310.13355
• Published
• 9
Conditional Diffusion Distillation
Paper
• 2310.01407
• Published
• 20
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative
Editing
Paper
• 2310.12404
• Published
• 15
MusicAgent: An AI Agent for Music Understanding and Generation with
Large Language Models
Paper
• 2310.11954
• Published
• 25
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Paper
• 2309.16058
• Published
• 57
Jointly Training Large Autoregressive Multimodal Models
Paper
• 2309.15564
• Published
• 8
Empowering Vision-Language Models to Follow Interleaved Vision-Language
Instructions
Paper
• 2308.04152
• Published
• 2
Multimodal Foundation Models: From Specialists to General-Purpose
Assistants
Paper
• 2309.10020
• Published
• 41
Language as the Medium: Multimodal Video Classification through text
only
Paper
• 2309.10783
• Published
• 1
Reformulating Vision-Language Foundation Models and Datasets Towards
Universal Multimodal Assistants
Paper
• 2310.00653
• Published
• 3
Kosmos-2.5: A Multimodal Literate Model
Paper
• 2309.11419
• Published
• 56
You Only Look at Screens: Multimodal Chain-of-Action Agents
Paper
• 2309.11436
• Published
• 1
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Paper
• 2310.00704
• Published
• 21
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle
Consistency
Paper
• 2310.03734
• Published
• 15
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
Paper
• 2310.03739
• Published
• 22
Improved Baselines with Visual Instruction Tuning
Paper
• 2310.03744
• Published
• 39
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic
Image Design and Generation
Paper
• 2310.08541
• Published
• 18
Visual Storytelling with Question-Answer Plans
Paper
• 2310.05295
• Published
• 1
Aligning Large Multimodal Models with Factually Augmented RLHF
Paper
• 2309.14525
• Published
• 32
Toward Joint Language Modeling for Speech Units and Text
Paper
• 2310.08715
• Published
• 9
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Paper
• 2306.05425
• Published
• 12
NExT-GPT: Any-to-Any Multimodal LLM
Paper
• 2309.05519
• Published
• 79
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper
• 2310.05737
• Published
• 6
X-LLM: Bootstrapping Advanced Large Language Models by Treating
Multi-Modalities as Foreign Languages
Paper
• 2305.04160
• Published
• 2
MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning
Paper
• 2310.09478
• Published
• 21
Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models
Paper
• 2308.13437
• Published
• 4
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper
• 2308.12966
• Published
• 11
Ziya-VL: Bilingual Large Vision-Language Model via Multi-Task
Instruction Tuning
Paper
• 2310.08166
• Published
• 1
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
Models
Paper
• 2310.08825
• Published
• 1
Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs
Paper
• 2310.00582
• Published
• 1
Large-Scale Automatic Audiobook Creation
Paper
• 2309.03926
• Published
• 56
Kosmos-G: Generating Images in Context with Multimodal Large Language
Models
Paper
• 2310.02992
• Published
• 4
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models
Paper
• 2309.04041
• Published
• 1
Multimodal Graph Learning for Generative Tasks
Paper
• 2310.07478
• Published
• 1
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Paper
• 2309.09958
• Published
• 20
TextBind: Multi-turn Interleaved Multimodal Instruction-following
Paper
• 2309.08637
• Published
• 7
ImageBind-LLM: Multi-modality Instruction Tuning
Paper
• 2309.03905
• Published
• 18
Never-ending Learning of User Interfaces
Paper
• 2308.08726
• Published
• 2
LMDX: Language Model-based Document Information Extraction and
Localization
Paper
• 2309.10952
• Published
• 67
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial
Understanding
Paper
• 2310.15308
• Published
• 23
TiC-CLIP: Continual Training of CLIP Models
Paper
• 2310.16226
• Published
• 10
ConvNets Match Vision Transformers at Scale
Paper
• 2310.16764
• Published
• 21
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons
Images
Paper
• 2310.16825
• Published
• 36
A Picture is Worth a Thousand Words: Principled Recaptioning Improves
Image Generation
Paper
• 2310.16656
• Published
• 53
From Sparse to Soft Mixtures of Experts
Paper
• 2308.00951
• Published
• 22
Experts Weights Averaging: A New General Training Scheme for Vision
Transformers
Paper
• 2308.06093
• Published
• 2
Self-slimmed Vision Transformer
Paper
• 2111.12624
• Published
• 1
Robustifying Token Attention for Vision Transformers
Paper
• 2303.11126
• Published
• 1
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks
Paper
• 2306.04073
• Published
• 2
Retrieval-Augmented Multimodal Language Modeling
Paper
• 2211.12561
• Published
• 1
Long-range Language Modeling with Self-retrieval
Paper
• 2306.13421
• Published
• 17
Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning
Paper
• 2303.08566
• Published
• 1
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation
Paper
• 2205.08180
• Published
• 1
PVP: Pre-trained Visual Parameter-Efficient Tuning
Paper
• 2304.13639
• Published
• 1
Do We Really Need a Large Number of Visual Prompts?
Paper
• 2305.17223
• Published
• 1
Alternating Gradient Descent and Mixture-of-Experts for Integrated
Multimodal Perception
Paper
• 2305.06324
• Published
• 1
Zorro: the masked multimodal transformer
Paper
• 2301.09595
• Published
• 2
Attention Bottlenecks for Multimodal Fusion
Paper
• 2107.00135
• Published
• 1
Contrastive Audio-Visual Masked Autoencoder
Paper
• 2210.07839
• Published
• 1
On Robustness in Multimodal Learning
Paper
• 2304.04385
• Published
• 1
Meta-Transformer: A Unified Framework for Multimodal Learning
Paper
• 2307.10802
• Published
• 45
Using Multiple Instance Learning to Build Multimodal Representations
Paper
• 2212.05561
• Published
• 1
LMEye: An Interactive Perception Network for Large Language Models
Paper
• 2305.03701
• Published
• 2
Concept-Oriented Deep Learning with Large Language Models
Paper
• 2306.17089
• Published
• 1
AGIBench: A Multi-granularity, Multimodal, Human-referenced,
Auto-scoring Benchmark for Large Language Models
Paper
• 2309.06495
• Published
• 1
Multimodal Multi-Hop Question Answering Through a Conversation Between
Tools and Efficiently Finetuned Large Language Models
Paper
• 2309.08922
• Published
• 1
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large
Language Model Signals for Science Question Answering
Paper
• 2305.03453
• Published
• 1
ViperGPT: Visual Inference via Python Execution for Reasoning
Paper
• 2303.08128
• Published
• 2
Visual Programming: Compositional visual reasoning without training
Paper
• 2211.11559
• Published
• 1
Generalization Differences between End-to-End and Neuro-Symbolic
Vision-Language Reasoning Systems
Paper
• 2210.15037
• Published
• 1
Diversifying Joint Vision-Language Tokenization Learning
Paper
• 2306.03421
• Published
• 2
Joint Adaptive Representations for Image-Language Learning
Paper
• 2305.19924
• Published
• 1
Modular Visual Question Answering via Code Generation
Paper
• 2306.05392
• Published
• 2
TouchStone: Evaluating Vision-Language Models by Language Models
Paper
• 2308.16890
• Published
• 1
MMICL: Empowering Vision-language Model with Multi-Modal In-Context
Learning
Paper
• 2309.07915
• Published
• 4
VIGC: Visual Instruction Generation and Correction
Paper
• 2308.12714
• Published
• 1
Latent Consistency Models: Synthesizing High-Resolution Images with
Few-Step Inference
Paper
• 2310.04378
• Published
• 22
MetaFormer Is Actually What You Need for Vision
Paper
• 2111.11418
• Published
• 1
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Paper
• 2310.11441
• Published
• 29
An Image is Worth Multiple Words: Learning Object Level Concepts using
Multi-Concept Prompt Learning
Paper
• 2310.12274
• Published
• 13
Matryoshka Diffusion Models
Paper
• 2310.15111
• Published
• 45
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Paper
• 2310.11440
• Published
• 17
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient
Vision Transformers
Paper
• 2303.13755
• Published
• 1
DSG: An End-to-End Document Structure Generator
Paper
• 2310.09118
• Published
• 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models
across Computer Vision Tasks
Paper
• 2310.19909
• Published
• 21
Beyond U: Making Diffusion Models Faster & Lighter
Paper
• 2310.20092
• Published
• 12
i-Code Studio: A Configurable and Composable Framework for Integrative
AI
Paper
• 2305.13738
• Published
• 1
AssistGPT: A General Multi-modal Assistant that can Plan, Execute,
Inspect, and Learn
Paper
• 2306.08640
• Published
• 27
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large
Language Models
Paper
• 2309.10707
• Published
• 2
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual
Instruction Tuning
Paper
• 2306.04387
• Published
• 9
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language
Models
Paper
• 2308.00675
• Published
• 37
Evaluating the Capability of Large-scale Language Models on Chinese
Grammatical Error Correction Task
Paper
• 2307.03972
• Published
• 1
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image
Understanding
Paper
• 2306.17107
• Published
• 12
GPT4Tools: Teaching Large Language Model to Use Tools via
Self-instruction
Paper
• 2305.18752
• Published
• 5
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and
Compositional Experts
Paper
• 2305.14839
• Published
• 1
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo
Labelling
Paper
• 2311.00430
• Published
• 56
Reproducing Whisper-Style Training Using an Open-Source Toolkit and
Publicly Available Data
Paper
• 2309.13876
• Published
• 1
HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models
Paper
• 2309.15701
• Published
• 2
Massive End-to-end Models for Short Search Queries
Paper
• 2309.12963
• Published
• 1
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework
for Speech Recognition
Paper
• 2310.06434
• Published
• 4
MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics
Transcription
Paper
• 2108.02625
• Published
• 1
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation,
Generation and Editing
Paper
• 2311.00571
• Published
• 43
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression
For On-device ASR Models
Paper
• 2309.01947
• Published
• 1
Scaling Multimodal Pre-Training via Cross-Modality Gradient
Harmonization
Paper
• 2211.02077
• Published
• 1
MUTEX: Learning Unified Policies from Multimodal Task Specifications
Paper
• 2309.14320
• Published
• 1
Linking Representations with Multimodal Contrastive Learning
Paper
• 2304.03464
• Published
• 1
Beyond Attentive Tokens: Incorporating Token Importance and Diversity
for Efficient Vision Transformers
Paper
• 2211.11315
• Published
• 1
Attention or Convolution: Transformer Encoders in Audio Language Models
for Inference Efficiency
Paper
• 2311.02772
• Published
• 8
FLAP: Fast Language-Audio Pre-training
Paper
• 2311.01615
• Published
• 17
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual
Representation Learning
Paper
• 2304.06461
• Published
• 1
UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation
Paper
• 2303.05668
• Published
• 1
One-Step Knowledge Distillation and Fine-Tuning in Using Large
Pre-Trained Self-Supervised Learning Models for Speaker Verification
Paper
• 2305.17394
• Published
• 1
PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech
Representations
Paper
• 2203.16965
• Published
• 1
Task-Agnostic Structured Pruning of Speech Representation Models
Paper
• 2306.01385
• Published
• 1
Recycle-and-Distill: Universal Compression Strategy for
Transformer-based Speech SSL Models with Attention Map Reusing and Masking
Distillation
Paper
• 2305.11685
• Published
• 2
Beyond Universal Transformer: block reusing with adaptor in Transformer
for automatic speech recognition
Paper
• 2303.13072
• Published
• 1
MultiWay-Adapter: Adapting large-scale multi-modal models for scalable
image-text retrieval
Paper
• 2309.01516
• Published
• 1
Visual Query Tuning: Towards Effective Usage of Intermediate
Representations for Parameter and Memory Efficient Transfer Learning
Paper
• 2212.03220
• Published
• 1
Residual Mixture of Experts
Paper
• 2204.09636
• Published
• 1
End-to-end Knowledge Retrieval with Multi-modal Queries
Paper
• 2306.00424
• Published
• 1
A Symmetric Dual Encoding Dense Retrieval Framework for
Knowledge-Intensive Visual Question Answering
Paper
• 2304.13649
• Published
• 1
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Paper
• 2311.04589
• Published
• 21
CODA-Prompt: COntinual Decomposed Attention-based Prompting for
Rehearsal-Free Continual Learning
Paper
• 2211.13218
• Published
• 1
When Prompt-based Incremental Learning Does Not Meet Strong Pretraining
Paper
• 2308.10445
• Published
• 1
PILOT: A Pre-Trained Model-Based Continual Learning Toolbox
Paper
• 2309.07117
• Published
• 2
A Simple Baseline that Questions the Use of Pretrained-Models in
Continual Learning
Paper
• 2210.04428
• Published
• 1
A soft nearest-neighbor framework for continual semi-supervised learning
Paper
• 2212.05102
• Published
• 1
Avalanche: an End-to-End Library for Continual Learning
Paper
• 2104.00405
• Published
• 2
SequeL: A Continual Learning Library in PyTorch and JAX
Paper
• 2304.10857
• Published
• 1
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient
Vision Transformer
Paper
• 2306.06446
• Published
• 1
An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training
Paper
• 2306.17165
• Published
• 1
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture
of Experts
Paper
• 2105.03036
• Published
• 2
Language-Routing Mixture of Experts for Multilingual and Code-Switching
Speech Recognition
Paper
• 2307.05956
• Published
• 1
Cross-token Modeling with Conditional Computation
Paper
• 2109.02008
• Published
• 1
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal
Language Models
Paper
• 2311.05997
• Published
• 37
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Paper
• 2311.06243
• Published
• 21
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor
Cores
Paper
• 2311.05908
• Published
• 14
Continual Learning for Monolingual End-to-End Automatic Speech
Recognition
Paper
• 2112.09427
• Published
• 1
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
Paper
• 2208.08340
• Published
• 1
MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene
Classification
Paper
• 2309.09276
• Published
• 1
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Paper
• 2306.15706
• Published
• 1
MEGAVERSE: Benchmarking Large Language Models Across Languages,
Modalities, Models and Tasks
Paper
• 2311.07463
• Published
• 15
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
Paper
• 2311.05556
• Published
• 87
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of
Experts And Frequency-augmented Decoder Approach
Paper
• 2310.12004
• Published
• 2
From Words to Music: A Study of Subword Tokenization Techniques in
Symbolic Music Generation
Paper
• 2304.08953
• Published
• 2
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic
Speech Recognition
Paper
• 2209.15176
• Published
• 1
Decoder-only Architecture for Speech Recognition with CTC Prompts and
Text Data Augmentation
Paper
• 2309.08876
• Published
• 1
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Paper
• 2312.04410
• Published
• 15
Phonetic-assisted Multi-Target Units Modeling for Improving
Conformer-Transducer ASR system
Paper
• 2211.01571
• Published
• 1
E-Branchformer: Branchformer with Enhanced merging for speech
recognition
Paper
• 2210.00077
• Published
• 2
Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Paper
• 2309.10713
• Published
• 1
EfficientFormer: Vision Transformers at MobileNet Speed
Paper
• 2206.01191
• Published
• 1
COMCAT: Towards Efficient Compression and Customization of
Attention-Based Vision Models
Paper
• 2305.17235
• Published
• 2
Semi-Autoregressive Streaming ASR With Label Context
Paper
• 2309.10926
• Published
• 1
eP-ALM: Efficient Perceptual Augmentation of Language Models
Paper
• 2303.11403
• Published
• 3
OneLLM: One Framework to Align All Modalities with Language
Paper
• 2312.03700
• Published
• 24
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
Language, Audio, and Action
Paper
• 2312.17172
• Published
• 30
Augmenting text for spoken language understanding with Large Language
Models
Paper
• 2309.09390
• Published
• 2
Audiobox: Unified Audio Generation with Natural Language Prompts
Paper
• 2312.15821
• Published
• 17
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
Paper
• 2312.14385
• Published
• 7
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
Paper
• 2401.00849
• Published
• 17
Mug-STAN: Adapting Image-Language Pretrained Models for General Video
Understanding
Paper
• 2311.15075
• Published
• 2
MLLMs-Augmented Visual-Language Representation Learning
Paper
• 2311.18765
• Published
• 1
InfMLLM: A Unified Framework for Visual-Language Tasks
Paper
• 2311.06791
• Published
• 3
Generative Multimodal Models are In-Context Learners
Paper
• 2312.13286
• Published
• 36
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
representations to LLMs and Emergent Cross-modal Reasoning
Paper
• 2311.18799
• Published
• 1
Training Transformers Together
Paper
• 2207.03481
• Published
• 6
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Paper
• 2309.07623
• Published
• 1
DocLLM: A layout-aware generative language model for multimodal document
understanding
Paper
• 2401.00908
• Published
• 189
Diffusion Model Alignment Using Direct Preference Optimization
Paper
• 2311.12908
• Published
• 49
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
Paper
• 2312.03491
• Published
• 34
Efficient Monotonic Multihead Attention
Paper
• 2312.04515
• Published
• 8
Qwen-Audio: Advancing Universal Audio Understanding via Unified
Large-Scale Audio-Language Models
Paper
• 2311.07919
• Published
• 10
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models
Paper
• 2311.07575
• Published
• 15
Towards General-Purpose Speech Abilities for Large Language Models Using
Unpaired Data
Paper
• 2311.06753
• Published
• 7
LayoutPrompter: Awaken the Design Ability of Large Language Models
Paper
• 2311.06495
• Published
• 12
Honeybee: Locality-enhanced Projector for Multimodal LLM
Paper
• 2312.06742
• Published
• 13
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient
Channels
Paper
• 2309.08513
• Published
• 2
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models
Paper
• 2305.05189
• Published
• 3
TextDiffuser: Diffusion Models as Text Painters
Paper
• 2305.10855
• Published
• 4
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for
Fashion Image Editing
Paper
• 2304.02051
• Published
• 4
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
Paper
• 2401.12179
• Published
• 21
StreamVoice: Streamable Context-Aware Language Modeling for Real-time
Zero-Shot Voice Conversion
Paper
• 2401.11053
• Published
• 11
FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder
Paper
• 2401.10032
• Published
• 13
BAE-Net: A Low complexity and high fidelity Bandwidth-Adaptive neural
network for speech super-resolution
Paper
• 2312.13722
• Published
• 1
Incremental FastPitch: Chunk-based High Quality Text to Speech
Paper
• 2401.01755
• Published
• 10
CoMoSVC: Consistency Model-based Singing Voice Conversion
Paper
• 2401.01792
• Published
• 11
Towards High-Quality and Efficient Speech Bandwidth Extension with
Parallel Amplitude and Phase Prediction
Paper
• 2401.06387
• Published
• 1
Multi-Scale Sub-Band Constant-Q Transform Discriminator for
High-Fidelity Vocoder
Paper
• 2311.14957
• Published
• 3
VMamba: Visual State Space Model
Paper
• 2401.10166
• Published
• 40
Factorization Vision Transformer: Modeling Long Range Dependency with
Local Window Cost
Paper
• 2312.08614
• Published
• 1
MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper
• 2401.13601
• Published
• 48
ModaVerse: Efficiently Transforming Modalities with LLMs
Paper
• 2401.06395
• Published
• 3
Video Understanding with Large Language Models: A Survey
Paper
• 2312.17432
• Published
• 3
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Paper
• 2401.00246
• Published
• 14
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage
and Sharing in LLMs
Paper
• 2311.15759
• Published
• 1
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in
Large Multimodal Models
Paper
• 2401.13311
• Published
• 12
Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models
Paper
• 2309.01479
• Published
• 1
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for
End-to-End Speech Recognition
Paper
• 2209.08326
• Published
• 1
Mixture-of-experts VAEs can disregard variation in surjective multimodal
data
Paper
• 2204.05229
• Published
• 1
One Model, Multiple Modalities: A Sparsely Activated Approach for Text,
Sound, Image, Video and Code
Paper
• 2205.06126
• Published
• 1
simple diffusion: End-to-end diffusion for high resolution images
Paper
• 2301.11093
• Published
• 2
Vision Mamba: Efficient Visual Representation Learning with
Bidirectional State Space Model
Paper
• 2401.09417
• Published
• 62
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image
Segmentation
Paper
• 2401.13560
• Published
• 1
Vivim: a Video Vision Mamba for Medical Video Object Segmentation
Paper
• 2401.14168
• Published
• 2
2-D SSM: A General Spatial Layer for Visual Transformers
Paper
• 2306.06635
• Published
• 1
IconShop: Text-Guided Vector Icon Synthesis with Autoregressive
Transformers
Paper
• 2304.14400
• Published
• 4
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Paper
• 2312.09911
• Published
• 55
StarVector: Generating Scalable Vector Graphics Code from Images
Paper
• 2312.11556
• Published
• 37
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on
E-Branchformer
Paper
• 2401.16658
• Published
• 14
OtterHD: A High-Resolution Multi-modality Model
Paper
• 2311.04219
• Published
• 34
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and
reusing ModulEs
Paper
• 2311.04901
• Published
• 9
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration
Paper
• 2311.04257
• Published
• 22
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Paper
• 2311.05348
• Published
• 13
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper
• 2311.05437
• Published
• 51
Link-Context Learning for Multimodal LLMs
Paper
• 2308.07891
• Published
• 17
Empowering LLM to use Smartphone for Intelligent Task Automation
Paper
• 2308.15272
• Published
• 1
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture
of Experts
Paper
• 2206.02770
• Published
• 4
Deep Lifelong Cross-modal Hashing
Paper
• 2304.13357
• Published
• 1
Multi-Dimensional Hyena for Spatial Inductive Bias
Paper
• 2309.13600
• Published
• 1
FLatten Transformer: Vision Transformer using Focused Linear Attention
Paper
• 2308.00442
• Published
• 1
PALO: A Polyglot Large Multimodal Model for 5B People
Paper
• 2402.14818
• Published
• 23
SpeechAgents: Human-Communication Simulation with Multi-Modal
Multi-Agent Systems
Paper
• 2401.03945
• Published
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
• 2402.14289
• Published
• 20
MultiModN - Multimodal, Multi-Task, Interpretable Modular Networks
Paper
• 2309.14118
• Published
DistriFusion: Distributed Parallel Inference for High-Resolution
Diffusion Models
Paper
• 2402.19481
• Published
• 22
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published
• 16
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
• 2402.13577
• Published
• 9
Finetuned Multimodal Language Models Are High-Quality Image-Text Data
Filters
Paper
• 2403.02677
• Published
• 18
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of
Large Vision-Language Models
Paper
• 2403.00231
• Published
• 2
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published
• 77
Diffusion Models Without Attention
Paper
• 2311.18257
• Published
• 3
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal
Large Language Models
Paper
• 2403.13447
• Published
• 19
MambaIR: A Simple Baseline for Image Restoration with State-Space Model
Paper
• 2402.15648
• Published
FiT: Flexible Vision Transformer for Diffusion Model
Paper
• 2402.12376
• Published
• 48
SSM Meets Video Diffusion Models: Efficient Video Generation with
Structured State Spaces
Paper
• 2403.07711
• Published
• 1
Scalable Diffusion Models with State Space Backbone
Paper
• 2402.05608
• Published
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper
• 2403.09338
• Published
• 8
VideoMamba: State Space Model for Efficient Video Understanding
Paper
• 2403.06977
• Published
• 29
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Paper
• 2402.08846
• Published
• 2
Transparent Image Layer Diffusion using Latent Transparency
Paper
• 2402.17113
• Published
• 5
On Speculative Decoding for Multimodal Large Language Models
Paper
• 2404.08856
• Published
• 13
Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image
Diffusion Models
Paper
• 2405.14828
• Published
Chat-UniVi: Unified Visual Representation Empowers Large Language Models
with Image and Video Understanding
Paper
• 2311.08046
• Published
• 2
UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language
Models
Paper
• 2405.10311
• Published
Matryoshka Multimodal Models
Paper
• 2405.17430
• Published
• 34
Speak While You Think: Streaming Speech Synthesis During Text Generation
Paper
• 2309.11210
• Published
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
• 2406.11271
• Published
• 21
GrootVL: Tree Topology is All You Need in State Space Model
Paper
• 2406.02395
• Published
• 1
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic
Scientific Workflows
Paper
• 2505.19897
• Published
• 104
Better Together: Leveraging Unpaired Multimodal Data for Stronger
Unimodal Models
Paper
• 2510.08492
• Published
• 10
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Paper
• 2602.07026
• Published
• 135
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
Paper
• 2602.12099
• Published
• 56