Tags: Audio-Text-to-Text · Transformers · Safetensors · English · llava_llama · text-generation
nielsr (HF Staff) committed · Commit fe48c05 · verified · 1 Parent(s): 4ef61b7

Add comprehensive model card for LLaSO-Base-3.8B-Instruct with pipeline tag, library name, and dataset links


This PR significantly enhances the model card for `LLaSO-Base-3.8B-Instruct`, a foundational model from the LLaSO framework for Large Language and Speech Models.

Key improvements include:
- Adding the `pipeline_tag: audio-text-to-text`, making the model discoverable in relevant searches on the Hugging Face Hub (e.g., https://huggingface.co/models?pipeline_tag=audio-text-to-text).
- Specifying `library_name: transformers`, which enables the automated "how to use" widget on the model page with a standard `transformers` code snippet.
- Including the `language: en` tag, as indicated in the model's configuration.
- Linking the model to its official paper: [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418).
- Providing explicit links to the associated datasets (`LLaSO-Align`, `LLaSO-Instruct`, `LLaSO-Eval`) in the metadata and content.
- Adding a comprehensive overview of the LLaSO framework and the LLaSO-Base model's key features, adapted from the original GitHub repository.
- Including a practical code example for inference using the `transformers` library, which is directly compatible with the automated widget.
- Linking to the official GitHub repository for further details and code.

Please review these additions for accuracy and completeness.

Files changed (1): README.md (+136 / -3)
README.md CHANGED
@@ -1,3 +1,136 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: audio-text-to-text
+ library_name: transformers
+ language: en
+ datasets:
+ - YirongSun/LLaSO-Align
+ - YirongSun/LLaSO-Instruct
+ - YirongSun/LLaSO-Eval
+ ---
+
+ # LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models
+
+ This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).
+
+ LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.
+
+ <p align="center">
+ <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Align"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Align-16a085.svg" alt="HF Align"></a>
+ <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Instruct"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Instruct-1abc9c.svg" alt="HF Ins"></a>
+ <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Eval"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Eval-27ae60.svg" alt="HF Eval"></a>
+ <br>
+ <a href="https://huggingface.co/papers/2508.15418"><img src="https://img.shields.io/badge/arXiv-2508.15418-B31B1B.svg" alt="arXiv"></a>
+ <a href="https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct"><img src="https://img.shields.io/badge/HuggingFace-Model-ffcc00.svg" alt="HF Model"></a>
+ <a href="https://github.com/EIT-NLP/LLaSO"><img src="https://img.shields.io/github/stars/EIT-NLP/LLaSO?style=social" alt="GitHub Stars"></a>
+ </p>
+
+ * **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418)
+ * **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO)
+
+ ## 🔍 What is LLaSO?
+ **LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.**
+
+ The framework provides three essential data resources, together with a reference model (a sketch for loading the datasets from the Hub follows this list):
+ - **LLaSO-Align (12.0M):** An ASR-based alignment corpus for grounding speech in textual semantic space.
+ - **LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
+ - **LLaSO-Eval (15,044):** A reproducible benchmark for standardized evaluation, particularly for instruction-following and cross-modality generalization.
+ - **LLaSO-Base (3.8B):** This model, a two-stage-trained reference model adapted from LLaVA-style architectures for robust compositional understanding.
+
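+ All three datasets are hosted on the Hugging Face Hub and can be pulled with the 🤗 `datasets` library. Below is a minimal sketch; the split names and record fields are assumptions, so check each dataset card for the actual schema:
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the corpora so nothing has to be fully downloaded up front.
+ # The "train" split name is an assumption; see the dataset cards.
+ align = load_dataset("YirongSun/LLaSO-Align", split="train", streaming=True)
+ instruct = load_dataset("YirongSun/LLaSO-Instruct", split="train", streaming=True)
+ eval_bench = load_dataset("YirongSun/LLaSO-Eval", split="train", streaming=True)
+
+ # Peek at one record to inspect the actual fields before building a dataloader.
+ print(next(iter(eval_bench)))
+ ```
+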
+ <p align="center">
+ <img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/radar.png" width="600" alt="LLaSO overall performance">
+ </p>
+ <p align="center"><i>
+ LLaSO-Base achieves a strong normalized overall score on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
+ </i></p>
+
+ ## ✨ Key Features
+ - **Fully Open, End-to-End Stack:** Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
+ - **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
+ - **Stratified Evaluation (15,044):** Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
+ - **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research.
+ - **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.
+
+ <p align="center">
+ <img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/architecture_trim.png" width="350" alt="Architecture & Two-Stage Training (Figure 6)"><br>
+ <i>Architecture & Two-Stage Training</i>
+ </p>
+
+ ## 🚀 Usage
+
+ You can use this model with the `transformers` library. Here's a quick example for inference:
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ import librosa
+ import soundfile as sf
+ import os
+ import numpy as np
+
+ # Load the model and processor
+ model_path = "YirongSun/LLaSO-Base-3.8B-Instruct"
+ processor = AutoProcessor.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ model.eval()
+
+ # Example audio input (replace with your own audio file).
+ # For demonstration, create a 5-second dummy audio file.
+ dummy_audio_path = "dummy_audio.wav"
+ sr = 16000
+ duration = 5  # seconds
+ dummy_audio_data = (np.random.rand(sr * duration) * 0.5).astype(np.float32)
+ sf.write(dummy_audio_path, dummy_audio_data, sr)
+
+ # Load the audio and process it
+ audio, rate = librosa.load(dummy_audio_path, sr=sr)
+ audio_inputs = processor(audio=audio, sampling_rate=rate, return_tensors="pt")
+
+ # Example text prompt.
+ # The LLaSO models are Llama-3-based, so the prompt spells out the Llama-3 chat template.
+ # Audio placeholder tokens ("<audio_start>" / "<audio_end>") are usually handled internally
+ # when `audio_values` are passed, so here we pass `audio_values` separately,
+ # as is common for multimodal models.
+ prompt = (
+     "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+     "Transcribe the audio.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+ text_inputs = processor(text=prompt, return_tensors="pt")
+
+ # Combine text and audio inputs
+ inputs = {
+     "input_ids": text_inputs.input_ids.to(model.device),
+     "attention_mask": text_inputs.attention_mask.to(model.device),
+     "audio_values": audio_inputs.audio_values.to(model.device),
+ }
+
+ # Generate a response
+ with torch.inference_mode():
+     outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9)
+
+ # Decode and print
+ decoded_output = processor.decode(outputs[0], skip_special_tokens=True)
+ print(f"Generated Text: {decoded_output}")
+
+ # Clean up the dummy audio file
+ os.remove(dummy_audio_path)
+ ```
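+
+ The snippet above uses the text-instruction + audio-input configuration. LLaSO's other two modality configurations (audio instruction + text input, and pure audio) follow the same pattern by changing what goes into the text prompt versus the audio stream. Below is a hedged continuation of the example above; the exact prompt conventions are defined in the LLaSO GitHub repository, so treat these prompts as illustrative only:
+
+ ```python
+ # Audio instruction + text input (illustrative): the recording carries the spoken
+ # instruction, while the textual content to operate on goes into the user turn.
+ prompt_text_input = (
+     "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+     "Text input: The quick brown fox jumps over the lazy dog."
+     "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+
+ # Pure audio (illustrative): instruction and content are both spoken, so the user
+ # turn carries no additional text (an assumption; see the repository for details).
+ prompt_pure_audio = (
+     "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+     "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+ )
+
+ # Reuse the generation code from the example above, swapping in the chosen
+ # prompt and the corresponding audio clip.
+ ```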
+
+ For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO).
+
+ ## 📑 How to Cite
+ If you use LLaSO in your research or applications, please cite our paper:
+
+ ```bibtex
+ @misc{sun2025llaso,
+     title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
+     author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
+     year={2025},
+     eprint={2508.15418},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2508.15418},
+ }
+ ```