---
license: apache-2.0
datasets:
- wenetspeech
- gigaspeech
- common_voice
- iemocap
- crema-d
- meld
- ravdess
- tess
- dailytalk
- aishell-1
- emotiontalk
- cs-dialogue
- voxceleb2
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: audio-text-to-text
tags:
- speech
- speech-llm
- audio
- instruction-free
- paralinguistic
---
|
|
|
|
|
# AZeroS |
|
|
|
|
|
**AZeroS** (Auden Zero-instruction-tuned Speech-LLM) extends a frozen LLM to speech via
**Self-Generated Instruction-Free Tuning (SIFT)**. Both the LLM and the audio encoders stay frozen;
only lightweight projection modules are trained on speech–text pairs. This yields strong semantic
and paralinguistic performance at modest training cost while generalizing well to unseen instructions.
|
|
|
|
|
🔗 **Paper**: https://arxiv.org/pdf/2601.06086 |
|
|
🔗 **Code**: https://github.com/AudenAI/Auden/tree/main/examples/azeros |
|
|
🔗 **Model**: https://huggingface.co/AudenAI/azeros |
|
|
🔗 **Auden Repo**: https://github.com/AudenAI/Auden |
|
|
|
|
|
## 🔍 What Can This Model Do? |
|
|
|
|
|
- 🎙️ **Speech understanding** (semantic content understanding and dialog) |
|
|
- 😊 **Paralinguistic analysis** (emotion, age, gender, etc.; example prompts below)
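
Both capabilities are driven purely by the prompt, with no task-specific heads. A minimal sketch contrasting the two, reusing the `AZerosModel` interface from the Quick Start below; the prompt wording here is illustrative, not prescribed by the model:

```python
import torch
from model import AZerosModel  # from examples/azeros in the Auden repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AZerosModel.from_pretrained("AudenAI/azeros").to(device)

# Illustrative prompts for the two capability families; any natural-language
# instruction works, since AZeroS is tuned instruction-free.
semantic = f"{model.audio_token_wrapped} Summarize what the speaker is saying."
paralinguistic = f"{model.audio_token_wrapped} Describe the speaker's emotion, age, and gender."

for prompt in (semantic, paralinguistic):
    messages = [[{"role": "user", "content": prompt}]]
    print(model.generate(["speech1.wav"], messages, max_new_tokens=200, do_sample=False))
```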
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python
import torch

# `AZerosModel` is defined in examples/azeros of the Auden repo; run this
# script from that directory (see "Auden Setup" below).
from model import AZerosModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AZerosModel.from_pretrained("AudenAI/azeros").to(device)

# One chat-style message list per audio file; `model.audio_token_wrapped`
# marks where the audio is spliced into the prompt.
wav_files = ["speech1.wav", "speech2.wav"]
messages = [
    [
        {
            "role": "user",
            "content": f"{model.audio_token_wrapped} Please analyze speech content and paralinguistic information.",
        }
    ]
    for _ in wav_files
]

# Greedy decoding: beam search and sampling are both disabled.
generate_config = {
    "max_new_tokens": 200,
    "num_beams": 1,
    "do_sample": False,
    "min_length": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "top_p": None,
    "top_k": None,
    "temperature": None,
}

outputs = model.generate(wav_files, messages, **generate_config)
print(outputs)
```
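
The model expects 16 kHz input audio (see Model Characteristics below). If your files use a different sample rate, resample them first; a minimal sketch using `torchaudio`, where the dependency and file names are assumptions rather than part of the AZeroS API:

```python
import torchaudio
import torchaudio.functional as F

# Load at the native rate, then resample to the 16 kHz the model expects.
waveform, sr = torchaudio.load("speech_44k.wav")
if sr != 16000:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("speech1.wav", waveform, 16000)
```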
|
|
|
|
|
## Auden Setup (Required) |
|
|
|
|
|
This model relies on the Auden codebase for loading and inference; the Quick Start above should be run from `examples/azeros`:
|
|
|
|
|
```bash
git clone https://github.com/AudenAI/Auden.git
cd Auden
pip install -e .
cd examples/azeros
```
|
|
|
|
|
## 📌 Model Characteristics |
|
|
|
|
|
- Input: Raw audio waveform (16 kHz) or text |
|
|
- Output: Text responses about the input
|
|
- Backbone LLM: Qwen2.5-7B-Instruct
|
|
- Encoders: [TTA](https://huggingface.co/AudenAI/auden-encoder-tta-m10) and [Auden-Voice](https://huggingface.co/AudenAI/auden-encoder-voice) |
|
|
- Architecture: Frozen LLM + frozen audio encoders + lightweight projection modules (sketched below)
|
|
- Training paradigm: Self-Generated Instruction-Free Tuning (SIFT) |
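
A minimal PyTorch sketch of how these pieces compose. This is an illustrative reading of the bullets above, not the actual Auden implementation: the class name, dimensions, and the simple linear projection are assumptions.

```python
import torch
import torch.nn as nn

class SpeechLLMSketch(nn.Module):
    """Frozen LLM + frozen audio encoder, bridged by a trainable projection."""

    def __init__(self, llm, audio_encoder, enc_dim=1024, llm_dim=3584):
        super().__init__()
        self.llm = llm                      # frozen Qwen2.5-7B-Instruct
        self.audio_encoder = audio_encoder  # frozen TTA / Auden-Voice encoder
        for p in self.llm.parameters():
            p.requires_grad = False
        for p in self.audio_encoder.parameters():
            p.requires_grad = False
        # The only trainable part: maps encoder frames into the LLM embedding space.
        self.projection = nn.Linear(enc_dim, llm_dim)

    def forward(self, waveform, text_embeds):
        # (batch, frames, enc_dim) -> (batch, frames, llm_dim)
        audio_embeds = self.projection(self.audio_encoder(waveform))
        # Splice projected audio frames in front of the text embeddings, where
        # the audio placeholder token would sit in the prompt.
        inputs = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```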
|
|
|
|
|
## 📊 Evaluation |
|
|
|
|
|
### VoiceBench |
|
|
|
|
|
| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | BBH | AdvBench | IFEval | OBQA | MMSU | Overall |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Text-Only Model** | | | | | | | | | | |
| Qwen2.5 | 4.66 | 4.55 | 4.62 | 62.03 | 80.00 | 99.04 | 70.14 | 84.84 | 71.57 | 82.69 |
| Qwen2.5 (TN) | 4.61 | 4.53 | 4.56 | 63.84 | 56.30 | 98.85 | 66.11 | 74.07 | 64.51 | 77.52 |
| **Cascaded System** | | | | | | | | | | |
| Whisper+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 87.20 | 98.27 | 76.51 | 92.97 | 81.69 | 87.80 |
| Whisper+Qwen2.5 | 4.64 | 4.33 | 4.21 | 58.50 | 52.85 | 98.27 | 63.99 | 78.24 | 69.00 | 76.05 |
| **End-to-End Speech-LLM** | | | | | | | | | | |
| GPT-4o | 4.78 | 4.49 | 4.58 | 75.50 | 84.10 | 98.65 | 76.02 | 89.23 | 80.25 | 86.75 |
| Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 47.40 | 44.23 | 10.12 | 25.93 | 24.04 | 29.51 |
| Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 61.80 | 100.00 | 45.35 | 65.93 | 42.19 | 64.32 |
| GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 52.80 | 88.08 | 25.92 | 53.41 | 39.75 | 56.48 |
| Qwen2-Audio | 3.42 | 3.29 | 2.76 | 31.65 | 53.00 | 99.04 | 26.35 | 48.35 | 36.14 | 53.77 |
| DeSTA2.5 | 3.73 | 2.52 | 3.30 | 46.47 | 62.40 | 97.69 | 65.47 | 72.75 | 58.56 | 66.04 |
| Qwen2.5-Omni | 3.88 | 3.77 | 3.52 | 46.75 | 63.70 | 97.31 | 40.19 | 81.54 | 61.45 | 68.26 |
| Qwen3-Omni-30B | 4.74 | 4.54 | 4.58 | 76.90 | 80.40 | 99.30 | 77.80 | 89.70 | 68.10 | **85.49** |
| **AZeroS (ours)** | 4.44 | 4.18 | 3.91 | 60.22 | 56.30 | 98.65 | 61.29 | 72.09 | 59.01 | **73.13** |
|
|
|
|
|
### AIR-Bench
|
|
|
|
|
| Model | Gender | Emotion | Age | LID | Entity | Intent | Avg | Chat |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Cascaded System** | | | | | | | | |
| Whisper+GPT-4o | 21.90 | 59.50 | 41.10 | 96.80 | 69.80 | 87.70 | 62.80 | 7.54 |
| Whisper+Qwen2.5 | 28.36 | 50.80 | 36.40 | 88.00 | 73.60 | 82.70 | 59.98 | 7.34 |
| **End-to-End Speech-LLM** | | | | | | | | |
| GPT-4o | * | 49.10 | * | 76.00 | 61.60 | 85.80 | * | 7.53 |
| Gemini2.5-pro | 90.70 | 60.70 | 34.10 | 99.10 | 68.50 | 92.20 | 74.22 | 8.52 |
| SALMONN | 35.50 | 29.90 | 48.70 | 28.10 | 51.70 | 36.70 | 38.43 | 6.16 |
| GLM-4-Voice | 23.91 | 22.95 | 18.70 | 25.40 | 27.90 | 21.10 | 23.33 | 5.53 |
| Qwen2-Audio | 64.71 | 48.15 | 23.10 | 77.80 | 87.00 | 84.70 | 64.24 | 7.20 |
| DeSTA2.5 | 84.24 | 64.30 | 65.60 | 97.30 | 65.20 | 83.70 | 76.72 | 7.57 |
| Qwen2.5-Omni | 89.76 | 54.85 | 44.80 | 89.70 | 79.70 | 88.60 | 74.57 | 6.97 |
| Qwen3-Omni-30B | 91.11 | 62.20 | 36.90 | 97.70 | 80.40 | 90.70 | **76.50** | **7.85** |
| **AZeroS (ours)** | 86.75 | 71.45 | 61.30 | 84.80 | 73.60 | 85.60 | **77.25** | **8.28** |
|
|
|
|
|
*For multiple-choice tasks, an additional prompt is appended to keep the output format consistent: “Please make your choice among A/B/C/D and do not output other texts.”*
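
A sketch of how that suffix could be appended, following the message format from the Quick Start; the question itself is a made-up example, and `model` is assumed to be loaded as shown above:

```python
# Hypothetical multiple-choice question; the appended suffix is the one quoted above.
question = "What is the speaker's emotion? A. happy B. sad C. angry D. neutral"
choice_suffix = "Please make your choice among A/B/C/D and do not output other texts."
messages = [[{
    "role": "user",
    "content": f"{model.audio_token_wrapped} {question} {choice_suffix}",
}]]
outputs = model.generate(["speech1.wav"], messages, max_new_tokens=10, do_sample=False)
```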
|
|
|
|
|
## ⚠️ Limitations |
|
|
|
|
|
- Trained on public datasets; performance may degrade on out-of-domain audio. |
|
|
- Not designed for safety-critical applications. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use AZeroS in your research, please cite: |
|
|
|
|
|
```bibtex
@article{shao2026azeros,
  title={AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning},
  author={Shao, Yiwen and Liu, Wei and Li, Jiahong and Wang, Tianzi and Wei, Kun and Yu, Meng and Yu, Dong},
  journal={arXiv preprint arXiv:2601.06086},
  year={2026}
}
```
|
|
|