---
license: apache-2.0
language:
- en
- zh
- de
- ko
---

# HiggsAudio-V2: Redefining Expressiveness in Audio Generation

We are open-sourcing HiggsAudio-V2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data.
Despite having no post-training or fine-tuning, HiggsAudio-V2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.

On [EmergentTTS-Eval](https://github.com/boson-ai/emergenttts-eval-public), the model achieves win rates of **75.7%** and **55.7%** over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and the Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including automatic prosody adaptation during narration, zero-shot generation of natural multi-speaker dialogues in multiple languages, melodic humming with the cloned voice, and simultaneous generation of speech and background music.
Check our open-source repository https://github.com/boson-ai/higgs-audio for more details.

<p>
<img src="./emergent-tts-emotions-win-rate.png" width=900>
</p>

## Technical Details

<p>
<img src="./higgs_audio_v2_architecture_combined.png" width=900>
</p>

HiggsAudio-V2 adopts the "generation variant" depicted in the architecture figure above. Its strong performance is driven by three key technical innovations:

- We developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as AudioVerse. The in-house understanding model is fine-tuned on top of Higgs Audio V1 Understanding, which adopts the "understanding variant" shown in the architecture figure.
- We trained a unified audio tokenizer from scratch that captures both semantic and acoustic features.
- We proposed the DualFFN architecture, which enhances the LLM's ability to model acoustic tokens with minimal computational overhead.

### Audio Tokenizer

<p>
<img src="./higgs_audio_tokenizer_architecture.png" width=900>
</p>

We introduce a new discretized audio tokenizer that runs at just 25 frames per second while matching or even improving audio quality compared to tokenizers with twice the bitrate.
Our model is the first to be trained on 24 kHz data covering speech, music, and sound events in one unified system.
It also uses a simple non-diffusion encoder/decoder for fast batch inference, and it achieves state-of-the-art performance in semantic and acoustic evaluations.
Check https://huggingface.co/bosonai/higgs-audio-v2-tokenizer-staging for more information about the tokenizer.
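
As a rough illustration of what the 25 Hz frame rate means for sequence length (the number of residual codebooks below is an assumed example, not the tokenizer's actual configuration):

```python
# Back-of-the-envelope token budget at 25 frames per second.
# NUM_CODEBOOKS is an illustrative assumption, not the tokenizer's real setting.
FRAME_RATE_HZ = 25
NUM_CODEBOOKS = 8

clip_seconds = 10.0
frames = int(clip_seconds * FRAME_RATE_HZ)   # 250 frames for a 10 s clip
codes = frames * NUM_CODEBOOKS               # 2000 discrete codes for the LLM to model
print(f"{clip_seconds:.0f} s of audio -> {frames} frames -> {codes} audio codes")
```

Halving the frame rate relative to a 50 Hz tokenizer halves the audio sequence length the language model has to generate, which directly reduces the cost of long-form generation.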

### Model Architecture -- Dual FFN

HiggsAudio-V2 is built on top of [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B). To enhance the model's ability to process audio tokens,
we incorporate the "DualFFN" architecture as an audio adapter.
DualFFN acts as an audio-specific expert, boosting the LLM's performance with minimal computational overhead.
Our implementation preserves 91% of the original LLM's training speed with the inclusion of DualFFN, which has 2.2B parameters.
Thus, the total number of parameters for HiggsAudio-V2 is 3.6B (LLM) + 2.2B (Audio DualFFN), and it has the same training/inference FLOPs as Llama-3.2-3B.
Ablation studies show that the model equipped with DualFFN consistently outperforms its counterpart without it in terms of word error rate (WER) and speaker similarity.
See the [Higgs-Audio Architecture Blog](https://github.com/boson-ai/higgs-audio/blob/main/tech_blogs/ARCHITECTURE_BLOG.md) for more information.
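
To make the routing idea concrete, here is a minimal, self-contained sketch of a DualFFN-style decoder block in PyTorch. It is illustrative only (layer sizes, normalization, and masking details are assumptions, not the exact HiggsAudio-V2 implementation): text positions go through the original FFN while audio positions are routed through a parallel audio FFN.

```python
import torch
import torch.nn as nn


class DualFFNBlock(nn.Module):
    """Illustrative DualFFN-style decoder block (not the exact HiggsAudio-V2 code).

    Audio tokens are routed through a parallel audio-specific FFN, while text tokens
    keep using the original text FFN, so the audio expert adds capacity without
    changing what happens at text positions.
    """

    def __init__(self, d_model: int, d_ff_text: int, d_ff_audio: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff_text), nn.SiLU(), nn.Linear(d_ff_text, d_model)
        )
        self.audio_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff_audio), nn.SiLU(), nn.Linear(d_ff_audio, d_model)
        )

    def forward(self, x: torch.Tensor, audio_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); audio_mask: (batch, seq), True where the token is audio.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted for brevity
        x = x + attn_out
        h = self.norm2(x)
        # For clarity, both FFNs run on every position and the result is selected per token;
        # a real implementation would dispatch only the relevant positions to each FFN.
        ffn_out = torch.where(audio_mask.unsqueeze(-1), self.audio_ffn(h), self.text_ffn(h))
        return x + ffn_out
```

As a usage sketch, `DualFFNBlock(2048, 8192, 4096, 16)` could be applied to a mixed text/audio hidden-state sequence together with a boolean `audio_mask` that marks the audio positions.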

## Evaluation

Here is the performance of HiggsAudio-V2 on four benchmarks: [Seed-TTS Eval](https://github.com/BytedanceSpeech/seed-tts-eval), [Emotional Speech Dataset (ESD)](https://paperswithcode.com/dataset/esd), [EmergentTTS-Eval](https://arxiv.org/abs/2505.23009), and our Multi-speaker Eval.

#### Seed-TTS Eval & ESD

We prompt HiggsAudio-V2 with `<ref_text, ref_audio, text>` for zero-shot TTS (a sketch of this prompt layout follows the table below). We adopt the standard evaluation metrics of Seed-TTS Eval and ESD.

|                            | SeedTTS-Eval |           | ESD      |                 |
|----------------------------|--------------|-----------|----------|-----------------|
|                            | WER ↓        | SIM ↑     | WER ↓    | SIM (emo2vec) ↑ |
| Cosyvoice2                 | 2.28         | 65.49     | 2.71     | 80.48           |
| Qwen2.5-omni†              | 2.33         | 64.10     | -        | -               |
| ElevenLabs Multilingual V2 | **1.43**     | 50.00     | 1.66     | 65.87           |
| HiggsAudio V1              | 2.18         | 66.27     | **1.49** | 82.84           |
| HiggsAudio V2 (base)       | 2.44         | **67.70** | 1.78     | **86.13**       |
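
For reference, one way to picture the `<ref_text, ref_audio, text>` prompt is as a short ChatML-style exchange in which the reference transcript/audio pair establishes the target voice before the new text. The layout below is purely illustrative and the field names are hypothetical; the higgs-audio repository defines the actual message types used for voice cloning.

```python
# Hypothetical illustration of the <ref_text, ref_audio, text> zero-shot TTS prompt.
# Field names are made up for clarity; see the higgs-audio repo for the real message types.
ref_text = "Transcript of the reference clip goes here."   # transcript paired with the reference audio
ref_audio = "path/to/reference_clip.wav"                    # reference audio of the target voice
text = "New text to synthesize in the reference voice."     # what the model should speak

zero_shot_prompt = [
    {"role": "user", "content": ref_text},
    {"role": "assistant", "content": {"audio": ref_audio}},
    {"role": "user", "content": text},
]
```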

#### EmergentTTS-Eval ("Emotions" and "Questions")

Following the [EmergentTTS-Eval paper](https://arxiv.org/abs/2505.23009), we report the win rate over "gpt-4o-mini-tts" with the "alloy" voice. For HiggsAudio-V2, we report its performance when shallow-cloning the ["belinda"](examples/voice_prompts/belinda.wav) voice.

| Model | Emotions (%) ↑ | Questions (%) ↑ |
|------------------------------------|--------------|----------------|
| HiggsAudio-V2 (base) | **75.71%** | **55.71%** |
| [gpt-4o-audio-preview†](https://platform.openai.com/docs/models/gpt-4o-audio-preview) | 61.64% | 47.85% |
| [Hume.AI](https://www.hume.ai/research) | 61.60% | 43.21% |
| **BASELINE:** [gpt-4o-mini-tts](https://platform.openai.com/docs/models/gpt-4o-mini-tts) | 50.00% | 50.00% |
| [Qwen 2.5 Omni†](https://github.com/QwenLM/Qwen2.5-Omni) | 41.60% | 51.78% |
| [minimax/speech-02-hd](https://replicate.com/minimax/speech-02-hd) | 40.86% | 47.32% |
| [ElevenLabs Multilingual v2](https://elevenlabs.io/blog/eleven-multilingual-v2) | 30.35% | 39.46% |
| [DeepGram Aura-2](https://deepgram.com/learn/introducing-aura-2-enterprise-text-to-speech) | 29.28% | 48.21% |
| [Sesame csm-1B](https://github.com/SesameAILabs/csm) | 15.96% | 31.78% |

<sup><sub>'†' indicates the strong-prompting method described in the paper.</sub></sup>

#### Multi-speaker Eval

We also designed a multi-speaker evaluation benchmark to assess HiggsAudio-V2's ability to generate multi-speaker dialogues. The benchmark contains three subsets:

- `two-speaker-conversation`: 1000 synthetic dialogues involving two speakers. Each sample contains two reference audio clips to evaluate the model's ability at double voice cloning.
- `small talk`: 250 synthetic dialogues characterized by short utterances and a limited number of turns (4–6). Each sample also contains two reference audio clips to test double voice cloning, though the dialogues are shorter and simpler than those in `two-speaker-conversation`.
- `small talk (no ref)`: 250 synthetic dialogues, also with short utterances and 4–6 turns. Unlike the other subsets, it does not include reference audio and is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.

On these three subsets we report the word error rate (WER) and the geometric mean of intra-speaker similarity and inter-speaker dis-similarity (a small sketch of this metric follows the results table). Besides HiggsAudio-V2, we also evaluated [MoonCast](https://github.com/jzq2000/MoonCast) and [nari-labs/dia](https://github.com/nari-labs/dia). Results are summarized in the following table. See more details in the [multi-speaker evaluation blog](PLACEHOLDER).

|                                                   | two-speaker-conversation |                      | small talk |                      | small talk (no ref) |                      |
| ------------------------------------------------- | ------------------------ | -------------------- | ---------- | -------------------- | ------------------- | -------------------- |
|                                                   | WER ↓                    | Mean Sim & Dis-sim ↑ | WER ↓      | Mean Sim & Dis-sim ↑ | WER ↓               | Mean Sim & Dis-sim ↑ |
| [MoonCast](https://github.com/jzq2000/MoonCast)   | 38.77                    | 46.02                | **8.33**   | 63.68                | 24.65               | 53.94                |
| [nari-labs/dia](https://github.com/nari-labs/dia) | \-                       | \-                   | 17.62      | 63.15                | 19.46               | **61.14**            |
| HiggsAudio-V2 (base)                              | **18.88**                | **51.95**            | 11.89      | **67.92**            | **14.65**           | 55.28                |
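
The snippet below sketches how such a score can be computed from speaker-embedding cosine similarities. The exact formulation used by the benchmark (in particular how dis-similarity is defined and scaled) is an assumption here; treat it as illustrative rather than the official scoring script.

```python
import numpy as np


def multi_speaker_score(intra_sims, inter_sims):
    """Geometric mean of intra-speaker similarity and inter-speaker dis-similarity.

    intra_sims: cosine similarities (0-1) between utterances of the same speaker.
    inter_sims: cosine similarities (0-1) between utterances of different speakers.
    Defining dis-similarity as (1 - similarity) and reporting on a 0-100 scale are
    illustrative assumptions, not the benchmark's official definition.
    """
    intra_sim = float(np.mean(intra_sims))
    inter_dissim = 1.0 - float(np.mean(inter_sims))
    return 100.0 * float(np.sqrt(intra_sim * inter_dissim))


# Example: consistent voices within each speaker and distinct voices across speakers score higher.
print(multi_speaker_score(intra_sims=[0.82, 0.79, 0.85], inter_sims=[0.35, 0.41, 0.38]))
```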

## Get Started

You first need to install the [higgs-audio codebase](https://github.com/boson-ai/higgs-audio):

```bash
git clone https://github.com/boson-ai/higgs-audio.git
cd higgs-audio
python3 -m venv higgs_audio_env
source higgs_audio_env/bin/activate
pip install -r requirements.txt
pip install -e .
```

Afterwards, you can launch generation examples via the `examples/generation.py` script provided in the repository:

```bash
python3 examples/generation.py \
    --transcript examples/transcript/single_speaker/en_basic.txt \
    --ref_audio belinda \
    --seed 12345
```

Alternatively, here is a Python script you can use to convert text to speech once higgs-audio is installed.
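
The snippet below is a minimal sketch based on the serve-engine interface in the [higgs-audio repository](https://github.com/boson-ai/higgs-audio). The import paths, class names, and the model/tokenizer repository IDs shown here are assumptions; cross-check them against the repository's README before running.

```python
import torch
import torchaudio

# The imports and names below are assumed from the higgs-audio codebase and may differ;
# consult https://github.com/boson-ai/higgs-audio for the authoritative example.
from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine
from boson_multimodal.data_types import ChatMLSample, Message

MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"   # assumed model repo id
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"  # assumed tokenizer repo id

device = "cuda" if torch.cuda.is_available() else "cpu"
engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)

messages = [
    Message(role="system", content="Generate audio following instruction."),
    Message(role="user", content="The sun rises in the east and sets in the west."),
]

output = engine.generate(
    chat_ml_sample=ChatMLSample(messages=messages),
    max_new_tokens=1024,
    temperature=0.3,
)

# Save the generated waveform to disk.
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)
```

For voice cloning, multi-speaker dialogue, and other scenarios, see the ready-made scripts under `examples/` in the repository.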

## License

TBA