|
--- |
|
tags: |
|
- text-to-speech |
|
license: cc-by-nc-sa-4.0 |
|
language: |
|
- zh |
|
- en |
|
- de |
|
- ja |
|
- fr |
|
- es |
|
- ko |
|
- ar |
|
- nl |
|
- ru |
|
- it |
|
- pl |
|
- pt |
|
pipeline_tag: text-to-speech |
|
inference: false |
|
extra_gated_prompt: >- |
|
You agree to not use the model to generate contents that violate DMCA or local |
|
laws. |
|
extra_gated_fields: |
|
Country: country |
|
Specific date: date_picker |
|
I agree to use this model for non-commercial use ONLY: checkbox |
|
--- |
|
|
|
|
|
# OpenAudio S1 |
|
|
|
**OpenAudio S1** is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages. |
|
|
|
Supported languages: |
|
- English (en) |
|
- Chinese (zh) |
|
- Japanese (ja) |
|
- German (de) |
|
- French (fr) |
|
- Spanish (es) |
|
- Korean (ko) |
|
- Arabic (ar) |
|
- Russian (ru) |
|
- Dutch (nl) |
|
- Italian (it) |
|
- Polish (pl) |
|
- Portuguese (pt) |
|
|
|
Please refer to [Fish Speech Github](https://github.com/fishaudio/fish-speech) for more info. |
|
Demo available at [Fish Audio Playground](https://fish.audio). |
|
Visit the [OpenAudio website](https://openaudio.com) for blog & tech report. |
|
|
|
## Emotion and Tone Support |
|
|
|
OpenAudio S1 supports a variety of emotional, tone, and special markers to enhance speech synthesis: |
|
|
|
**1. Emotional markers:** |
|
(angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused) |
|
|
|
**2. Tone markers:** |
|
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone) |
|
|
|
**3. Special markers:** |
|
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing) |
|
|
|
**Special markers with corresponding onomatopoeia:** |
|
- Laughing: Ha,ha,ha |
|
- Chuckling: Hmm,hmm |
|
|
|
## Model Variants and Performance |
|
|
|
OpenAudio S1 includes the following models: |
|
- **S1 (4B, proprietary):** The full-sized model. |
|
- **S1-mini (0.5B):** A distilled version of S1. |
|
|
|
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF). |
|
|
|
**Seed TTS Eval Metrics (English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM):** |
|
|
|
- **S1:** |
|
- WER (Word Error Rate): **0.008** |
|
- CER (Character Error Rate): **0.004** |
|
- Distance: **0.332** |
|
- **S1-mini:** |
|
- WER (Word Error Rate): **0.011** |
|
- CER (Character Error Rate): **0.005** |
|
- Distance: **0.380** |
|
|
|
## License |
|
|
|
This model is permissively licensed under the CC-BY-NC-SA-4.0 license. |
|
|