|
|
--- |
|
|
library_name: chatterbox |
|
|
tags: |
|
|
- chatterbox |
|
|
- text-to-speech |
|
|
- tts |
|
|
- german |
|
|
- kartoffel |
|
|
- speech generation |
|
|
- voice-cloning |
|
|
language: |
|
|
- de |
|
|
base_model: |
|
|
- ResembleAI/chatterbox |
|
|
pipeline_tag: text-to-speech |
|
|
license: cc-by-nc-nd-4.0 |
|
|
--- |
|
|
|
|
|
# Kartoffel-TTS (Based on Chatterbox) - German Text-to-Speech |
|
|
> The model is still in development and was only trained on 600k samples, without emotion classification, on my two RTX 3090s. I am currently preparing a larger dataset (>2.5M samples) and classifying the exaggeration levels.
|
|
|
|
|
## Updates |
|
|
- **v0.2**: |
|
|
- Added preview support for vocal expressions. Supported tags: `<haha>`, `<hahaha>`, `<hahahaha>`, `<chuckle>`, `<wuhuuu>`, `<wow>`, `<hmm_neugierig>`, `<hmph>`, `<huh>`, `<ohhh>`, `<oooh>`, `<ughh>`, `<eeehhh>`, `<aaaaaaah>`, `<aaach>`.
|
|
- Adjusted the file structure to match the original Chatterbox layout (original s3, ve, etc.). The only fine-tuned file is `t3_cfg.safetensors`; this should simplify usage with different libraries.
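Since unsupported tags are passed through as plain text, it can help to validate input before synthesis. Here is a minimal sketch; `check_expression_tags` is a hypothetical helper, not part of the Chatterbox API:

```python
import re

# Preview vocal-expression tags supported by v0.2 (taken from the list above).
SUPPORTED_TAGS = {
    "<haha>", "<hahaha>", "<hahahaha>", "<chuckle>", "<wuhuuu>", "<wow>",
    "<hmm_neugierig>", "<hmph>", "<huh>", "<ohhh>", "<oooh>", "<ughh>",
    "<eeehhh>", "<aaaaaaah>", "<aaach>",
}

def check_expression_tags(text: str) -> list[str]:
    """Return any angle-bracket tags in `text` that are not in the supported set."""
    return [tag for tag in re.findall(r"<[^<>]+>", text) if tag not in SUPPORTED_TAGS]

text = "Das war <haha> wirklich ein toller Abend, <wow> einfach unglaublich!"
print(check_expression_tags(text))          # [] -> all tags supported
print(check_expression_tags("Oh <lol> nein"))  # ['<lol>'] -> unsupported tag flagged
```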
|
|
|
|
|
|
|
|
|
|
|
<video src="https://huggingface.co/SebastianBodza/Kartoffelbox-v0.1/resolve/main/demo_kartoffelbox.mp4" alt="Demo Video" width="400" controls></video> |
|
|
|
|
|
<div style="display: flex;align-items: center; gap: 12px"> |
|
|
<a target="_blank" href="https://huggingface.co/spaces/SebastianBodza/Kartoffelbox"> |
|
|
<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/> |
|
|
</a> |
|
|
|
|
|
<a href="https://colab.research.google.com/drive/1ZNT08zrEuAeuH3VrsaMHeeqZFcZR8sHU?usp=sharing" rel="nofollow"> |
|
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
## Background

- The model has been rebuilt on **Chatterbox**, Resemble AI's open-source TTS framework, which enables **emotion exaggeration control** and improves generation stability.
|
|
|
|
|
## Model Overview |
|
|
|
|
|
Kartoffel-TTS is a German text-to-speech (TTS) model family based on **Chatterbox**, designed for natural and expressive speech synthesis. The model supports **emotion exaggeration control** and voice cloning.
|
|
|
|
|
### Key Features: |
|
|
1. **Emotion Exaggeration Control**: Adjust the intensity of emotions in speech, from subtle to dramatic. |
|
|
2. **Expressive Speech**: Capable of producing speech with different emotional tones and expressions. |
|
|
3. **Fine-Tuned for German**: Optimized for German language synthesis with a focus on naturalness and clarity. |
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
Install the required libraries: |
|
|
|
|
|
```bash |
|
|
pip install chatterbox-tts |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage Example |
|
|
|
|
|
Here’s how to generate speech using Kartoffel-TTS: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import soundfile as sf |
|
|
from chatterbox.tts import ChatterboxTTS |
|
|
from huggingface_hub import hf_hub_download |
|
|
from safetensors.torch import load_file |
|
|
|
|
|
MODEL_REPO = "SebastianBodza/Kartoffelbox-v0.1" |
|
|
T3_CHECKPOINT_FILE = "t3_cfg.safetensors" |
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
|
|
model = ChatterboxTTS.from_pretrained(device=device) |
|
|
|
|
|
print("Downloading and applying German patch...") |
|
|
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=T3_CHECKPOINT_FILE) |
|
|
|
|
|
t3_state = load_file(checkpoint_path, device="cpu") |
|
|
|
|
|
model.t3.load_state_dict(t3_state) |
|
|
print("Patch applied successfully.") |
|
|
|
|
|
|
|
|
text = "Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand." |
|
|
|
|
|
reference_audio_path = "/content/uitoll.mp3"  # path to a reference clip of the target voice
|
|
output_path = "output_cloned_voice.wav" |
|
|
|
|
|
print("Generating speech...") |
|
|
with torch.inference_mode(): |
|
|
wav = model.generate( |
|
|
text, |
|
|
audio_prompt_path=reference_audio_path, |
|
|
exaggeration=0.5, |
|
|
temperature=0.6, |
|
|
cfg_weight=0.3, |
|
|
) |
|
|
|
|
|
sf.write(output_path, wav.squeeze().cpu().numpy(), model.sr) |
|
|
print(f"Audio saved to {output_path}") |
|
|
``` |
|
|
|
|
|
## Contributing |
|
|
|
|
|
To improve the model further, additional high-quality German audio data with good transcripts is needed, especially for sounds like laughter, sighs, and other non-verbal expressions. Short audio clips (up to 60 seconds) with accurate transcriptions are particularly valuable.
|
|
|
|
|
For those with ideas or access to relevant data, collaboration opportunities are always welcome. Reach out to discuss potential contributions. |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
This model builds on the following technologies: |
|
|
- **Chatterbox** by Resemble AI |
|
|
- **CosyVoice**
|
|
- **HiFT-GAN** |
|
|
- **Llama** |
|
|
- **S3Tokenizer** |