F5-TTS German Fine-tuned Model
⚠️ Work in Progress: This model is still under development and optimization. We are actively seeking feedback from the community to improve its performance. Please share your experiences, issues, and suggestions!
Model Description
This is a German fine-tune of F5-TTS, trained on German voice datasets. F5-TTS is a text-to-speech system built on a Diffusion Transformer (DiT) and trained with flow matching to produce high-quality, natural-sounding speech.
Key Features
- Language: German text-to-speech synthesis
- Architecture: DiT (Diffusion Transformer) with ConvNeXt V2
- Sample Rate: 24 kHz
- Vocoder: Vocos for high-quality audio generation
- Tokenization: Custom character-level tokenization for German text
Model Details
- Base Model: F5TTS_v1_Base
- Fine-tuning Dataset: Combined German voice dataset with character-level tokenization
- Training Steps: ~298,000 steps
- Vocabulary Size: 2,546 characters
- Model Size: ~1.3GB (inference-optimized)
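Character-level tokenization can be illustrated with a minimal sketch. The toy vocabulary below is hypothetical (the real vocab.txt holds 2,546 entries), and mapping unknown characters to index 0 is an assumption made for illustration, not a confirmed detail of this model's tokenizer.

```python
def build_vocab(tokens):
    """Map each token to its position in the list (its integer id)."""
    return {token: index for index, token in enumerate(tokens)}


def encode(text, vocab, unknown_id=0):
    """Convert text to one id per character; unknown characters get unknown_id."""
    return [vocab.get(char, unknown_id) for char in text]


# Hypothetical toy vocabulary; the real vocab.txt is much larger.
TOY_TOKENS = [" ", "a", "e", "h", "l", "o", "r", "s", "t", "ß", "ä", "ö", "ü"]
toy_vocab = build_vocab(TOY_TOKENS)

ids = encode("straße", toy_vocab)
print(ids)  # one id per character, with ß treated as a single token
```

Because every character is its own token, umlauts and ß need no special handling beyond being present in the vocabulary.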
Installation
# Install F5-TTS
pip install f5-tts
# Or install from source for latest features
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .
Usage
Quick Start with Hugging Face Hub
import torch
import torchaudio
from f5_tts.api import F5TTS
from huggingface_hub import hf_hub_download
# Download model files from Hugging Face
model_file = hf_hub_download(
repo_id="tabularisai/f5-tts-german-voice-clone",
filename="model.pt"
)
vocab_file = hf_hub_download(
repo_id="tabularisai/f5-tts-german-voice-clone",
filename="vocab.txt"
)
# Initialize the German F5-TTS model
f5tts = F5TTS(
model="F5TTS_v1_Base", # Use the base architecture
ckpt_file=model_file, # Downloaded model weights
vocab_file=vocab_file, # German vocabulary
device="cuda" if torch.cuda.is_available() else "cpu"
)
# German text to synthesize
text = "Hallo, ich bin ein deutsches Text-zu-Sprache-System. Wie kann ich Ihnen heute helfen?"
# Reference audio
ref_audio_path = "reference_german_voice.wav"
ref_text = "Dies ist eine Referenzaufnahme für die Stimmenklonierung."
# Generate speech
audio, sample_rate, spec = f5tts.infer( # Returns waveform, sample rate, and spectrogram
gen_text=text,
ref_file=ref_audio_path,
ref_text=ref_text,
remove_silence=True,
file_wave="output_german.wav",
)
Advanced Usage
# For longer texts, tune the inference parameters (works with both Hugging Face and local files)
audio, sample_rate, _ = f5tts.infer(
gen_text=text,
ref_file=ref_audio_path,
ref_text=ref_text,
nfe_step=32, # Number of function evaluations (higher = better quality, slower)
cfg_strength=2.0, # Classifier-free guidance strength
sway_sampling_coef=-1.0, # Sway sampling coefficient
speed=1.0, # Speaking rate (1.0 = normal speed)
remove_silence=True,
cross_fade_duration=0.15 # Cross-fade in seconds, for smoother concatenation
)
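To see how nfe_step and sway_sampling_coef interact, the sketch below computes the flow time grid using the sway sampling formula from the F5-TTS paper, t + s * (cos(pi/2 * t) - 1 + t). It is an illustration only: the function name sway_time_grid is hypothetical, the use of nfe_step + 1 grid points is an assumption, and nothing here calls the library.

```python
import math


def sway_time_grid(nfe_step, sway_sampling_coef):
    """Return nfe_step + 1 flow times in [0, 1], warped by sway sampling."""
    if nfe_step < 1:
        raise ValueError("nfe_step must be at least 1")
    grid = []
    for i in range(nfe_step + 1):
        t = i / nfe_step  # uniform spacing before warping
        warped = t + sway_sampling_coef * (math.cos(math.pi / 2 * t) - 1 + t)
        grid.append(warped)
    return grid


uniform = sway_time_grid(32, 0.0)  # coefficient 0 leaves the grid uniform
swayed = sway_time_grid(32, -1.0)  # negative coefficient: smaller early steps

print(f"first uniform step: {uniform[1] - uniform[0]:.4f}")
print(f"first swayed step:  {swayed[1] - swayed[0]:.4f}")
```

With a coefficient of -1.0 the grid reduces to 1 - cos(pi/2 * t), so more of the function evaluations are spent near the start of the flow, while the endpoints 0 and 1 stay fixed.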
Command Line Usage
# Using the F5-TTS CLI with the German model
f5-tts_infer-cli \
--model F5TTS_v1_Base \
--ckpt_file path/to/model.pt \
--vocab_file path/to/vocab.txt \
--ref_audio reference_german.wav \
--ref_text "Referenztext für die Stimme" \
--gen_text "Zu synthetisierender deutscher Text" \
--output_file output_german.wav
Voice Cloning
The model supports voice cloning with German reference audio:
# Use a German reference voice
ref_audio = "my_german_voice_sample.wav"
ref_text = "Das ist ein Beispieltext meiner Stimme."
# Clone the voice for new German text
new_text = "Jetzt spreche ich mit der geklonten Stimme diesen neuen Text."
audio, sr, _ = f5tts.infer(gen_text=new_text, ref_file=ref_audio, ref_text=ref_text)
Model Performance
Supported Text Features
- ✅ German characters and umlauts (ä, ö, ü, ß)
- ✅ Numbers and punctuation
- ✅ Special characters
- ✅ Mixed case text
- ⚠️ Limited support for non-German characters
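Since non-German characters have limited support, it can help to check a text against the vocabulary before synthesis. The following is a minimal sketch, assuming vocab.txt stores one token per line; load_vocab and unsupported_characters are hypothetical helper names, and the mini vocabulary is made up for the demonstration.

```python
def load_vocab(path):
    """Read a vocabulary file with one token per line into a set."""
    with open(path, encoding="utf-8") as handle:
        return {line.rstrip("\n") for line in handle}


def unsupported_characters(text, vocab):
    """Return the sorted, de-duplicated characters of text missing from vocab."""
    return sorted({char for char in text if char not in vocab})


# Hypothetical mini vocabulary; in practice use load_vocab on the downloaded vocab.txt.
demo_vocab = set(" abcdefghijklmnopqrstuvwxyzäöüß.,!?")

missing = unsupported_characters("grüße aus łódź!", demo_vocab)
print(missing)  # characters to replace or transliterate before synthesis
```

An empty result means every character of the input is covered by the vocabulary.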
Audio Quality
- Sample Rate: 24 kHz
- Bit Depth: 16-bit
- Quality: High-quality neural vocoding with Vocos
- Latency: Real-time capable on modern GPUs
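The sample rate and bit depth of a generated file can be verified with Python's standard wave module. This sketch assumes the output is an uncompressed PCM WAV file; wav_properties is a hypothetical helper, and the demo runs on a synthetic silent file rather than real model output.

```python
import os
import tempfile
import wave


def wav_properties(path):
    """Return sample rate, bit depth, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        return {
            "sample_rate": sample_rate,
            "bit_depth": wav_file.getsampwidth() * 8,
            "channels": wav_file.getnchannels(),
            "duration_s": wav_file.getnframes() / sample_rate,
        }


# Demo on a synthetic half-second silent file; point this at your own output instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(24000)  # 24 kHz
        wav_file.writeframes(b"\x00\x00" * 12000)
    props = wav_properties(demo_path)

print(props)
```

For a file such as output_german.wav, the expected values would be a sample rate of 24000 and a bit depth of 16.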
Limitations and Known Issues
- Language Specific: Optimized for German text only
- Training Data: Limited to specific German voice datasets
- Accent Variation: May not capture all German regional accents
- Performance: Requires GPU for real-time inference
- Development Status: Still in active development
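Whether inference is real-time on a given machine can be measured with the real-time factor (RTF): synthesis time divided by the duration of the generated audio. The sketch below is a hedged illustration; real_time_factor, timed, and fake_synthesize are hypothetical names, and the stand-in function should be swapped for an actual synthesis call.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF below 1.0 means audio is generated faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_synthesize():
    """Stand-in for a real synthesis call: returns 2 s of silent 24 kHz audio."""
    return [0.0] * 48000, 24000


(samples, sample_rate), elapsed = timed(fake_synthesize)
rtf = real_time_factor(elapsed, len(samples) / sample_rate)
print(f"RTF = {rtf:.4f} ({'faster' if rtf < 1 else 'slower'} than real time)")
```

When reporting benchmarks, it helps to state the hardware, the nfe_step value, and the text length alongside the RTF.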
Contributing and Feedback
We need your help! This model is still being refined and we're looking for:
- 🗣️ Audio Quality Feedback: How does the generated speech sound?
- 📝 Text Handling: Issues with specific German words or phrases?
- 🐛 Bug Reports: Technical issues or errors
- 💡 Feature Requests: What would make this model more useful?
- 📊 Performance Reports: Speed and quality benchmarks
- 🎯 Use Case Examples: How are you using this model?
How to Provide Feedback
- GitHub Issues: Report bugs or request features in the original F5-TTS repository
- Audio Samples: Share problematic or excellent generation examples
- Benchmarks: Compare with other German TTS systems
- Documentation: Help improve usage instructions
Model Card
Property | Value |
---|---|
Language | German (Deutsch) |
Model Type | Text-to-Speech (Flow Matching) |
Architecture | DiT (Diffusion Transformer) |
Parameters | ~336M parameters |
Training Data | Combined German voice datasets |
Vocabulary | 2,546 character tokens |
Sample Rate | 24 kHz |
Citation
If you use this model in your research, please cite the original F5-TTS paper:
@article{chen2024f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Chen, Yushen and others},
journal={arXiv preprint arXiv:2410.06885},
year={2024}
}
Acknowledgments
- Original F5-TTS team for the excellent framework
- German voice dataset contributors
- The open-source community for feedback and improvements
Contact
For questions, feedback, or collaboration:
- Open an issue in the F5-TTS repository
- Join the community discussions
- Share your experiences with German TTS
[email protected]
Status: 🚧 Under Development - Seeking Community Feedback 🚧