F5-TTS German Fine-tuned Model

⚠️ Work in Progress: This model is still under development and optimization. We are actively seeking feedback from the community to improve its performance. Please share your experiences, issues, and suggestions!

Model Description

This is a German fine-tuned version of F5-TTS, trained on German voice datasets. F5-TTS is a text-to-speech system built on a Diffusion Transformer (DiT) and trained with flow matching to produce natural-sounding speech.

Key Features

Language: German text-to-speech synthesis
Architecture: DiT (Diffusion Transformer) with ConvNeXt V2
Sample Rate: 24 kHz
Vocoder: Vocos for high-quality audio generation
Tokenization: Custom character-level tokenization for German text
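
To make the character-level tokenization concrete, here is a minimal sketch of how a vocabulary file with one token per line could be turned into an index mapping. The helper names (build_vocab_map, encode_chars), the miniature in-memory vocabulary, and the use of index 0 as the fallback for unknown characters are assumptions for illustration, not the model's actual preprocessing code.

```python
def build_vocab_map(vocab_lines):
    """Map each token to its line index, assuming one token per line."""
    vocab_map = {}
    for index, line in enumerate(vocab_lines):
        token = line.rstrip("\n")
        vocab_map[token] = index
    return vocab_map


def encode_chars(text, vocab_map, unknown_index=0):
    """Encode text one character at a time; unknown characters map to unknown_index."""
    return [vocab_map.get(ch, unknown_index) for ch in text]


# Hypothetical miniature vocabulary; the real vocab.txt has 2,546 entries.
demo_vocab = [" ", "a", "b", "c", "ä", "ö", "ü", "ß", "s", "t", "r", "e"]
demo_map = build_vocab_map(demo_vocab)
demo_ids = encode_chars("straße", demo_map)
print(demo_ids)
```

With a real vocab.txt, the lines could be read with open(vocab_file, encoding="utf-8") and passed to build_vocab_map in the same way.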

Model Details

Base Model: F5TTS_v1_Base
Fine-tuning Dataset: Combined German voice dataset with character-level tokenization
Training Steps: ~298,000 steps
Vocabulary Size: 2,546 characters
Model Size: ~1.3GB (inference-optimized)

Installation

# Install F5-TTS
pip install f5-tts

# Or install from source for latest features
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

Usage

Quick Start with Hugging Face Hub

import torch
import torchaudio
from f5_tts.api import F5TTS
from huggingface_hub import hf_hub_download

# Download model files from Hugging Face
model_file = hf_hub_download(
    repo_id="tabularisai/f5-tts-german-voice-clone",
    filename="model.pt"
)
vocab_file = hf_hub_download(
    repo_id="tabularisai/f5-tts-german-voice-clone", 
    filename="vocab.txt"
)

# Initialize the German F5-TTS model
f5tts = F5TTS(
    model="F5TTS_v1_Base",  # Use the base architecture
    ckpt_file=model_file,  # Downloaded model weights
    vocab_file=vocab_file,  # German vocabulary
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# German text to synthesize
text = "Hallo, ich bin ein deutsches Text-zu-Sprache-System. Wie kann ich Ihnen heute helfen?"

# Reference audio and its exact transcript
ref_audio_path = "reference_german_voice.wav"
ref_text = "Dies ist eine Referenzaufnahme für die Stimmenklonierung."

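# Generate speech; infer() returns the waveform, its sample rate, and the mel spectrogram
audio, sample_rate, spectrogram = f5tts.infer(
    ref_file=ref_audio_path,
    ref_text=ref_text,
    gen_text=text,
    remove_silence=True,
    file_wave="output_german.wav",  # Also write the result to disk
)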

Advanced Usage

# For longer texts, you can tune the inference parameters (works with both Hugging Face and local files)
audio, sample_rate, spectrogram = f5tts.infer(
    ref_file=ref_audio_path,
    ref_text=ref_text,
    gen_text=text,
    nfe_step=32,  # Number of function evaluations (more steps = slower, usually higher quality)
    cfg_strength=2.0,  # Classifier-free guidance strength
    sway_sampling_coef=-1.0,  # Sway sampling coefficient
    speed=1.0,  # Speaking rate (1.0 = normal speed)
    remove_silence=True,
    cross_fade_duration=0.15,  # Seconds of cross-fade between chunks, for smoother concatenation
)
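
To illustrate what the cross_fade_duration parameter above is for, here is a small sketch of joining two audio chunks with a linear cross-fade. It is not F5-TTS's internal implementation: the function name cross_fade, the linear fade shape, and the constant-valued demo chunks are assumptions chosen to keep the example self-contained.

```python
def cross_fade(first, second, sample_rate=24000, fade_seconds=0.15):
    """Join two sample lists, blending the overlapping region with a linear fade."""
    overlap = min(round(sample_rate * fade_seconds), len(first), len(second))
    if overlap == 0:
        return list(first) + list(second)
    blended = []
    for i in range(overlap):
        weight = i / overlap  # 0.0 at the start of the overlap, rising towards 1.0
        tail_sample = first[len(first) - overlap + i]
        head_sample = second[i]
        blended.append(tail_sample * (1.0 - weight) + head_sample * weight)
    return list(first[:-overlap]) + blended + list(second[overlap:])


# Two one-second chunks of constant level, standing in for generated audio.
chunk_a = [1.0] * 24000
chunk_b = [0.0] * 24000
joined = cross_fade(chunk_a, chunk_b)
print(len(joined))
```

Because the overlapping region is shared, the joined signal is shorter than the two chunks laid end to end by exactly the fade length.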

Command Line Usage

# Using the F5-TTS CLI with the German model
f5-tts_infer-cli \
    --model F5TTS_v1_Base \
    --ckpt_file path/to/model.pt \
    --vocab_file path/to/vocab.txt \
    --ref_audio reference_german.wav \
    --ref_text "Referenztext für die Stimme" \
    --gen_text "Zu synthetisierender deutscher Text" \
    --output_dir . \
    --output_file output_german.wav

Voice Cloning

The model clones a voice from a German reference recording and its matching transcript:

# Use a German reference voice
ref_audio = "my_german_voice_sample.wav"
ref_text = "Das ist ein Beispieltext meiner Stimme."

# Clone the voice for new German text
new_text = "Jetzt spreche ich mit der geklonten Stimme diesen neuen Text."
audio, sr, _ = f5tts.infer(ref_file=ref_audio, ref_text=ref_text, gen_text=new_text)
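
Before cloning, it can help to sanity-check the reference recording. The sketch below uses only Python's standard wave module to read the sample rate and duration of a PCM WAV file. The helper names and the max_seconds threshold are hypothetical and not a documented limit of this model; the demo writes a silent file to a temporary directory so the example runs without any real recording.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        frame_count = wav_file.getnframes()
    return sample_rate, frame_count / sample_rate


def is_reasonable_reference(path, max_seconds=15.0):
    """Hypothetical check: max_seconds is an illustrative threshold only."""
    _, duration = describe_wav(path)
    return 0.0 < duration <= max_seconds


# Demo: write two seconds of 16-bit mono silence at 24 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reference.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(1)
    demo.setsampwidth(2)
    demo.setframerate(24000)
    demo.writeframes(b"\x00\x00" * 48000)

demo_rate, demo_seconds = describe_wav(demo_path)
print(demo_rate, demo_seconds)
```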

Model Performance

Supported Text Features

✅ German characters and umlauts (ä, ö, ü, ß)
✅ Numbers and punctuation
✅ Special characters
✅ Mixed case text
⚠️ Limited support for non-German characters
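
Since characters outside the vocabulary are only partly supported, a quick pre-flight check on the input text can flag them before synthesis. The following is a sketch under stated assumptions: find_unsupported_chars is a hypothetical helper, and demo_vocab_chars is a hand-written stand-in for the characters in the real vocab.txt.

```python
import unicodedata


def find_unsupported_chars(text, vocab_chars):
    """Return distinct characters missing from vocab_chars, in order of first appearance."""
    normalized = unicodedata.normalize("NFC", text)  # Treat ä as one character, not a + diaeresis
    missing = []
    for ch in normalized:
        if ch not in vocab_chars and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical character set; in practice, build it from the lines of vocab.txt.
demo_vocab_chars = set(unicodedata.normalize(
    "NFC",
    "abcdefghijklmnopqrstuvwxyzäöüßABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ0123456789 .,!?",
))
missing = find_unsupported_chars("Größe: 5 m² für 3 €", demo_vocab_chars)
print(missing)
```

Characters reported this way can be replaced or spelled out in the input text before it is passed to the model.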

Audio Quality

Sample Rate: 24 kHz
Bit Depth: 16-bit
Quality: High-quality neural vocoding with Vocos
Latency: Real-time capable on modern GPUs
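
The figures above translate into simple arithmetic that is useful when reporting benchmarks. This sketch computes the real-time factor (synthesis time divided by audio duration) and the raw size of 16-bit PCM audio at 24 kHz. The function names and the timing in the demo are made up for illustration and are not measurements of this model.

```python
def audio_duration_seconds(num_samples, sample_rate=24000):
    """Duration of a mono signal with num_samples samples."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """A value below 1.0 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_duration_seconds(num_samples, sample_rate)


def pcm16_size_bytes(num_samples, channels=1):
    """Raw size of 16-bit PCM audio (2 bytes per sample per channel), without a file header."""
    return num_samples * channels * 2


# Hypothetical measurement: 240,000 samples (10 s of audio) generated in 2.5 s.
rtf = real_time_factor(2.5, 240000)
size = pcm16_size_bytes(240000)
print(rtf, size)
```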

Limitations and Known Issues

Language Specific: Optimized for German text only
Training Data: Limited to specific German voice datasets
Accent Variation: May not capture all German regional accents
Performance: Requires GPU for real-time inference
Development Status: Still in active development

Contributing and Feedback

We need your help! This model is still being refined and we're looking for:

🗣️ Audio Quality Feedback: How does the generated speech sound?
📝 Text Handling: Issues with specific German words or phrases?
🐛 Bug Reports: Technical issues or errors
💡 Feature Requests: What would make this model more useful?
📊 Performance Reports: Speed and quality benchmarks
🎯 Use Case Examples: How are you using this model?

How to Provide Feedback

GitHub Issues: Report bugs or request features in the original F5-TTS repository
Audio Samples: Share problematic or excellent generation examples
Benchmarks: Compare with other German TTS systems
Documentation: Help improve usage instructions

Model Card

Property	Value
Language	German (Deutsch)
Model Type	Text-to-Speech (Flow Matching)
Architecture	DiT (Diffusion Transformer)
Parameters	~336M (F5-TTS Base configuration)
Training Data	Combined German voice datasets
Vocabulary	2,546 character tokens
Sample Rate	24 kHz

Citation

If you use this model in your research, please cite the original F5-TTS paper:

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

Acknowledgments

Original F5-TTS team for the excellent framework
German voice dataset contributors
The open-source community for feedback and improvements

Contact

For questions, feedback, or collaboration:

Open an issue in the F5-TTS repository
Join the community discussions
Share your experiences with German TTS

Status: 🚧 Under Development - Seeking Community Feedback 🚧
