F5-TTS German Fine-tuned Model

Model: F5-TTS | Language: German | Hosted on Hugging Face

⚠️ Work in Progress: This model is still under development and optimization. We are actively seeking feedback from the community to improve its performance. Please share your experiences, issues, and suggestions!

Model Description

This is a version of F5-TTS fine-tuned for German, trained on German voice datasets. F5-TTS is a text-to-speech system built on a Diffusion Transformer (DiT) and trained with flow matching to produce natural-sounding speech. Like the base model, it clones a voice from a short reference recording and its transcript.

Key Features

  • Language: German text-to-speech synthesis
  • Architecture: DiT (Diffusion Transformer) with ConvNeXt V2
  • Sample Rate: 24 kHz
  • Vocoder: Vocos for high-quality audio generation
  • Tokenization: Custom character-level tokenization for German text
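
To make the character-level tokenization concrete, here is a minimal sketch. It assumes the vocabulary file lists one token per line and that a token's ID is its line index; the toy vocabulary, the helper names, and the fallback to ID 0 for unseen characters are illustrative assumptions, not a copy of the library's internals.

```python
from typing import Dict, List


def build_vocab_map(vocab_lines: List[str]) -> Dict[str, int]:
    """Map each vocabulary entry to its line index (one token per line)."""
    return {token: index for index, token in enumerate(vocab_lines)}


def encode(text: str, vocab_map: Dict[str, int], unknown_id: int = 0) -> List[int]:
    """Encode text one character at a time; unseen characters fall back to unknown_id."""
    return [vocab_map.get(char, unknown_id) for char in text]


# Toy vocabulary for illustration only; the real vocab.txt has 2,546 entries.
toy_vocab = [" ", "a", "h", "l", "o", "ä", "ö", "ü", "ß"]
vocab_map = build_vocab_map(toy_vocab)

ids = encode("hallo ö", vocab_map)
print(ids)  # [2, 1, 3, 3, 4, 0, 6]
```

Because every character is its own token, umlauts and ß need no special handling as long as they appear in the vocabulary.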

Model Details

  • Base Model: F5TTS_v1_Base
  • Fine-tuning Dataset: Combined German voice dataset with character-level tokenization
  • Training Steps: ~298,000 steps
  • Vocabulary Size: 2,546 characters
  • Model Size: ~1.3GB (inference-optimized)

Installation

# Install F5-TTS
pip install f5-tts

# Or install from source for latest features
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

Usage

Quick Start with Hugging Face Hub

import torch
import torchaudio
from f5_tts.api import F5TTS
from huggingface_hub import hf_hub_download

# Download model files from Hugging Face
model_file = hf_hub_download(
    repo_id="tabularisai/f5-tts-german-voice-clone",
    filename="model.pt"
)
vocab_file = hf_hub_download(
    repo_id="tabularisai/f5-tts-german-voice-clone", 
    filename="vocab.txt"
)

# Initialize the German F5-TTS model
f5tts = F5TTS(
    model="F5TTS_v1_Base",  # Use the base architecture
    ckpt_file=model_file,  # Downloaded model weights
    vocab_file=vocab_file,  # German vocabulary
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# German text to synthesize
text = "Hallo, ich bin ein deutsches Text-zu-Sprache-System. Wie kann ich Ihnen heute helfen?"

# Reference audio 
ref_audio_path = "reference_german_voice.wav"
ref_text = "Dies ist eine Referenzaufnahme für die Stimmenklonierung."

# Generate speech; infer() returns the waveform, its sample rate, and the mel spectrogram
audio, sample_rate, spectrogram = f5tts.infer(
    gen_text=text,
    ref_file=ref_audio_path,
    ref_text=ref_text,
    remove_silence=True,
    file_wave="output_german.wav",
)

Advanced Usage

# For longer texts, pass tuning parameters to the same infer() call
# (works with both Hugging Face and local files)
audio, sample_rate, spectrogram = f5tts.infer(
    ref_file=ref_audio_path,
    ref_text=ref_text,
    gen_text=text,
    nfe_step=32,  # Number of function evaluations (higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance strength
    sway_sampling_coef=-1.0,  # Sway sampling coefficient (negative shifts steps toward early flow steps)
    speed=1.0,  # Speaking rate (1.0 = normal speed)
    remove_silence=True,
    cross_fade_duration=0.15,  # Seconds of cross-fade between generated chunks
)
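
To give some intuition for `nfe_step` and `sway_sampling_coef`, the sketch below computes a sway-sampled timestep schedule in plain Python. It assumes the mapping t = u + s * (cos(pi/2 * u) - 1 + u) applied to uniformly spaced points u in [0, 1], as described in the F5-TTS paper; the function names and the choice of `nfe_step + 1` grid points are assumptions made for illustration.

```python
import math
from typing import List


def sway_sample(u: float, coef: float) -> float:
    """Warp a uniform flow step u in [0, 1] with sway sampling coefficient coef."""
    return u + coef * (math.cos(math.pi / 2 * u) - 1 + u)


def timestep_schedule(nfe_step: int, coef: float) -> List[float]:
    """Return nfe_step + 1 warped timesteps running from 0 to 1."""
    uniform = [i / nfe_step for i in range(nfe_step + 1)]
    return [sway_sample(u, coef) for u in uniform]


schedule = timestep_schedule(nfe_step=32, coef=-1.0)

# With coef = -1 the uniform midpoint u = 0.5 maps to roughly 0.29,
# so more of the 32 steps are spent in the early part of the flow.
print(round(schedule[16], 4))
```

The endpoints stay at 0 and 1 for any coefficient, so changing `sway_sampling_coef` redistributes the steps without changing how many function evaluations are run.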

Command Line Usage

# Using the F5-TTS CLI with the German model
f5-tts_infer-cli \
    --model F5TTS_v1_Base \
    --ckpt_file path/to/model.pt \
    --vocab_file path/to/vocab.txt \
    --ref_audio reference_german.wav \
    --ref_text "Referenztext für die Stimme" \
    --gen_text "Zu synthetisierender deutscher Text" \
    --output_dir . \
    --output_file output_german.wav

Voice Cloning

The model supports voice cloning with German reference audio:

# Use a German reference voice
ref_audio = "my_german_voice_sample.wav"
ref_text = "Das ist ein Beispieltext meiner Stimme."

# Clone the voice for new German text
new_text = "Jetzt spreche ich mit der geklonten Stimme diesen neuen Text."
audio, sr, _ = f5tts.infer(ref_file=ref_audio, ref_text=ref_text, gen_text=new_text)
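
Before cloning, it can help to check the length of the reference clip. The sketch below reads a PCM WAV file with the standard library and flags clips over a limit. The 12-second default follows the upstream F5-TTS recommendation to keep reference audio short and is an assumption here, as are the helper names.

```python
import wave
from typing import List

MAX_REF_SECONDS = 12.0  # Assumed limit; adjust to your own findings


def reference_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference(path: str, max_seconds: float = MAX_REF_SECONDS) -> List[str]:
    """Return a list of human-readable problems found in the reference clip."""
    problems = []
    duration = reference_duration_seconds(path)
    if duration <= 0:
        problems.append("reference clip is empty")
    elif duration > max_seconds:
        problems.append(
            f"reference clip is {duration:.1f}s long; consider trimming to {max_seconds:.0f}s or less"
        )
    return problems
```

Calling `check_reference("my_german_voice_sample.wav")` returns an empty list when the clip passes. The `wave` module only reads uncompressed WAV files, so other formats would need converting first.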

Model Performance

Supported Text Features

  • ✅ German characters and umlauts (ä, ö, ü, ß)
  • ✅ Numbers and punctuation
  • ✅ Special characters
  • ✅ Mixed case text
  • ⚠️ Limited support for non-German characters
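
Since support for non-German characters is limited, it may be worth scanning input text for characters missing from the vocabulary before synthesis. This is a minimal sketch with a toy vocabulary; the helper name is hypothetical, and loading the real `vocab.txt` as one token per line is an assumption.

```python
from typing import Iterable, List


def unsupported_characters(text: str, vocabulary: Iterable[str]) -> List[str]:
    """Return characters in text that are not in the vocabulary, in first-seen order."""
    known = set(vocabulary)
    missing: List[str] = []
    for char in text:
        if char not in known and char not in missing:
            missing.append(char)
    return missing


# Toy vocabulary for illustration; replace with the lines of the real vocab.txt.
toy_vocab = list(" abcdefghijklmnopqrstuvwxyzäöüß.,!?0123456789")

missing = unsupported_characters("grüße aus köln, señor!", toy_vocab)
print(missing)  # ['ñ']
```

Characters reported here could be transliterated or removed before calling the model, depending on the use case.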

Audio Quality

  • Sample Rate: 24 kHz
  • Bit Depth: 16-bit
  • Quality: High-quality neural vocoding with Vocos
  • Latency: Real-time capable on modern GPUs

Limitations and Known Issues

  • Language Specific: Optimized for German text only
  • Training Data: Limited to specific German voice datasets
  • Accent Variation: May not capture all German regional accents
  • Performance: Requires GPU for real-time inference
  • Development Status: Still in active development
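
Whether inference is real-time on a given machine can be measured with the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where values below 1.0 mean faster than real time. The sketch below is a generic timer; the `synthesize` callable is a placeholder for whatever inference call you use.

```python
import time
from typing import Callable, Sequence, Tuple


def real_time_factor(synthesize: Callable[[], Tuple[Sequence[float], int]]) -> float:
    """Time one synthesis call and return elapsed seconds per second of audio."""
    start = time.perf_counter()
    samples, sample_rate = synthesize()
    elapsed = time.perf_counter() - start
    if len(samples) == 0 or sample_rate <= 0:
        raise ValueError("synthesis returned no audio")
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds


def fake_synthesize() -> Tuple[Sequence[float], int]:
    """Stand-in for a real model call: one second of silence at 24 kHz."""
    time.sleep(0.01)
    return [0.0] * 24000, 24000


rtf = real_time_factor(fake_synthesize)
print(f"RTF: {rtf:.3f}")
```

For a real measurement, pass a function that runs the model and returns the waveform and sample rate, and average over several runs after a warm-up call, since the first call often includes one-time setup costs.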

Contributing and Feedback

We need your help! This model is still being refined and we're looking for:

  • 🗣️ Audio Quality Feedback: How does the generated speech sound?
  • 📝 Text Handling: Issues with specific German words or phrases?
  • 🐛 Bug Reports: Technical issues or errors
  • 💡 Feature Requests: What would make this model more useful?
  • 📊 Performance Reports: Speed and quality benchmarks
  • 🎯 Use Case Examples: How are you using this model?

How to Provide Feedback

  1. GitHub Issues: Report bugs or request features in the original F5-TTS repository
  2. Audio Samples: Share problematic or excellent generation examples
  3. Benchmarks: Compare with other German TTS systems
  4. Documentation: Help improve usage instructions

Model Card

Property      | Value
--------------|--------------------------------------------
Language      | German (Deutsch)
Model Type    | Text-to-Speech (Flow Matching)
Architecture  | DiT (Diffusion Transformer)
Parameters    | ~336M (F5-TTS Base configuration)
Training Data | Combined German voice datasets
Vocabulary    | 2,546 character tokens
Sample Rate   | 24 kHz

Citation

If you use this model in your research, please cite the original F5-TTS paper:

@article{chen2024f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Chen, Yushen and others},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024}
}

Acknowledgments

  • Original F5-TTS team for the excellent framework
  • German voice dataset contributors
  • The open-source community for feedback and improvements

Contact

For questions, feedback, or collaboration:

  • Open an issue in the F5-TTS repository
  • Join the community discussions
  • Share your experiences with German TTS
  • [email protected]

Status: 🚧 Under Development - Seeking Community Feedback 🚧
