|
|
--- |
|
|
library_name: chatterbox |
|
|
tags: |
|
|
- chatterbox |
|
|
- text-to-speech |
|
|
- tts |
|
|
- german |
|
|
- kartoffel |
|
|
- speech generation |
|
|
- voice-cloning |
|
|
language: |
|
|
- de |
|
|
base_model: |
|
|
- ResembleAI/chatterbox |
|
|
pipeline_tag: text-to-speech |
|
|
license: cc-by-nc-nd-4.0 |
|
|
--- |
|
|
|
|
|
# Kartoffel-TTS (Based on Chatterbox) - German Text-to-Speech |
|
|
> The model is still in development and was only trained on 600k samples, without emotion classification, on my two RTX 3090s. I am currently preparing a larger dataset (>2.5M samples) and classifying the exaggeration levels.
|
|
|
|
|
## Updates |
|
|
- **v0.2**: |
|
|
- Added preview support for vocal expressions. Supported tags: `<haha>`, `<hahaha>`, `<hahahaha>`, `<chuckle>`, `<wuhuuu>`, `<wow>`, `<hmm_neugierig>`, `<hmph>`, `<huh>`, `<ohhh>`, `<oooh>`, `<ughh>`, `<eeehhh>`, `<aaaaaaah>`, `<aaach>`.
|
|
- Adjusted the file structure to match the original Chatterbox layout (original s3, ve, etc.). The only fine-tuned file is `t3_cfg.safetensors`; this should simplify usage with different libraries.
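Since unsupported tags are passed through as plain text, it can help to validate input before synthesis. Here is a minimal sketch; `check_expression_tags` is a hypothetical helper, not part of the Chatterbox API:

```python
import re

# Preview vocal-expression tags supported by v0.2 (taken from the list above).
SUPPORTED_TAGS = {
    "<haha>", "<hahaha>", "<hahahaha>", "<chuckle>", "<wuhuuu>", "<wow>",
    "<hmm_neugierig>", "<hmph>", "<huh>", "<ohhh>", "<oooh>", "<ughh>",
    "<eeehhh>", "<aaaaaaah>", "<aaach>",
}

def check_expression_tags(text: str) -> list[str]:
    """Return any angle-bracket tags in `text` that are not in the supported set."""
    return [tag for tag in re.findall(r"<[^<>]+>", text) if tag not in SUPPORTED_TAGS]

text = "Das war <haha> wirklich ein toller Abend, <wow> einfach unglaublich!"
print(check_expression_tags(text))          # [] -> all tags supported
print(check_expression_tags("Oh <lol> nein"))  # ['<lol>'] -> unsupported tag flagged
```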
|
|
|
|
|
|
|
|
|
|
|
<video src="https://huggingface.co/SebastianBodza/Kartoffelbox-v0.1/resolve/main/demo_kartoffelbox.mp4" alt="Demo Video" width="400" controls></video> |
|
|
|
|
|
<div style="display: flex;align-items: center; gap: 12px"> |
|
|
<a target="_blank" href="https://huggingface.co/spaces/SebastianBodza/Kartoffelbox"> |
|
|
<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/> |
|
|
</a> |
|
|
|
|
|
<a href="https://colab.research.google.com/drive/1ZNT08zrEuAeuH3VrsaMHeeqZFcZR8sHU?usp=sharing" rel="nofollow"> |
|
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
## Background

- The model has been rebuilt on **Chatterbox**, Resemble AI's open-source TTS framework, which enables **emotion exaggeration control** and improves generation stability.
|
|
|
|
|
## Model Overview |
|
|
|
|
|
Kartoffel-TTS is a German text-to-speech (TTS) model family based on **Chatterbox**, designed for natural and expressive speech synthesis. The model supports **emotion exaggeration control** and voice cloning.
|
|
|
|
|
### Key Features: |
|
|
1. **Emotion Exaggeration Control**: Adjust the intensity of emotions in speech, from subtle to dramatic. |
|
|
2. **Expressive Speech**: Capable of producing speech with different emotional tones and expressions. |
|
|
3. **Fine-Tuned for German**: Optimized for German language synthesis with a focus on naturalness and clarity. |
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
Install the required libraries: |
|
|
|
|
|
```bash |
|
|
pip install chatterbox-tts |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage Example |
|
|
|
|
|
Here’s how to generate speech using Kartoffel-TTS: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import soundfile as sf |
|
|
from chatterbox.tts import ChatterboxTTS |
|
|
from huggingface_hub import hf_hub_download |
|
|
from safetensors.torch import load_file |
|
|
|
|
|
MODEL_REPO = "SebastianBodza/Kartoffelbox-v0.1" |
|
|
T3_CHECKPOINT_FILE = "t3_cfg.safetensors" |
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
|
|
model = ChatterboxTTS.from_pretrained(device=device) |
|
|
|
|
|
print("Downloading and applying German patch...") |
|
|
checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename=T3_CHECKPOINT_FILE) |
|
|
|
|
|
t3_state = load_file(checkpoint_path, device="cpu") |
|
|
|
|
|
model.t3.load_state_dict(t3_state) |
|
|
print("Patch applied successfully.") |
|
|
|
|
|
|
|
|
text = "Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand." |
|
|
|
|
|
reference_audio_path = "/content/uitoll.mp3"  # path to a reference clip of the target voice
|
|
output_path = "output_cloned_voice.wav" |
|
|
|
|
|
print("Generating speech...") |
|
|
with torch.inference_mode(): |
|
|
wav = model.generate( |
|
|
text, |
|
|
audio_prompt_path=reference_audio_path, |
|
|
exaggeration=0.5, |
|
|
temperature=0.6, |
|
|
cfg_weight=0.3, |
|
|
) |
|
|
|
|
|
sf.write(output_path, wav.squeeze().cpu().numpy(), model.sr) |
|
|
print(f"Audio saved to {output_path}") |
|
|
``` |
|
|
|
|
|
## Contributing |
|
|
|
|
|
To improve the model further, additional high-quality German audio data with good transcripts is needed, especially for sounds like laughter, sighs, and other non-verbal expressions. Short audio clips (up to 60 seconds) with accurate transcriptions are particularly valuable.
|
|
|
|
|
For those with ideas or access to relevant data, collaboration opportunities are always welcome. Reach out to discuss potential contributions. |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
This model builds on the following technologies: |
|
|
- **Chatterbox** by Resemble AI |
|
|
- **CosyVoice**
|
|
- **HiFT-GAN** |
|
|
- **Llama** |
|
|
- **S3Tokenizer** |