---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-to-speech
---

<div align="center">
  <div>&nbsp;</div>
  <img src="logo.jpeg" width="300"/> <br>
</div>

<p align="center">
  <a href="https://huggingface.co/shichaog/MeloVC">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue">
  </a>
  <a href="LICENSE">
    <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green.svg">
  </a>
  <a href="#">
    <img alt="Python" src="https://img.shields.io/badge/Python-3.9+-blue.svg">
  </a>
</p>

## MeloVC

**MeloVC** is a text-to-speech (TTS) project modified from [MeloTTS](https://github.com/myshell-ai/MeloTTS), focusing on high-quality **bilingual (Chinese-English)** speech synthesis. It implements zero-shot voice cloning through **speaker embeddings**.

Unlike the original MeloTTS, this project no longer uses `speaker_id`. Instead, it controls the timbre and style of the generated speech by extracting a speaker embedding (voiceprint) from any given reference audio clip.

## ✨ Key Features

- **High-Quality Speech Synthesis:** All models are trained at a 44.1 kHz sampling rate, delivering clear, natural-sounding audio.
- **Focused Language Support:** The project specializes in Chinese and English, removing support for other languages to simplify the model architecture.
- **Zero-Shot Voice Cloning:** Uses 192-dimensional speaker embeddings extracted with `speechbrain/spkrec-ecapa-voxceleb`. A reference clip of only about 3 seconds is enough for zero-shot voice cloning, and generation with a default voice (no reference audio) is also supported (see the embedding sketch after this list).
- **Bilingual Chinese-English Support:** Optimized for mixed Chinese and English text, resulting in more accurate and fluent pronunciation.
- **Easy-to-Use Inference Interface:** Perform inference simply by providing `text` plus either a `reference audio` or a `pre-extracted speaker embedding`.
- **Open-Source Pre-trained Multi-Speaker Model:** Trained for 72 hours (3 days × 24 hours) on a single V100 GPU. Available on [HuggingFace](https://huggingface.co/shichaog/MeloVC/).
- **Datasets:** Trained on a combination of datasets to optimize performance in specific scenarios.
  - A mix of 200+ hours from the open-source VoxBox dataset and 16 hours of private data, improving generalization and timbre diversity. **Note:** Because the amount of data is limited, cloning quality may vary across voices.
  - AISHELL-3: 85 hours
  - Hi-Fi TTS: 90 hours
  - RAVDESS: 1 hour
  - VCTK: 41 hours
  - Private data: 16 hours

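For reference, the snippet below is a minimal sketch of how a 192-dimensional speaker embedding can be extracted with the SpeechBrain model named above. The file paths are placeholders, and the exact preprocessing used inside MeloVC may differ.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier
# On newer SpeechBrain releases the import is:
# from speechbrain.inference.speaker import EncoderClassifier

# Load the pre-trained ECAPA-TDNN speaker encoder (192-dimensional embeddings).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load the reference clip (placeholder path), mix down to mono,
# and resample to the 16 kHz rate the encoder expects.
signal, sr = torchaudio.load("reference.wav")
signal = signal.mean(dim=0, keepdim=True)
signal = torchaudio.functional.resample(signal, sr, 16000)

embedding = encoder.encode_batch(signal)        # shape: (1, 1, 192)
print(embedding.squeeze(0).squeeze(0).shape)    # torch.Size([192])
```
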
## 🚀 Quick Start

### 1. Environment Setup

First, clone this repository and install the required dependencies.

```
git clone https://github.com/shichaog/MeloVC.git
cd MeloVC
pip install -e .
python -m unidic download
```

### 2. Inference Examples

#### Command Line

Voice cloning (using a reference audio):

```
python infer.py --text "I'm learning machine learning recently, and I hope to make some achievements in the field of artificial intelligence in the future." --ref_audio_path /path/to/your/reference.wav -m /path/to/G_XXXX.pth -o ./cloned_output.wav
```

Non-cloning (using the default voice):

```
python infer.py --text "I'm learning machine learning recently, and I hope to make some achievements in the field of artificial intelligence in the future." -m /path/to/G_XXXX.pth -o ./default_output.wav
```

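If you need to synthesize many sentences with the same cloned voice, a small wrapper around the CLI shown above is often enough. The sketch below only shells out to `infer.py` with the flags documented in this section; the checkpoint and reference paths are placeholders.

```python
import subprocess

CHECKPOINT = "/path/to/G_XXXX.pth"          # placeholder checkpoint path
REF_AUDIO = "/path/to/your/reference.wav"   # placeholder reference clip

texts = [
    "I'm learning machine learning recently.",
    "我最近在学习机器学习，希望未来能在人工智能领域有所成就。",
]

for i, text in enumerate(texts):
    # Reuses the documented flags: --text, --ref_audio_path, -m, -o.
    subprocess.run(
        ["python", "infer.py",
         "--text", text,
         "--ref_audio_path", REF_AUDIO,
         "-m", CHECKPOINT,
         "-o", f"./cloned_{i}.wav"],
        check=True,
    )
```
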
## 🔧 Train Your Own Model

If you want to train a model on your own dataset or fine-tune the existing model, follow these steps:

### 1. Setup Environment

Before training, install MeloVC in editable (developer) mode and switch to the `melovc` directory:

```shell
pip install -e .
cd melovc
```

### 2. Data Preparation

Prepare your dataset and create a `metadata.list` file with the following format:

```
path/to/audio1.wav|LANGUAGE-CODE|This is the first text.
path/to/audio2.wav|LANGUAGE-CODE|这是第二段文本。
...
```

- Language codes:
  - Chinese only: `ZH`
  - English only: `EN`
  - Mixed Chinese and English: `ZH_MIX_EN`
- Audio format: WAV files at a 44.1 kHz sampling rate are recommended.
- Text: Make sure the text matches the audio content. It is good practice to clean the text and verify its accuracy with an ASR model such as Whisper.
- Recommendations for best results:
  - Single-speaker model: at least 10 hours of high-quality audio.
  - Large multi-speaker model: the more data the better, since per-speaker duration is harder to track.

An example can be found at `data/example/metadata.list`.

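If your corpus is organized as audio files with matching transcript files, a short script can assemble `metadata.list` in the format above. The layout assumed below (one `.txt` transcript next to each `.wav`) is only an example; adjust the paths and language code to your data.

```python
from pathlib import Path

DATA_DIR = Path("data/my_speaker")   # hypothetical dataset directory
LANG = "ZH_MIX_EN"                   # or "ZH" / "EN", per the codes above

lines = []
for wav in sorted(DATA_DIR.glob("*.wav")):
    txt = wav.with_suffix(".txt")    # transcript expected next to the audio
    if not txt.exists():
        continue
    text = txt.read_text(encoding="utf-8").strip()
    lines.append(f"{wav}|{LANG}|{text}")

(DATA_DIR / "metadata.list").write_text("\n".join(lines) + "\n", encoding="utf-8")
```
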
Once your data is ready, run the preprocessing script:

```
python preprocess_text.py --metadata path/to/metadata.list --config_path path/to/config.json
```

This pre-computes the BERT features, spectrograms, and speaker embeddings needed for training, which significantly speeds up the process. After processing, `config.json`, `train.list`, and `val.list` files are generated in the same directory as your `metadata.list`; they contain the model, training, and data configurations.

### 3. Modify the Configuration File

Copy and modify the `configs/config.json` file, paying close attention to the following sections (a small editing sketch follows the list):

- `data` -> `training_files`: point this to your `train.list` file.
- `data` -> `embedding_dir`: point this to the directory containing the speaker embeddings (if pre-computed).
- `train`: adjust training parameters such as `batch_size`, `epochs`, etc.

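As a convenience, these fields can also be patched programmatically. The sketch below assumes the key names listed above and uses placeholder paths and values; all other fields in the copied config are left untouched.

```python
import json
from pathlib import Path

cfg = json.loads(Path("configs/config.json").read_text(encoding="utf-8"))

# Keys follow the bullet list above; values are placeholders.
cfg["data"]["training_files"] = "data/my_speaker/train.list"
cfg["data"]["embedding_dir"] = "data/my_speaker/embeddings"  # only if pre-computed
cfg["train"]["batch_size"] = 16                              # fit to your GPU memory

Path("data/my_speaker/config.json").write_text(
    json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8"
)
```
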
### 4. Start Training

```
bash train.sh <path/to/config.json> <num_of_gpus>
# Example:
bash train.sh path/to/your/config.json 1
```

This creates a `logs` directory in the current path containing training logs and model checkpoints; you can monitor training progress with TensorBoard. During training, the script downloads the necessary model files from [HuggingFace](https://huggingface.co/shichaog/MeloVC/).

## Author

- [shichaog](https://github.com/shichaog/)

If you find this project useful, please consider contributing to its future development.

## 📜 License

This project is licensed under the Apache 2.0 License.

## 🙏 Acknowledgements

- Special thanks to the teams behind [MeloTTS](https://github.com/myshell-ai/MeloTTS), [VITS](https://github.com/jaywalnut310/vits), [VITS2](https://github.com/daniilrobnikov/vits2), and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) for their foundational work.
- Thanks to [SpeechBrain](https://github.com/speechbrain/speechbrain) for the powerful pre-trained speaker embedding extraction model.
- Thanks to the [SparkAudio](https://github.com/SparkAudio/VoxBox) team for making the VoxBox dataset publicly available.