hynt committed on
Commit 4b65bde · verified · 1 Parent(s): 0b17e0c

Upload README.md

Files changed (1)
  1. README.md +63 -3
README.md CHANGED
@@ -1,3 +1,63 @@
- ---
- license: cc-by-nc-sa-4.0
- ---
+ ---
+ tags:
+ - text-to-speech
+ - vietnamese
+ - ai-model
+ - deep-learning
+ license: cc-by-nc-sa-4.0
+ library_name: pytorch
+ datasets:
+ - VLSP2021
+ - VLSP2022
+ - VLSP2023
+ - vietTTS
+ - UEH
+ model_name: ZipVoice-Vietnamese-150h
+ language: vi
+ ---
+
+ # 🛑 Important Note ⚠️
+ This model is intended for **research purposes** only.
+ **Access requests must be made using an institutional, academic, or corporate email**. Requests from public email providers will be denied. We appreciate your understanding.
+
+ # 🎙️ ZipVoice-Vietnamese-150h
+ ZipVoice is a series of fast, high-quality zero-shot TTS models based on flow matching.
+
+ Key features of the ZipVoice series:
+ 1. Small and fast: only 123M parameters.
+
+ 2. High-quality voice cloning: state-of-the-art performance in speaker similarity, intelligibility, and naturalness.
+
+ 3. Multi-lingual: supports Chinese and English.
+
+ 4. Multi-mode: supports both single-speaker and dialogue speech generation.
+
+ This checkpoint is a compact fine-tuned version of ZipVoice trained on 150 hours of Vietnamese speech.
+
+ 🔗 For more on fine-tuning and inference, visit: https://github.com/k2-fsa/ZipVoice.
+
+ 📜 **License:** [CC-BY-NC-SA-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0) (non-commercial research use only).
+
+ ---
+
+ ## 📌 Model Details
+
+ - **Dataset:** VLSP 2021, VLSP 2022, VLSP 2023, VietTTS, TeacherDinh-UEH, and some speech from YouTube channels.
+ - **Total dataset duration:** 150 hours
+ - **Data processing:**
+   - Background music is removed from all audio with Facebook's Demucs model: https://github.com/facebookresearch/demucs
+   - Audio files shorter than 1 second or longer than 30 seconds are excluded.
+   - Punctuation marks are kept unchanged.
+   - Text is normalized to lowercase.
+ - **Training Configuration:**
+   - **Base Model:** ZipVoice with the espeak-ng tokenizer (language `vi`)
+   - **GPU:** RTX 3090
+   - **Batch Size:** Max duration 200
+   - **Training Progress:** Stopped at **96,000 steps (epoch 30)**
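
The text and duration rules listed under data processing (exclude clips shorter than 1 second or longer than 30 seconds, keep punctuation, lowercase the text) can be sketched in Python as follows. This is a minimal illustration, not the author's actual preprocessing script: the function names, the `(path, duration, text)` record layout, and the choice to treat both bounds as inclusive are assumptions, and the Demucs music-removal step is not shown.

```python
from typing import Iterable, List, Tuple

MIN_SECONDS = 1.0   # lower bound stated in the model card
MAX_SECONDS = 30.0  # upper bound stated in the model card


def normalize_text(text: str) -> str:
    """Lowercase the transcript; punctuation marks are left untouched."""
    return text.lower()


def keep_utterance(duration_seconds: float) -> bool:
    """Keep clips of 1 to 30 seconds (inclusive bounds are an assumption)."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS


def prepare(samples: Iterable[Tuple[str, float, str]]) -> List[Tuple[str, str]]:
    """Filter hypothetical (path, duration, text) records by duration
    and return (path, normalized_text) pairs for the clips that remain."""
    return [
        (path, normalize_text(text))
        for path, duration, text in samples
        if keep_utterance(duration)
    ]
```

For example, `prepare([("b.wav", 3.2, "Xin Chào, Việt Nam!")])` keeps the clip and lowercases its transcript while leaving the comma and exclamation mark in place.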
+
+ ---
+
+ ## 🛑 Update Note
+ Thank you to Teacher Định from the University of Economics Ho Chi Minh City (UEH) for providing an additional 50 hours of high-quality labeled data.
+
+ His contact: https://www.facebook.com/luudinhit93