jahid committed · Commit bc4f968 · 1 parent: 66d626c

Initial upload of XCodec2 retrained model

Files changed (2):

1. LICENSE (+1 −0)
2. README.md (+166 −0)

LICENSE (added): `CC-BY-NC-4.0`

README.md (added):
---
license: cc-by-nc-4.0
tags:
- audio
- codec
- speech
- xcodec2
- text-to-speech
- multilingual
language:
- en
- ja
- zh
- bn
- fr
- de
- ko
---
# 🗣️ XCodec2 Retrained (Multilingual, 100K Hours)

This model is a retrained version of [HKUSTAudio/xcodec2](https://huggingface.co/HKUSTAudio/xcodec2), trained on a 100K-hour multilingual dataset covering 7 languages. It is optimized for speech representation learning, compression, and high-fidelity reconstruction, and is particularly useful for TTS and bandwidth-efficient speech synthesis.

---

## 📌 Overview

- **Base Model:** [HKUSTAudio/xcodec2](https://huggingface.co/HKUSTAudio/xcodec2)
- **Sampling Rate:** 16 kHz
- **Token Rate:** 50 tokens/second
- **Developed By:** [Verbex.ai (Hishab Technologies Ltd.)](https://verbex.ai)
- **Primary Use Case:** High-quality speech reconstruction and intermediate TTS representations
- **Training Time:** 11 days
- **Epochs:** 1
- **Compute:** 8× H100 80 GB
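At 50 tokens/second, the effective bitrate is easy to estimate. A minimal back-of-envelope sketch, assuming the single 65,536-entry codebook of the base xcodec2 model (16 bits per token) — that codebook size is an assumption carried over from the base model, not a measured property of this checkpoint:

```python
import math

# Assumed codec parameters: 50 tokens/s (stated above) and a single
# 65,536-entry codebook (assumed from the base xcodec2 model).
TOKENS_PER_SECOND = 50
CODEBOOK_SIZE = 65536

bits_per_token = math.log2(CODEBOOK_SIZE)       # 16 bits per token
codec_bps = TOKENS_PER_SECOND * bits_per_token  # codec bitrate in bits/s

# Compare against raw 16 kHz, 16-bit PCM.
raw_bps = 16000 * 16
print(f"codec: {codec_bps:.0f} bps, raw PCM: {raw_bps} bps, "
      f"ratio: {raw_bps / codec_bps:.0f}x")
```

Under these assumptions the codec stream is about 800 bps, a roughly 320× reduction over raw 16-bit PCM at 16 kHz.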
---

## 🧪 Installation & Usage

This model requires the `xcodec2` package. We recommend a minimal, isolated environment:

```bash
# Create and activate a fresh environment
conda create -n xcodec2 python=3.9
conda activate xcodec2

# Install xcodec2 (choose one)
pip install xcodec2==0.1.5  # modified build, fewer dependencies (recommended for inference and LLASA fine-tuning)
# OR
pip install xcodec2==0.1.3  # original build, more stable during training
```

### Example Usage
```python
import torch
import soundfile as sf
import librosa
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "hishab/titu-xcodec2"  # Replace with the actual Hugging Face path
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# Load and preprocess the waveform: mono, 16 kHz, shape (1, T)
wav, sr = sf.read("test_bn.wav")
if sr != 16000:
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    sr = 16000
if wav.ndim > 1:  # downmix multi-channel audio to mono
    wav = wav.mean(axis=1)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).cuda()  # move to the model's device

# Encode to discrete codes, then decode back to audio
with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)
    recon_wav = model.decode_code(vq_code).cpu()

# Save output
sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
print("Done! Check reconstructed_bn.wav")
```
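Because `encode_code` returns integer token IDs, the codes can be stored very compactly between the encode and decode halves of a TTS pipeline. A minimal sketch, assuming a single codebook with at most 65,536 entries so each token fits in a `uint16`; the codes below are random stand-ins, not real model output:

```python
import numpy as np

# Simulated codes standing in for model.encode_code(...) output:
# 3 seconds of audio at 50 tokens/second -> 150 integer tokens.
codes = np.random.default_rng(0).integers(0, 65536, size=150)

# Pack as uint16: 2 bytes/token -> 100 bytes per second of speech.
packed = codes.astype(np.uint16)
packed.tofile("utterance_codes.bin")

# Later (e.g. in the decoder half of the pipeline), restore the tokens.
restored = np.fromfile("utterance_codes.bin", dtype=np.uint16).astype(np.int64)
assert np.array_equal(restored, codes)
print(f"{packed.nbytes} bytes for {len(codes) / 50:.1f} s of speech")
```

The `uint16` packing only round-trips losslessly if every code ID is below 65,536; with a larger codebook you would need `uint32` instead.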
---

## 🌍 Multilingual Training Dataset

| Language | Dataset(s) | Hours (K) |
|-----------|----------------------------------------|-----------|
| Japanese | EmiliaYODAS + Verbex JA TTS Dataset | 31.41 |
| English | EmiliaYODAS | 25.69 |
| Chinese | EmiliaYODAS | 12.50 |
| Bangla | Verbex Bengali TTS Dataset | 11.58 |
| French | EmiliaYODAS + MLangLibrispeech | 8.40 |
| German | EmiliaYODAS + MLangLibrispeech | 5.42 |
| Korean | EmiliaYODAS | 5.00 |
| **Total** | — | **100** |
---

## 📊 Reconstruction Evaluation

Reconstruction metrics are computed over 100 samples each for English, Japanese, and Bangla, comparing this retrained model (`XCODEC2 Ours`) against the baselines XCODEC, SNAC, and NEMO.

**Evaluation Test Sets:**
- English: 100 examples (Emilia Dataset @ 24 kHz)
- Japanese: 100 examples (Emilia Dataset @ 24 kHz)
- Bangla: 100 examples (in-house TTS Dataset @ 22.05 kHz)

| Model | Lang | MCD ↓ | MSE ↓ | BERTScore ↑ | BLEU ↑ | TokenDist ↑ |
|-------------------|------|--------|--------|-------------|--------|-------------|
| **XCODEC** | BN | 2.823 | 0.003 | 0.939 | 0.500 | 0.816 |
| | EN | 3.166 | 0.012 | 0.962 | 0.660 | 0.856 |
| | JA | 3.021 | 0.010 | 0.948 | 0.582 | 0.838 |
| **Overall** | | 3.003 | 0.008 | 0.949 | 0.581 | 0.837 |
| **XCODEC2 (Ours)** | BN | 2.712 | 0.003 | 0.940 | 0.508 | 0.817 |
| | EN | 3.206 | 0.014 | 0.957 | 0.644 | 0.851 |
| | JA | 3.022 | 0.012 | 0.946 | 0.573 | 0.838 |
| **Overall** | | 2.980 | 0.010 | 0.948 | 0.575 | 0.835 |
| **hubertsiuzdak/snac_24khz** | BN | 3.104 | 0.002 | 0.911 | 0.442 | 0.785 |
| | EN | 3.983 | 0.014 | 0.912 | 0.541 | 0.797 |
| | JA | 3.512 | 0.009 | 0.903 | 0.472 | 0.761 |
| **Overall** | | 3.533 | 0.008 | 0.909 | 0.485 | 0.781 |
| **nvidia/low-frame-rate-speech-codec-22khz** | BN | 2.247 | 0.000 | 0.957 | 0.589 | 0.863 |
| | EN | 2.867 | 0.007 | 0.969 | 0.707 | 0.872 |
| | JA | 2.677 | 0.003 | 0.955 | 0.614 | 0.853 |
| **Overall** | | 2.597 | 0.003 | 0.960 | 0.636 | 0.863 |
---

## ✅ Intended Use

This model is suitable for:

- Speech tokenization in TTS pipelines
- Low-bitrate speech compression
- Representation learning and fine-tuning (e.g., LLASA-style)
- Code-based speech synthesis and generation tasks
---

## 🚫 Limitations

- Licensed for **non-commercial use only**

---

## 📄 License

This model is licensed under **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**.
Commercial usage is **not allowed**.

- SPDX Identifier: `CC-BY-NC-4.0`
- License Details: [https://creativecommons.org/licenses/by-nc/4.0](https://creativecommons.org/licenses/by-nc/4.0)

---

## 📬 Contact

For research collaborations, feedback, or commercial licensing inquiries, please reach out to:

**📧 Email:** [[email protected]]
**🌐 Website:** [https://verbex.ai](https://verbex.ai)

---