Initial upload of XCodec2 retrained model

Files changed (2)

LICENSE +1 -0
README.md +166 -0

LICENSE ADDED

	@@ -0,0 +1 @@


1	+ CC-BY-NC-4.0

README.md ADDED

	@@ -0,0 +1,166 @@

+---
+license: cc-by-nc-4.0
+tags:
+  - audio
+  - codec
+  - speech
+  - xcodec2
+  - text-to-speech
+  - multilingual
+language:
+  - en
+  - ja
+  - zh
+  - bn
+  - fr
+  - de
+  - ko
+---
+# 🗣️ XCodec2 Retrained (Multilingual, 100K Hours)
+This model is a retrained version of [HKUSTAudio/xcodec2](https://huggingface.co/HKUSTAudio/xcodec2), trained on about 100K hours of multilingual speech covering 7 languages. It is optimized for speech representation learning, compression, and high-fidelity reconstruction, and is particularly useful for TTS and bandwidth-efficient speech synthesis.
+---
+## 📌 Overview
+- **Base Model:** [HKUSTAudio/xcodec2](https://huggingface.co/HKUSTAudio/xcodec2)
+- **Sampling Rate:** 16 kHz
+- **Token Rate:** 50 tokens/second
+- **Developed By:** [Verbex.ai (Hishab Technologies Ltd.)](https://verbex.ai)
+- **Primary Use Case:** High-quality speech reconstruction and intermediate TTS representations
+- **Training Time:** 11 days
+- **Epochs:** 1
+- **Compute:** 8xH100 80GB
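
The sampling rate and token rate listed above imply a fixed number of audio samples per token. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the token count the model actually returns may differ slightly depending on how it pads the input.

```python
SAMPLE_RATE_HZ = 16_000   # sampling rate from the overview
TOKENS_PER_SECOND = 50    # token rate from the overview


def samples_per_token(sample_rate=SAMPLE_RATE_HZ, token_rate=TOKENS_PER_SECOND):
    """Audio samples covered by one token, assuming a fixed hop size."""
    return sample_rate // token_rate


def approx_token_count(duration_s, token_rate=TOKENS_PER_SECOND):
    """Approximate number of tokens for a clip; ignores any model-side padding."""
    return int(duration_s * token_rate)


print(samples_per_token())       # 320
print(approx_token_count(10.0))  # 500
```

Under these assumptions, each token summarizes 20 ms of 16 kHz audio.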
+---
+## 🧪 Installation & Usage
+This model requires `xcodec2`. We recommend using a minimal setup:
+```bash
+# Create environment
+conda create -n xcodec2 python=3.9
+conda activate xcodec2
+# Install xcodec2 (choose one)
+pip install xcodec2==0.1.5  # Modified, fewer dependencies (recommended for inference and LLASA fine-tuning)
+# OR
+pip install xcodec2==0.1.3  # Original, more stable during training
+```
+### Example Usage
+```python
+import torch
+import soundfile as sf
+from xcodec2.modeling_xcodec2 import XCodec2Model
+model_path = "hishab/titu-xcodec2"  # Replace with actual Hugging Face path
+model = XCodec2Model.from_pretrained(model_path)
+model.eval().cuda()
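+# Load the waveform; soundfile returns shape (frames,) or (frames, channels)
+wav, sr = sf.read("test_bn.wav")
+# Downmix to mono first so that resampling runs along the time axis
+if wav.ndim > 1:
+    wav = wav.mean(axis=1)
+# The model expects 16 kHz input
+if sr != 16000:
+    import librosa
+    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
+    sr = 16000
+wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)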
+# Encode and decode
+with torch.no_grad():
+    vq_code = model.encode_code(input_waveform=wav_tensor)
+    print("Code:", vq_code)
+    recon_wav = model.decode_code(vq_code).cpu()
+# Save output
+sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
+print("Done! Check reconstructed_bn.wav")
+```
+---
+## 🌍 Multilingual Training Dataset
+| Language  | Dataset(s)                            | Hours (K) |
+|-----------|----------------------------------------|-----------|
+| Japanese  | EmiliaYODAS + Verbex JA TTS Dataset    | 31.41     |
+| English   | EmiliaYODAS                            | 25.69     |
+| Chinese   | EmiliaYODAS                            | 12.50     |
+| Bangla    | Verbex Bengali TTS Dataset             | 11.58     |
+| French    | EmiliaYODAS + MLangLibrispeech         | 8.40      |
+| German    | EmiliaYODAS + MLangLibrispeech         | 5.42      |
+| Korean    | EmiliaYODAS                            | 5.00      |
+| **Total** | —                                      | **100**   |
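
To see how the training data is distributed, the per-language hours in the table can be turned into shares of the total. This is a small illustrative sketch using only the figures listed above; the function name is hypothetical.

```python
import math

# Hours in thousands, copied from the table above
hours_k = {
    "Japanese": 31.41,
    "English": 25.69,
    "Chinese": 12.50,
    "Bangla": 11.58,
    "French": 8.40,
    "German": 5.42,
    "Korean": 5.00,
}


def language_shares(hours):
    """Return each language's fraction of the total training hours."""
    total = sum(hours.values())
    return {lang: h / total for lang, h in hours.items()}


# The per-language rows should add up to the reported total of 100K hours
assert math.isclose(sum(hours_k.values()), 100.0, abs_tol=1e-6)

shares = language_shares(hours_k)
for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<9} {share:6.1%}")
```

Japanese and English together make up roughly 57% of the data, while German and Korean each contribute about 5%.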
+---
+## 📊 Reconstruction Evaluation
+Reconstruction metrics are computed over 100 samples each for English, Japanese, and Bangla, comparing this retrained model (`XCODEC2 (Ours)`) against three baselines: XCODEC, SNAC (`hubertsiuzdak/snac_24khz`), and NeMo (`nvidia/low-frame-rate-speech-codec-22khz`).
+**Evaluation Test Sets:**
+- English: 100 Examples (Emilia Dataset @ 24 kHz)
+- Japanese: 100 Examples (Emilia Dataset @ 24 kHz)
+- Bangla: 100 Examples (Inhouse TTS Dataset @ 22.05 kHz)
+| Model             | Lang | MCD ↓   | MSE ↓   | BERTScore ↑ | BLEU ↑  | TokenDist ↑ |
+|-------------------|------|--------|--------|-------------|--------|-------------|
+| **XCODEC**        | BN   | 2.823  | 0.003  | 0.939       | 0.500  | 0.816       |
+|                   | EN   | 3.166  | 0.012  | 0.962       | 0.660  | 0.856       |
+|                   | JA   | 3.021  | 0.010  | 0.948       | 0.582  | 0.838       |
+| **Overall**           |     | 3.003  | 0.008  | 0.949       | 0.581  | 0.837       |
+| **XCODEC2 (Ours)** | BN   | 2.712  | 0.003  | 0.940       | 0.508  | 0.817       |
+|                   | EN   | 3.206  | 0.014  | 0.957       | 0.644  | 0.851       |
+|                   | JA   | 3.022  | 0.012  | 0.946       | 0.573  | 0.838       |
+| **Overall**           |     | 2.980  | 0.010  | 0.948       | 0.575  | 0.835       |
+| **hubertsiuzdak/snac_24khz**  | BN   | 3.104  | 0.002  | 0.911       | 0.442  | 0.785       |
+|                   | EN   | 3.983  | 0.014  | 0.912       | 0.541  | 0.797       |
+|                   | JA   | 3.512  | 0.009  | 0.903       | 0.472  | 0.761       |
+| **Overall**           |     | 3.533  | 0.008  | 0.909       | 0.485  | 0.781       |
+| **nvidia/low-frame-rate-speech-codec-22khz**  | BN   | 2.247  | 0.000  | 0.957       | 0.589  | 0.863       |
+|                   | EN   | 2.867  | 0.007  | 0.969       | 0.707  | 0.872       |
+|                   | JA   | 2.677  | 0.003  | 0.955       | 0.614  | 0.853       |
+| **Overall**           |     | 2.597  | 0.003  | 0.960       | 0.636  | 0.863       |
+---
+## ✅ Intended Use
+This model is suitable for:
+- Speech tokenization in TTS pipelines
+- Low-bitrate speech compression
+- Representation learning and fine-tuning (e.g., LLASA-style)
+- Code-based speech synthesis or generation tasks
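
For the low-bitrate compression use case, the nominal bitrate of a single-codebook token stream is the token rate times the bits per token. The sketch below is a rough estimate under stated assumptions: the codebook size of 65536 is taken from descriptions of the upstream XCodec2 and is not stated in this card, so check the model configuration before relying on it. The 16-bit PCM reference is likewise an assumption chosen for comparison.

```python
import math

TOKENS_PER_SECOND = 50   # token rate from the overview
SAMPLE_RATE_HZ = 16_000  # sampling rate from the overview


def nominal_bitrate_bps(token_rate, codebook_size):
    """Bits per second for one token stream with a single codebook."""
    return token_rate * math.log2(codebook_size)


def pcm_bitrate_bps(sample_rate, bits_per_sample=16):
    """Bitrate of uncompressed mono PCM (16-bit depth assumed)."""
    return sample_rate * bits_per_sample


# ASSUMPTION: 65536-entry codebook, not confirmed by this model card
codec_bps = nominal_bitrate_bps(TOKENS_PER_SECOND, codebook_size=65536)
raw_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ)

print(f"codec: {codec_bps:.0f} bps, raw PCM: {raw_bps} bps, ratio: {raw_bps / codec_bps:.0f}x")
```

This ignores any entropy coding or container overhead, so treat it as an upper-level estimate rather than a measured figure.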
+---
+## 🚫 Limitations
+- Licensed for **non-commercial use only**
+---
+## 📄 License
+This model is licensed under **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**.
+Commercial usage is **not allowed**.
+- SPDX Identifier: `CC-BY-NC-4.0`
+- License Details: [https://creativecommons.org/licenses/by-nc/4.0](https://creativecommons.org/licenses/by-nc/4.0)
+---
+## 📬 Contact
+For research collaborations, feedback, or commercial licensing inquiries, please reach out to:
+**📧 Email:** [[email protected]]
+**🌐 Website:** [https://verbex.ai](https://verbex.ai)
+---