🗣️ XCodec2 Trained on 100K Hours of Multilingual Data

This is a retrained version of the XCodec2 neural audio codec by HKUSTAudio, trained on 100,000 hours of multilingual speech in seven languages. The model compresses speech into discrete tokens and reconstructs it, targeting low-bandwidth applications that still need high audio quality. Its discrete token outputs are well suited to LLM-based TTS, AudioLM, multimodal models, and speech-to-speech systems.


📌 Overview

  • Model Architecture: XCodec2
  • Sampling Rate: 16 kHz
  • Token Rate: 50 tokens/second
  • Developed By: Verbex.ai (Hishab Technologies Ltd.)
  • Primary Use Case: High-quality speech reconstruction and intermediate TTS representations
  • Training Time: 11 days (8× H100 80GB)
  • Epochs: 1
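
The sampling rate and token rate listed above imply a fixed hop of 16,000 / 50 = 320 audio samples per token. The sketch below works through that arithmetic. The helper names are illustrative, and the codebook size of 65,536 entries (16 bits per token) is an assumption carried over from the original XCodec2 release, not a figure stated in this card.

```python
import math

SAMPLE_RATE_HZ = 16_000   # from the overview above
TOKENS_PER_SECOND = 50    # from the overview above
CODEBOOK_SIZE = 65_536    # assumption: single codebook, as in the original XCodec2


def samples_per_token() -> int:
    """Number of input samples summarized by one token."""
    return SAMPLE_RATE_HZ // TOKENS_PER_SECOND


def expected_token_count(num_samples: int) -> int:
    """Approximate token count for a clip, rounding a partial frame up."""
    return math.ceil(num_samples / samples_per_token())


def bitrate_bps() -> float:
    """Bits per second if every token indexes one CODEBOOK_SIZE-entry codebook."""
    return TOKENS_PER_SECOND * math.log2(CODEBOOK_SIZE)


if __name__ == "__main__":
    print("samples per token:", samples_per_token())
    print("tokens for 3 s of audio:", expected_token_count(3 * SAMPLE_RATE_HZ))
    print("bitrate (bps):", bitrate_bps())
```

The actual number of tokens returned by the encoder may differ by one frame depending on how the library pads the input.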

🧪 Installation & Usage

This model requires xcodec2. We recommend using a minimal setup:

# Create environment
conda create -n xcodec2 python=3.9
conda activate xcodec2

# Install dependencies
pip install xcodec2==0.1.5
pip install numpy==1.26.4

Example Usage

import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "hishab/titu-xcodec2"  # Replace with actual Hugging Face path
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

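# Load the waveform; soundfile returns shape (frames, channels) for multi-channel audio
wav, sr = sf.read("test_bn.wav")

# Downmix to mono first so that resampling runs along the time axis
if wav.ndim > 1:
    wav = wav.mean(axis=1)

# Resample to the 16 kHz rate the model expects
if sr != 16000:
    import librosa  # only needed when the input is not already 16 kHz
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    sr = 16000

# Shape (1, num_samples): a batch containing one mono waveform
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)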

# Encode and decode
with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()

# Save output
sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
print("Done! Check reconstructed_bn.wav")
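
As a quick sanity check after running the example, the input and the reconstruction can be compared sample by sample. The helper below is a minimal sketch, not part of xcodec2: `waveform_mse` is a hypothetical name, and trimming both signals to their common length is an assumption made here because the decoder output may be slightly longer than the input.

```python
import numpy as np


def waveform_mse(reference: np.ndarray, reconstruction: np.ndarray) -> float:
    """Mean squared error over the overlapping samples of two mono waveforms."""
    n = min(len(reference), len(reconstruction))
    if n == 0:
        raise ValueError("both waveforms must contain at least one sample")
    ref = np.asarray(reference[:n], dtype=np.float64)
    rec = np.asarray(reconstruction[:n], dtype=np.float64)
    return float(np.mean((ref - rec) ** 2))
```

With the variables from the example, `waveform_mse(wav, recon_wav[0, 0].numpy())` gives a rough per-sample error. This is not necessarily how the MSE column in the evaluation section was computed.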

🌍 Multilingual Training Dataset

| Language | Dataset(s) | Hours (thousands) |
|---|---|---|
| Japanese | EmiliaYODAS + Verbex JA TTS Dataset | 31.41 |
| English | EmiliaYODAS | 25.69 |
| Chinese | EmiliaYODAS | 12.50 |
| Bangla | Verbex Bengali TTS Dataset | 11.58 |
| French | EmiliaYODAS + MLangLibrispeech | 8.40 |
| German | EmiliaYODAS + MLangLibrispeech | 5.42 |
| Korean | EmiliaYODAS | 5.00 |
| Total | | 100 |

📊 Reconstruction Evaluation

Reconstruction metrics are computed over 100 samples each for English, Japanese, and Bangla, comparing this retrained model (XCODEC2 Ours) against three baselines (XCODEC, SNAC, NEMO).

Evaluation Test Sets:

  • English: 100 examples (Emilia Dataset)
  • Japanese: 100 examples (Emilia Dataset)
  • Bangla: 100 examples (Verbex's in-house TTS dataset)
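
| Model | Lang | MCD ↓ | MSE ↓ | SpeechBERTScore ↑ | SpeechBLEU ↑ | SpeechTokenDist ↑ |
|---|---|---|---|---|---|---|
| XCODEC | BN | 2.823 | 0.003 | 0.939 | 0.500 | 0.816 |
| XCODEC | EN | 3.166 | 0.012 | 0.962 | 0.660 | 0.856 |
| XCODEC | JA | 3.021 | 0.010 | 0.948 | 0.582 | 0.838 |
| XCODEC | Overall | 3.003 | 0.008 | 0.949 | 0.581 | 0.837 |
| XCODEC2 (Ours) | BN | 2.712 | 0.003 | 0.940 | 0.508 | 0.817 |
| XCODEC2 (Ours) | EN | 3.206 | 0.014 | 0.957 | 0.644 | 0.851 |
| XCODEC2 (Ours) | JA | 3.022 | 0.012 | 0.946 | 0.573 | 0.838 |
| XCODEC2 (Ours) | Overall | 2.980 | 0.010 | 0.948 | 0.575 | 0.835 |
| hubertsiuzdak/snac_24khz | BN | 3.104 | 0.002 | 0.911 | 0.442 | 0.785 |
| hubertsiuzdak/snac_24khz | EN | 3.983 | 0.014 | 0.912 | 0.541 | 0.797 |
| hubertsiuzdak/snac_24khz | JA | 3.512 | 0.009 | 0.903 | 0.472 | 0.761 |
| hubertsiuzdak/snac_24khz | Overall | 3.533 | 0.008 | 0.909 | 0.485 | 0.781 |
| nvidia/low-frame-rate-speech-codec-22khz | BN | 2.247 | 0.000 | 0.957 | 0.589 | 0.863 |
| nvidia/low-frame-rate-speech-codec-22khz | EN | 2.867 | 0.007 | 0.969 | 0.707 | 0.872 |
| nvidia/low-frame-rate-speech-codec-22khz | JA | 2.677 | 0.003 | 0.955 | 0.614 | 0.853 |
| nvidia/low-frame-rate-speech-codec-22khz | Overall | 2.597 | 0.003 | 0.960 | 0.636 | 0.863 |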

SpeechBERTScore, SpeechBLEU and SpeechTokenDistance are calculated using https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics
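
The card does not state how the Overall rows are derived. For the MCD column they appear consistent with an unweighted mean of the three per-language scores, which the short check below illustrates using values copied from the table. The `macro_average` helper is illustrative only.

```python
# MCD values copied from the evaluation table (per language and reported Overall).
MCD = {
    "XCODEC": {"BN": 2.823, "EN": 3.166, "JA": 3.021, "Overall": 3.003},
    "XCODEC2 (Ours)": {"BN": 2.712, "EN": 3.206, "JA": 3.022, "Overall": 2.980},
    "hubertsiuzdak/snac_24khz": {"BN": 3.104, "EN": 3.983, "JA": 3.512, "Overall": 3.533},
    "nvidia/low-frame-rate-speech-codec-22khz": {"BN": 2.247, "EN": 2.867, "JA": 2.677, "Overall": 2.597},
}


def macro_average(scores: dict) -> float:
    """Unweighted mean of the per-language scores, ignoring the Overall entry."""
    per_language = [value for key, value in scores.items() if key != "Overall"]
    return sum(per_language) / len(per_language)


if __name__ == "__main__":
    for model, scores in MCD.items():
        print(f"{model}: mean={macro_average(scores):.3f} reported={scores['Overall']:.3f}")
```

Other columns can differ from this mean in the last digit, presumably because the reported averages were taken before rounding.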


✅ Intended Use

This model is suitable for:

  • Speech tokenization in TTS pipelines
  • Low-bitrate speech compression
  • Code-based speech synthesis or generation tasks
  • Modeling for multimodal LLMs, audio LMs, speech-to-speech systems, and similar applications
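
One common way to feed codec tokens to an LLM in a TTS pipeline is to append them to the text vocabulary by offsetting each code index. The sketch below illustrates only that idea. `TEXT_VOCAB_SIZE`, `CODEBOOK_SIZE`, and both helper functions are hypothetical and are not part of xcodec2 or of this model's release.

```python
from typing import List

TEXT_VOCAB_SIZE = 32_000  # hypothetical size of the LLM's text vocabulary
CODEBOOK_SIZE = 65_536    # assumption: codebook size of the original XCodec2


def codes_to_llm_ids(codes: List[int]) -> List[int]:
    """Map codec indices to LLM token IDs placed after the text vocabulary."""
    for code in codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"code {code} is outside the codebook range")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def llm_ids_to_codes(token_ids: List[int]) -> List[int]:
    """Invert codes_to_llm_ids, rejecting IDs that are not speech tokens."""
    codes = []
    for token_id in token_ids:
        code = token_id - TEXT_VOCAB_SIZE
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"token id {token_id} is not a speech token")
        codes.append(code)
    return codes
```

In practice the codes would come from flattening the tensor returned by `encode_code` into a Python list, and the recovered codes would be reshaped back into a tensor before calling `decode_code`.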

🚫 Limitations

  • Licensed for non-commercial use only

📄 License

This model is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
Commercial usage is not allowed.


📬 Contact

For research collaborations, feedback, or commercial licensing inquiries, please reach out to:

🌐 Website: https://verbex.ai
