🗣️ XCodec2 Trained on 100K Hours of Multilingual Data

This is a retrained version of HKUSTAudio's XCodec2 neural audio codec, trained on 100,000 hours of multilingual speech across seven languages: Japanese, English, Chinese, Bangla, French, German, and Korean. The model compresses 16 kHz speech into discrete tokens at 50 tokens per second and reconstructs it with high quality, which suits low-bandwidth audio applications. Its discrete token output is well suited to LLM-based TTS, AudioLM-style models, multimodal models, and speech-to-speech systems.

📌 Overview

Model Architecture: XCodec2
Sampling Rate: 16 kHz
Token Rate: 50 tokens/second
Developed By: Verbex.ai (Hishab Technologies Ltd.)
Primary Use Case: High-quality speech reconstruction and intermediate TTS representations
Training Time: 11 days (8× H100 80GB)
Epochs: 1
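
The sampling rate and token rate above imply some simple arithmetic that is useful when budgeting sequence lengths. The sketch below derives the number of samples per token and the approximate token count for a clip. The helper names are illustrative only, and the bitrate helper takes the codebook size as a parameter because this card does not state it; the 65,536 value in the example is an assumption based on the upstream XCodec2 design.

```python
import math

SAMPLE_RATE_HZ = 16_000   # from the overview above
TOKENS_PER_SECOND = 50    # from the overview above

def samples_per_token() -> int:
    """Number of input samples summarized by one token."""
    return SAMPLE_RATE_HZ // TOKENS_PER_SECOND

def tokens_for_duration(seconds: float) -> int:
    """Approximate token count for a clip of the given length."""
    return math.ceil(seconds * TOKENS_PER_SECOND)

def bitrate_bps(codebook_size: int) -> float:
    """Bits per second for a single-codebook token stream.

    The codebook size is an assumption supplied by the caller.
    """
    return TOKENS_PER_SECOND * math.log2(codebook_size)

print(samples_per_token())        # 320
print(tokens_for_duration(10.0))  # 500
print(bitrate_bps(65_536))        # 800.0, assuming a 65,536-entry codebook
```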

🧪 Installation & Usage

This model requires the xcodec2 package. We recommend a minimal setup:

# Create environment
conda create -n xcodec2 python=3.9
conda activate xcodec2

# Install dependencies
pip install xcodec2==0.1.5
pip install numpy==1.26.4

Example Usage

import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "hishab/titu-xcodec2"  # Replace with actual Hugging Face path
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

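# Load and preprocess waveform
wav, sr = sf.read("test_bn.wav")
# Downmix to mono first: soundfile returns (frames, channels) for
# multi-channel audio, and librosa resamples along the last axis
if wav.ndim > 1:
    wav = wav.mean(axis=1)
# The model expects 16 kHz input
if sr != 16000:
    import librosa
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    sr = 16000
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # shape: (1, T)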

# Encode and decode
with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()

# Save output
sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
print("Done! Check reconstructed_bn.wav")
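
For long recordings it can be convenient to process the audio in pieces. The sketch below is not part of xcodec2; it only plans sample ranges whose lengths are multiples of 320 samples (one token at 16 kHz and 50 tokens/second), so every chunk except possibly the last covers a whole number of token frames. The function name and the 30-second default are arbitrary choices for illustration.

```python
HOP = 320  # samples per token at 16 kHz and 50 tokens/second

def chunk_bounds(num_samples: int, chunk_seconds: float = 30.0, sr: int = 16_000):
    """Return (start, end) sample ranges for chunked processing.

    Every range length is a multiple of HOP, except possibly the last one.
    """
    chunk_len = int(chunk_seconds * sr) // HOP * HOP
    if chunk_len <= 0:
        raise ValueError("chunk_seconds is too short to hold one token")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]

bounds = chunk_bounds(16_000 * 70)  # a 70-second recording
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice wav[start:end] could then be encoded separately. Whether chunked encoding introduces audible artifacts at the boundaries has not been evaluated here, so compare against full-length encoding before relying on it.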

🌍 Multilingual Training Dataset

Language	Dataset(s)	Hours (thousands)
Japanese	EmiliaYODAS + Verbex JA TTS Dataset	31.41
English	EmiliaYODAS	25.69
Chinese	EmiliaYODAS	12.50
Bangla	Verbex Bengali TTS Dataset	11.58
French	EmiliaYODAS + MLangLibrispeech	8.40
German	EmiliaYODAS + MLangLibrispeech	5.42
Korean	EmiliaYODAS	5.00
Total	—	100
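
To see how the training data is balanced across languages, the hours in the table can be turned into percentage shares. This is a small illustrative calculation using only the figures listed above; the variable names are made up for the example.

```python
# Hours in thousands, copied from the table above
hours_k = {
    "Japanese": 31.41,
    "English": 25.69,
    "Chinese": 12.50,
    "Bangla": 11.58,
    "French": 8.40,
    "German": 5.42,
    "Korean": 5.00,
}

total_k = sum(hours_k.values())
shares = {lang: 100 * hours / total_k for lang, hours in hours_k.items()}

for lang, pct in shares.items():
    print(f"{lang:<9}{pct:5.1f}%")
```

Japanese and English together account for roughly 57% of the training hours, which is worth keeping in mind when comparing per-language reconstruction quality.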

📊 Reconstruction Evaluation

Reconstruction metrics are computed on 100 samples per language for English, Japanese, and Bangla, comparing this retrained model (XCODEC2 Ours) against three baselines: XCODEC, SNAC (hubertsiuzdak/snac_24khz), and NeMo (nvidia/low-frame-rate-speech-codec-22khz).

Evaluation Test Sets:

English: 100 examples (Emilia dataset)
Japanese: 100 examples (Emilia dataset)
Bangla: 100 examples (Verbex's in-house TTS dataset)

Model	Lang	MCD ↓	MSE ↓	SpeechBERTScore ↑	SpeechBLEU ↑	SpeechTokenDist ↑
XCODEC	BN	2.823	0.003	0.939	0.500	0.816
	EN	3.166	0.012	0.962	0.660	0.856
	JA	3.021	0.010	0.948	0.582	0.838
	Overall	3.003	0.008	0.949	0.581	0.837
XCODEC2 (Ours)	BN	2.712	0.003	0.940	0.508	0.817
	EN	3.206	0.014	0.957	0.644	0.851
	JA	3.022	0.012	0.946	0.573	0.838
	Overall	2.980	0.010	0.948	0.575	0.835
hubertsiuzdak/snac_24khz	BN	3.104	0.002	0.911	0.442	0.785
	EN	3.983	0.014	0.912	0.541	0.797
	JA	3.512	0.009	0.903	0.472	0.761
	Overall	3.533	0.008	0.909	0.485	0.781
nvidia/low-frame-rate-speech-codec-22khz	BN	2.247	0.000	0.957	0.589	0.863
	EN	2.867	0.007	0.969	0.707	0.872
	JA	2.677	0.003	0.955	0.614	0.853
	Overall	2.597	0.003	0.960	0.636	0.863

SpeechBERTScore, SpeechBLEU and SpeechTokenDistance are calculated using https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics
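
The card does not say how the Overall rows are derived. They appear to be unweighted means of the three per-language scores, and the quick check below, using the MCD column copied from the table, is consistent with that reading. Treat this as an inference rather than a documented procedure.

```python
# MCD values copied from the evaluation table
mcd = {
    "XCODEC": {"BN": 2.823, "EN": 3.166, "JA": 3.021, "Overall": 3.003},
    "XCODEC2 (Ours)": {"BN": 2.712, "EN": 3.206, "JA": 3.022, "Overall": 2.980},
    "hubertsiuzdak/snac_24khz": {"BN": 3.104, "EN": 3.983, "JA": 3.512, "Overall": 3.533},
    "nvidia/low-frame-rate-speech-codec-22khz": {"BN": 2.247, "EN": 2.867, "JA": 2.677, "Overall": 2.597},
}

def macro_average(row):
    """Unweighted mean over the three evaluated languages."""
    return sum(row[lang] for lang in ("BN", "EN", "JA")) / 3

for model, row in mcd.items():
    print(f"{model}: mean={macro_average(row):.3f} reported={row['Overall']:.3f}")
```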

✅ Intended Use

This model is suitable for:

Speech tokenization in TTS pipelines
Low-bitrate speech compression
Code-based speech synthesis or generation tasks
Multimodal LLM, audio LM, and speech-to-speech modeling
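
As an illustration of the tokenization use case, codec IDs are often mapped to special text tokens so that a language model can read and emit them. The sketch below shows one way to do that round trip. The <|s_N|> token format and the function names are hypothetical choices for this example and are not defined by this model or by xcodec2; use whatever vocabulary your LLM was trained with.

```python
import re

TOKEN_TEMPLATE = "<|s_{}|>"                  # hypothetical format, not defined by this model
TOKEN_PATTERN = re.compile(r"<\|s_(\d+)\|>")

def codes_to_text(codes):
    """Serialize a sequence of integer codec IDs into special-token text."""
    return "".join(TOKEN_TEMPLATE.format(int(c)) for c in codes)

def text_to_codes(text):
    """Recover the integer codec IDs from special-token text."""
    return [int(m) for m in TOKEN_PATTERN.findall(text)]

codes = [12, 40321, 7]  # made-up IDs for illustration
text = codes_to_text(codes)
print(text)  # <|s_12|><|s_40321|><|s_7|>
assert text_to_codes(text) == codes
```

With a real encoding, the input list would come from the codes returned by encode_code, for example vq_code[0, 0].tolist() if the tensor has shape (1, 1, T); that shape is an assumption here and should be checked against your installed xcodec2 version.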

🚫 Limitations

Licensed for non-commercial use only

📄 License

This model is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
Commercial usage is not allowed.

SPDX Identifier: CC-BY-NC-4.0
License Details: https://creativecommons.org/licenses/by-nc/4.0

📬 Contact

For research collaborations, feedback, or commercial licensing inquiries, please reach out to:

hishab/titu-xcodec2