XLS-R Deep Learning Model for Multilingual ASR
A fine-tuned model for Automatic Speech Recognition on low-resource Southeast Asian languages: Indonesian, Javanese, and Sundanese.
This is a fine-tuned version of the facebook/wav2vec2-xls-r-300m model for Automatic Speech Recognition (ASR). It is the official model from the research paper "XLS-R DEEP LEARNING MODEL FOR MULTILINGUAL ASR ON LOW-RESOURCE LANGUAGES: INDONESIAN, JAVANESE, AND SUNDANESE".
The goal of this research is to improve ASR performance in converting spoken language into written text for Indonesian, Javanese, and Sundanese. The model's accuracy is significantly enhanced by integrating a 5-gram KenLM language model, which substantially reduces the Word Error Rate (WER).
*Figure: Proposed methodology (diagram from the paper).*
Model Details
- Base Model: facebook/wav2vec2-xls-r-300m (the 300-million-parameter version).
- Task: Multilingual Automatic Speech Recognition (ASR).
- Languages: Indonesian, Javanese, Sundanese.
- Library: Transformers.
- Framework: Deep Learning, based on the wav2vec 2.0 and Transformer architecture.
Authors
- Panji Arisaputra
- Alif Tri Handoyo
- Amalia Zahra
Computer Science Department, Bina Nusantara University, Jakarta, Indonesia.
Datasets Used for Training
A total of seven datasets were combined for this study to create a robust multilingual corpus.
Speech Data for Fine-Tuning:
- Indonesian:
- TITML-IDN: A phonetically balanced collection of 343 sentences from 20 speakers, totaling 14.5 hours of audio.
- Magic Data Corpus: 3.5 hours of scripted speeches from 10 Indonesian speakers.
- Common Voice (Indonesian): Utilized the train, validation, and test subsets, comprising 5,809 instances from 170 individuals.
- Javanese & Sundanese:
- OpenSLR SLR35 & SLR36: Large ASR training datasets with speech recordings from native speakers. Due to computational limits, only the first three .zip files from each dataset were used.
- OpenSLR SLR41 & SLR44: High-quality Text-to-Speech (TTS) data used for ASR training.
Text Data for Language Model:
- In addition to the transcripts from the datasets above, the OSCAR corpus (`unshuffled_deduplicated_id` subset) was used to augment the text corpus for building the KenLM language model. Only 6% of its 2.3 billion Indonesian words were used so that OSCAR would not disproportionately influence the model.
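For illustration, a 5-gram KenLM model of the kind used here can be built from a plain-text corpus with KenLM's `lmplz` tool (a sketch with placeholder file names; the paper does not list its exact commands):

```bash
# Build a 5-gram ARPA language model from a corpus with one normalized sentence per line.
lmplz -o 5 < text_corpus.txt > 5gram.arpa
```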
Data Preprocessing
The datasets underwent a standardized preprocessing pipeline:
- Data Splitting: Each dataset was split into 90% for training and 10% for testing. The training portion was further subdivided into 90% for the train set and 10% for the validation set.
- Audio Standardization: Audio files were converted to WAV format with a single channel and resampled to a 16 kHz sampling rate.
- Text Normalization: Transcriptions were cleaned by removing special characters and converting all text to lowercase.
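A minimal sketch of these steps, assuming `librosa`/`soundfile` for audio conversion and Hugging Face `datasets` for splitting (the regex, seed, and file names are illustrative assumptions, not the paper's exact code):

```python
import re

import librosa
import soundfile as sf
from datasets import Dataset

# Text normalization: strip special characters and lowercase
# (the exact character set removed here is an assumption).
CHARS_TO_REMOVE = re.compile(r"[\,\?\.\!\-\;\:\"\%\'\(\)]")

def normalize_text(transcription: str) -> str:
    return CHARS_TO_REMOVE.sub("", transcription).lower()

def standardize_audio(in_path: str, out_path: str) -> None:
    # Convert to single-channel WAV resampled to 16 kHz.
    speech, _ = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, speech, 16000)

# 90/10 train/test split, then 90/10 train/validation split of the training portion.
dataset = Dataset.from_dict({
    "path": [f"clip_{i:04d}.wav" for i in range(100)],  # placeholder rows
    "sentence": ["contoh kalimat"] * 100,
})
split = dataset.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)
train_set, val_set, test_set = train_val["train"], train_val["test"], split["test"]
```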
Evaluation and Results
The model was evaluated against a previous model, XLSR-53, using the Word Error Rate (WER) metric. The integration of a 5-gram KenLM language model demonstrated the best overall performance, achieving a significant reduction in WER and establishing a new benchmark for these languages.
The XLS-R 300m model maintains a competitive edge by supporting Javanese and Sundanese in addition to Indonesian, which the previous XLSR-53 model did not.
WER (%) on each dataset's test set (lower is better):

| Model | KenLM | TITML-IDN | Magic Data | Common Voice | SLR 35 | SLR 36 | SLR 41 | SLR 44 | AVG |
|---|---|---|---|---|---|---|---|---|---|
| XLS-R 300m ASR multilingual model | — | 7.73 | 19.64 | 15.30 | 17.95 | 2.39 | 21.99 | 7.10 | 13.16 |
| | 2-gram | 1.79 | 10.93 | 6.55 | 7.76 | 1.20 | 10.90 | 3.58 | 6.10 |
| | 3-gram | 1.39 | 10.38 | 5.63 | 6.50 | 1.15 | 10.41 | 3.47 | 5.56 |
| | 4-gram | 1.37 | 10.38 | 5.11 | 6.38 | 1.15 | 10.31 | 3.47 | 5.45 |
| | 5-gram | 1.37 | 10.38 | 4.99 | 6.41 | 1.14 | 10.25 | 3.44 | **5.43** |
| | 6-gram | 1.37 | 10.38 | 5.01 | 6.41 | 1.14 | 10.35 | 3.44 | 5.44 |
| XLSR-53 ASR model | — | 2.17 | 16.75 | — | — | — | — | — | 9.46 |
| | 2-gram | 0.77 | 10.78 | — | — | — | — | — | 5.77 |
| | 3-gram | 0.72 | 10.88 | — | — | — | — | — | 5.80 |
| | 4-gram | 0.72 | 10.88 | — | — | — | — | — | 5.80 |
| | 5-gram | 0.72 | 10.93 | — | — | — | — | — | 5.82 |
*WER results extracted from Table 4 of the research paper.*
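For reference, WER counts word-level substitutions, deletions, and insertions against the reference transcript. A quick sketch using the `jiwer` library (an assumption for illustration; the paper does not name its evaluation tooling):

```python
import jiwer

# WER = (substitutions + deletions + insertions) / number of reference words.
reference = "saya pergi ke pasar"
hypothesis = "saya pergi pasar"  # one deletion out of four reference words
print(jiwer.wer(reference, hypothesis))  # 0.25
```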
How to Use
You can use this model with the `transformers` library's `pipeline` API. For optimal performance, as demonstrated in the research paper, we strongly recommend integrating the provided 5-gram KenLM language model.
```bash
pip install transformers torch torchaudio librosa
# For decoding with the language model:
pip install pyctcdecode==0.4.0 kenlm
```
```python
from transformers import AutoProcessor, AutoModelForCTC, pipeline
import torch, librosa

# Load the processor and model from the Hub
processor = AutoProcessor.from_pretrained("panjiariputra/multilingual-xls_r_300m-LARGE-5gram")
model = AutoModelForCTC.from_pretrained("panjiariputra/multilingual-xls_r_300m-LARGE-5gram")

# Initialize the ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0 if torch.cuda.is_available() else -1,
)

# Load an audio file (must be 16 kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Run transcription
transcription = asr_pipeline(speech_array)
print(transcription)  # {'text': 'your transcribed text here'}
```
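The pipeline also accepts a file path directly and will decode and resample the audio itself; this relies on standard `transformers` pipeline behavior (it requires ffmpeg) and is not specific to this model:

```python
# Passing the file path lets the pipeline handle decoding and resampling via ffmpeg.
transcription = asr_pipeline("path/to/your/audio.wav")
print(transcription["text"])
```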
With Language Model Integration (Recommended)
For the best accuracy and lowest Word Error Rate (WER), use `pyctcdecode` with the 5-gram KenLM model (`5gram.arpa`) available in this repository.
```python
from pyctcdecode import build_ctcdecoder
import torch, librosa

# Reuses `processor` and `model` loaded in the previous snippet.

# Sort the vocabulary by token id and build the CTC decoder with the KenLM model
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
decoder = build_ctcdecoder(
    labels=list(sorted_vocab.keys()),
    kenlm_model_path="path/to/your/5gram.arpa",
)

# Load audio (16 kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, _ = librosa.load(audio_path, sr=16000)

# Get the model's frame-level logits
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits.cpu().numpy()[0]

# Decode the logits with beam search and the KenLM language model
lm_transcription = decoder.decode(logits)
print({"text": lm_transcription})  # {'text': 'your more accurate transcribed text here'}
```
Publication and Citation
This work was published in ICIC Express Letters, Part B: Applications (DOI: 10.24507/icicelb.15.06.551). If you use this model or the findings from the paper in your research, please cite:
```bibtex
@article{Arisaputra2024XLS,
  author  = {Panji Arisaputra and Alif Tri Handoyo and Amalia Zahra},
  title   = {XLS-R DEEP LEARNING MODEL FOR MULTILINGUAL ASR ON LOW-RESOURCE LANGUAGES: INDONESIAN, JAVANESE, AND SUNDANESE},
  journal = {ICIC Express Letters, Part B: Applications},
  volume  = {15},
  number  = {6},
  pages   = {551--559},
  year    = {2024},
  doi     = {10.24507/icicelb.15.06.551},
  issn    = {2185-2766}
}
```