XLS-R Deep Learning Model for Multilingual ASR
A fine-tuned model for Automatic Speech Recognition on low-resource Southeast Asian languages: Indonesian, Javanese, and Sundanese.
This is a fine-tuned version of the facebook/wav2vec2-xls-r-300m model for Automatic Speech Recognition (ASR). It is the official model from the research paper "XLS-R DEEP LEARNING MODEL FOR MULTILINGUAL ASR ON LOW-RESOURCE LANGUAGES: INDONESIAN, JAVANESE, AND SUNDANESE".
The goal of this research is to improve ASR performance in converting spoken language into written text for Indonesian, Javanese, and Sundanese. The model's accuracy is significantly enhanced by integrating a 5-gram KenLM language model, which substantially reduces the Word Error Rate (WER).
*Figure: Proposed methodology (diagram from the paper).*
Model Details
- Base Model: facebook/wav2vec2-xls-r-300m (the 300-million-parameter version).
- Task: Multilingual Automatic Speech Recognition (ASR).
- Languages: Indonesian, Javanese, Sundanese.
- Library: Transformers.
- Framework: Deep Learning, based on the wav2vec 2.0 and Transformer architecture.
Authors
- Panji Arisaputra
- Alif Tri Handoyo
- Amalia Zahra
Computer Science Department, Bina Nusantara University, Jakarta, Indonesia.
Datasets Used for Training
A total of seven datasets were combined for this study to create a robust multilingual corpus.
Speech Data for Fine-Tuning:
- Indonesian:
- TITML-IDN: A phonetically balanced collection of 343 sentences from 20 speakers, totaling 14.5 hours of audio.
- Magic Data Corpus: 3.5 hours of scripted speeches from 10 Indonesian speakers.
- Common Voice (Indonesian): Utilized the train, validation, and test subsets, comprising 5,809 instances from 170 individuals.
- Javanese & Sundanese:
- OpenSLR SLR35 & SLR36: Large ASR training datasets with speech recordings from native speakers. Due to computational limits, only the first three .zip files from each dataset were used.
- OpenSLR SLR41 & SLR44: High-quality Text-to-Speech (TTS) data used for ASR training.
Text Data for Language Model:
- In addition to the transcripts from the datasets above, the OSCAR corpus (`unshuffled_deduplicated_id` subset) was used to augment the text corpus for building the KenLM language model. Only 6% of its 2.3 billion Indonesian words were used so that OSCAR would not disproportionately influence the model.
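For illustration, a 5-gram KenLM model of the kind used here can be built from a plain-text corpus with KenLM's `lmplz` tool (a sketch with placeholder file names; the paper does not list its exact commands):

```bash
# Build a 5-gram ARPA language model from a corpus with one normalized sentence per line.
lmplz -o 5 < text_corpus.txt > 5gram.arpa
```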
Data Preprocessing
The datasets underwent a standardized preprocessing pipeline:
- Data Splitting: Each dataset was split into 90% for training and 10% for testing. The training portion was further subdivided into 90% for the train set and 10% for the validation set.
- Audio Standardization: Audio files were converted to WAV format with a single channel and resampled to a 16 kHz sampling rate.
- Text Normalization: Transcriptions were cleaned by removing special characters and converting all text to lowercase.
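A minimal sketch of these steps, assuming `librosa`/`soundfile` for audio conversion and Hugging Face `datasets` for splitting (the regex, seed, and file names are illustrative assumptions, not the paper's exact code):

```python
import re

import librosa
import soundfile as sf
from datasets import Dataset

# Text normalization: strip special characters and lowercase
# (the exact character set removed here is an assumption).
CHARS_TO_REMOVE = re.compile(r"[\,\?\.\!\-\;\:\"\%\'\(\)]")

def normalize_text(transcription: str) -> str:
    return CHARS_TO_REMOVE.sub("", transcription).lower()

def standardize_audio(in_path: str, out_path: str) -> None:
    # Convert to single-channel WAV resampled to 16 kHz.
    speech, _ = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, speech, 16000)

# 90/10 train/test split, then 90/10 train/validation split of the training portion.
dataset = Dataset.from_dict({
    "path": [f"clip_{i:04d}.wav" for i in range(100)],  # placeholder rows
    "sentence": ["contoh kalimat"] * 100,
})
split = dataset.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)
train_set, val_set, test_set = train_val["train"], train_val["test"], split["test"]
```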
Evaluation and Results
The model was evaluated against a previous model, XLSR-53, using the Word Error Rate (WER) metric. The integration of a 5-gram KenLM language model demonstrated the best overall performance, achieving a significant reduction in WER and establishing a new benchmark for these languages.
The XLS-R 300m model maintains a competitive edge by supporting Javanese and Sundanese in addition to Indonesian, which the previous XLSR-53 model did not.
WER (%) on each dataset's test set (lower is better):

| Model | KenLM | TITML-IDN | Magic Data | Common Voice | SLR 35 | SLR 36 | SLR 41 | SLR 44 | AVG |
|---|---|---|---|---|---|---|---|---|---|
| XLS-R 300m ASR multilingual model | — | 7.73 | 19.64 | 15.30 | 17.95 | 2.39 | 21.99 | 7.10 | 13.16 |
| | 2-gram | 1.79 | 10.93 | 6.55 | 7.76 | 1.20 | 10.90 | 3.58 | 6.10 |
| | 3-gram | 1.39 | 10.38 | 5.63 | 6.50 | 1.15 | 10.41 | 3.47 | 5.56 |
| | 4-gram | 1.37 | 10.38 | 5.11 | 6.38 | 1.15 | 10.31 | 3.47 | 5.45 |
| | 5-gram | 1.37 | 10.38 | 4.99 | 6.41 | 1.14 | 10.25 | 3.44 | **5.43** |
| | 6-gram | 1.37 | 10.38 | 5.01 | 6.41 | 1.14 | 10.35 | 3.44 | 5.44 |
| XLSR-53 ASR model | — | 2.17 | 16.75 | — | — | — | — | — | 9.46 |
| | 2-gram | 0.77 | 10.78 | — | — | — | — | — | 5.77 |
| | 3-gram | 0.72 | 10.88 | — | — | — | — | — | 5.80 |
| | 4-gram | 0.72 | 10.88 | — | — | — | — | — | 5.80 |
| | 5-gram | 0.72 | 10.93 | — | — | — | — | — | 5.82 |
*WER results extracted from Table 4 of the research paper.*
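For reference, WER counts word-level substitutions, deletions, and insertions against the reference transcript. A quick sketch using the `jiwer` library (an assumption for illustration; the paper does not name its evaluation tooling):

```python
import jiwer

# WER = (substitutions + deletions + insertions) / number of reference words.
reference = "saya pergi ke pasar"
hypothesis = "saya pergi pasar"  # one deletion out of four reference words
print(jiwer.wer(reference, hypothesis))  # 0.25
```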
How to Use
You can use this model with the `transformers` library's `pipeline` API. For optimal performance, as demonstrated in the research paper, we strongly recommend integrating the provided 5-gram KenLM language model.
```bash
pip install transformers torch torchaudio librosa
# For decoding with the language model:
pip install pyctcdecode==0.4.0 kenlm
```
```python
from transformers import AutoProcessor, AutoModelForCTC, pipeline
import torch, librosa

# Load the processor and model from the Hub
processor = AutoProcessor.from_pretrained("panjiariputra/multilingual-xls_r_300m-LARGE-5gram")
model = AutoModelForCTC.from_pretrained("panjiariputra/multilingual-xls_r_300m-LARGE-5gram")

# Initialize the ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0 if torch.cuda.is_available() else -1,
)

# Load an audio file (must be 16 kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Run transcription
transcription = asr_pipeline(speech_array)
print(transcription)  # {'text': 'your transcribed text here'}
```
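The pipeline also accepts a file path directly and will decode and resample the audio itself; this relies on standard `transformers` pipeline behavior (it requires ffmpeg) and is not specific to this model:

```python
# Passing the file path lets the pipeline handle decoding and resampling via ffmpeg.
transcription = asr_pipeline("path/to/your/audio.wav")
print(transcription["text"])
```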
With Language Model Integration (Recommended)
For the best accuracy and lowest Word Error Rate (WER), use `pyctcdecode` with the 5-gram KenLM model (`5gram.arpa`) available in this repository.
```python
from pyctcdecode import build_ctcdecoder
import torch, librosa

# Reuses `processor` and `model` loaded in the previous snippet.

# Sort the vocabulary by token id and build the CTC decoder with the KenLM model
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
decoder = build_ctcdecoder(
    labels=list(sorted_vocab.keys()),
    kenlm_model_path="path/to/your/5gram.arpa",
)

# Load audio (16 kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, _ = librosa.load(audio_path, sr=16000)

# Get the model's frame-level logits
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits.cpu().numpy()[0]

# Decode the logits with beam search and the KenLM language model
lm_transcription = decoder.decode(logits)
print({"text": lm_transcription})  # {'text': 'your more accurate transcribed text here'}
```
Publication and Citation
This work was published in ICIC Express Letters, Part B: Applications (DOI: 10.24507/icicelb.15.06.551). If you use this model or the findings from the paper in your research, please cite:
```bibtex
@article{Arisaputra2024XLS,
  author  = {Panji Arisaputra and Alif Tri Handoyo and Amalia Zahra},
  title   = {XLS-R DEEP LEARNING MODEL FOR MULTILINGUAL ASR ON LOW-RESOURCE LANGUAGES: INDONESIAN, JAVANESE, AND SUNDANESE},
  journal = {ICIC Express Letters, Part B: Applications},
  volume  = {15},
  number  = {6},
  pages   = {551--559},
  year    = {2024},
  doi     = {10.24507/icicelb.15.06.551},
  issn    = {2185-2766}
}
```