Automatic Speech Recognition
Transformers
PyTorch
Indonesian
wav2vec2
speech-recognition
indonesian
xlsr-53

Indonesian Automatic Speech Recognition with XLSR-53

A fine-tuned model for Indonesian Automatic Speech Recognition (ASR) that achieves a competitive Word Error Rate (WER), reduced further by decoding with a 4-gram KenLM language model.

How to Use · Evaluation Results · Citation · Try on Spaces · Read the Paper


This repository contains the official fine-tuned model from the research paper "Indonesian Automatic Speech Recognition with XLSR-53". The study focuses on developing a robust Indonesian ASR system by fine-tuning the pre-trained cross-lingual XLSR-53 (facebook/wav2vec2-large-xlsr-53) model.

The key contribution of this work is demonstrating that a competitive Word Error Rate (WER) can be achieved with a relatively small dataset (24 hours). The model's accuracy is significantly boosted by integrating a 4-gram KenLM language model, which successfully reduces the WER from 20% to 12% on the Common Voice test set.

Figure: Proposed methodology (see the paper).


Model Details

  • Base Model: This model is built upon the wav2vec 2.0 architecture, specifically the XLSR-53 pre-trained model (facebook/wav2vec2-large-xlsr-53) which was trained on 53 languages.
  • Task: Automatic Speech Recognition (ASR).
  • Language: Indonesian (id).
  • Library: Transformers.
  • Framework: The approach involves fine-tuning the pre-trained model using a Connectionist Temporal Classification (CTC) loss function.
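
A minimal sketch of what this CTC fine-tuning setup can look like with the transformers library. The vocabulary file, special tokens, and frozen-encoder choice below are illustrative assumptions based on the standard wav2vec 2.0 fine-tuning recipe, not the paper's exact configuration:

from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Hypothetical vocab.json built from the normalized training transcripts
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the XLSR-53 checkpoint with a CTC head sized to the Indonesian vocabulary
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # the convolutional encoder is typically kept frozen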

Authors

  • Panji Arisaputra
  • Amalia Zahra

Computer Science Department, BINUS Graduate Program, Bina Nusantara University, Jakarta, Indonesia.


Datasets Used for Training

A total of three speech datasets were combined to fine-tune the model, and an additional large text corpus was used to build the language model.

Speech Data for Fine-Tuning:

The total combined duration of speech data is 24 hours, 18 minutes, and 1 second.

  • TITML-IDN: A clean speech corpus containing 14.5 hours of audio from 20 speakers reading phonetically balanced sentences.
  • Magic Data: A 3.5-hour corpus of scripted daily-use sentences from 10 speakers, recorded in various environments.
  • Common Voice (Indonesian): A crowdsourced dataset containing ~6.2 hours of speech from 170 speakers in diverse, non-clean environments.
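
TITML-IDN and Magic Data are distributed by their respective providers; the Indonesian Common Voice split can be pulled from the Hugging Face Hub. A sketch, assuming the legacy common_voice dataset script (the Common Voice snapshot used in the paper may differ):

from datasets import load_dataset, Audio

# Indonesian subset of Common Voice; the dataset script/version here is an assumption
cv_id = load_dataset("common_voice", "id", split="train+validation")

# Decode audio at 16 kHz to match the model's expected sampling rate
cv_id = cv_id.cast_column("audio", Audio(sampling_rate=16000))
print(cv_id[0]["sentence"])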

Text Data for Language Model:

  • In addition to the transcripts from the speech datasets, the OSCAR corpus (unshuffled_deduplicated_id subset) was used to build the KenLM language model. To ensure a balanced vocabulary, only 6% of its 2.3 billion Indonesian words were included.
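
A minimal sketch of building a 4-gram KenLM model from such a text corpus, assuming the KenLM binaries (lmplz) are compiled and on PATH; the paper's exact build settings are not reproduced here:

import subprocess

corpus_path = "oscar_id_subset.txt"  # hypothetical file: normalized text, one sentence per line
arpa_path = "4gram.arpa"

# lmplz reads the corpus from stdin and writes the ARPA-format LM to stdout
with open(corpus_path) as fin, open(arpa_path, "w") as fout:
    subprocess.run(["lmplz", "-o", "4"], stdin=fin, stdout=fout, check=True)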

Data Preprocessing

The datasets underwent a standardized preprocessing pipeline:

  1. Data Splitting: Datasets were split into training (90%) and validation (10%) sets.
  2. Audio Standardization: All audio files were converted to WAV format with a single channel and resampled to a 16 kHz sampling rate to match the pre-trained model's requirements.
  3. Text Normalization: Transcriptions were cleaned by removing special characters and converting all text to lowercase to create a unified vocabulary.
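
A minimal sketch of steps 2 and 3, assuming librosa and soundfile; the exact character set kept during normalization is an assumption:

import re
import librosa
import soundfile as sf

def standardize_audio(in_path: str, out_path: str) -> None:
    # Convert to mono and resample to 16 kHz, as required by XLSR-53
    speech, _ = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, speech, 16000)

def normalize_text(transcript: str) -> str:
    # Lowercase and drop characters outside the working alphabet
    return re.sub(r"[^a-z ]", "", transcript.lower()).strip()

print(normalize_text("Selamat Pagi, Dunia!"))  # -> "selamat pagi dunia"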

Evaluation and Results

The model was evaluated against a similar model from a previous study by Syahputra & Zahra (2021), using the Word Error Rate (WER) metric. The evaluation on the Common Voice test split serves as the primary benchmark.

The results show that this XLSR-53 model outperforms the previous wav2vec 2.0-based model. The integration of a 4-gram KenLM language model was crucial, providing an absolute WER reduction of roughly 8 percentage points (from 20.306% down to 12.213%).

| Model | Training & Validation Data | Language Model | Test Set | WER (%) |
|---|---|---|---|---|
| This study (XLSR-53) | TITML-IDN + Magic Data + Common Voice (24h 18m) | – | Common Voice | 20.306 |
| This study (XLSR-53) | TITML-IDN + Magic Data + Common Voice (24h 18m) | 4-gram KenLM | Common Voice | 12.213 |
| Benchmark (Syahputra & Zahra, 2021) | BahasaKita batches 10–12 (75h) | – | Common Voice | 21.000 |
| Benchmark (Syahputra & Zahra, 2021) | BahasaKita batches 10–12 (75h) | 3-gram KenLM | Common Voice | 41.000 |

*WER results extracted from Table 3 of the research paper. The benchmark model's high WER with LM is noted in the paper.
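
For reference, WER counts word-level substitutions, insertions, and deletions against the reference transcript. It can be computed with, for example, the jiwer library (not used in the paper; shown only as a convenience):

from jiwer import wer

reference = "saya sedang belajar pengenalan suara otomatis"
hypothesis = "saya belajar pengenalan suara otomatis"

# One deletion out of six reference words -> WER of about 0.167
print(wer(reference, hypothesis))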


How to Use

You can use this model with the transformers library pipeline. For optimal performance, as demonstrated in the research paper, we strongly recommend integrating the provided 4-gram KenLM language model.

pip install transformers torch torchaudio librosa
# For decoding with the language model:
pip install pyctcdecode==0.4.0 kenlm

Without Language Model

from transformers import AutoProcessor, AutoModelForCTC, pipeline
import torch
import librosa

# Load processor and model
processor = AutoProcessor.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")
model = AutoModelForCTC.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")

# Initialize the ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0 if torch.cuda.is_available() else -1,
)

# Load an audio file (must be 16 kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Run transcription
transcription = asr_pipeline(speech_array)
print(transcription)  # Output: {'text': '...transcribed text...'}

With Language Model Integration (Recommended)

For the best accuracy and lowest Word Error Rate (WER), use pyctcdecode with the 4-gram KenLM model (e.g., 4gram.arpa) created during the research.

from transformers import AutoProcessor, AutoModelForCTC
from pyctcdecode import build_ctcdecoder
import torch
import librosa

# Load processor and model
processor = AutoProcessor.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")
model = AutoModelForCTC.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")

# Get vocabulary and build the decoder with the language model
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="path/to/your/4gram.arpa"  # Path to your KenLM model
)

# Load audio (16kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, _ = librosa.load(audio_path, sr=16000)

# Get model logits
with torch.no_grad():
    inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
    logits = model(**inputs).logits.cpu().numpy()[0]

# Decode using KenLM
lm_transcription = decoder.decode(logits)
print({"text": lm_transcription})
# Output: {'text': '...more accurate transcribed text...'}
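
pyctcdecode also exposes a language-model weight (alpha) and a word-insertion bonus (beta) that can be tuned on a held-out set. The values below are illustrative defaults, not the ones from the paper; the snippet reuses sorted_vocab_dict from the example above:

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="path/to/your/4gram.arpa",
    alpha=0.5,  # weight of the language model score (assumed value)
    beta=1.5,   # per-word insertion bonus to offset the LM's length penalty (assumed value)
)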

Publication and Citation

This work was published in Ingénierie des Systèmes d'Information, Vol. 27, No. 6, December 2022 (DOI: 10.18280/isi.270614). If you use this model or the findings from the paper in your research, please cite:

@article{Arisaputra2022XLSR53,
  author    = {Panji Arisaputra and Amalia Zahra},
  title     = {Indonesian Automatic Speech Recognition with XLSR-53},
  journal   = {Ingénierie des Systèmes d'Information},
  volume    = {27},
  number    = {6},
  pages     = {973--982},
  year      = {2022},
  doi       = {10.18280/isi.270614}
}