|
---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---
|
## Model Details |
|
|
|
### Model Description |
|
|
|
A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework. Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
|
|
|
- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler |
|
- **Funded by:** Partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558)
|
- **Model type:** Linear projector in a speechLLM framework |
|
- **Supported Language(s):** English, Italian, Spanish, German, French |
|
- **License:** CC-BY-4.0 |
|
|
|
## Uses |
|
|
|
This model is trained for automatic speech recognition (ASR) in its five supported languages: English, Italian, Spanish, German, and French.
|
|
|
## How to Get Started with the Model |
|
|
|
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase. Please refer to the instructions there for further details.
|
|
|
Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
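
For example, both frozen components can be fetched ahead of time with the `huggingface_hub` library. This is a minimal sketch; the resulting cache paths are simply passed to whichever SLAM-ASR script you run:

```python
# Sketch: pre-download the frozen speech encoder and LLM from the Hugging Face Hub.
# snapshot_download returns the local path of each repository snapshot; point the
# SLAM-ASR fine-tuning/decoding scripts at these paths.
from huggingface_hub import snapshot_download

encoder_path = snapshot_download("openai/whisper-large-v3-turbo")
llm_path = snapshot_download("utter-project/EuroLLM-1.7B")
print(encoder_path, llm_path)
```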
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The linear projector was trained with a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering five languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data plus 7.5 hours of Fleurs data per language (100 hours per language), while the validation set consisted of 47 minutes of Common Voice data plus 47 minutes of Fleurs data per language.
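
As an illustration of how a fixed-hour budget can be carved out of a speech corpus, here is a minimal sketch using the `datasets` library on Fleurs. The 7.5-hour budget comes from the description above; the language config, split, and greedy selection strategy are illustrative assumptions, not the procedure used for this model:

```python
# Sketch: select roughly 7.5 hours of Fleurs training data for one language.
from datasets import load_dataset

TARGET_SECONDS = 7.5 * 3600

def take_hours(dataset, target_seconds):
    """Greedily keep utterances until the duration budget is reached."""
    total, keep = 0.0, []
    for i, example in enumerate(dataset):
        audio = example["audio"]
        total += len(audio["array"]) / audio["sampling_rate"]
        keep.append(i)
        if total >= target_seconds:
            break
    return dataset.select(keep)

fleurs_it = load_dataset("google/fleurs", "it_it", split="train")
subset_it = take_hours(fleurs_it, TARGET_SECONDS)
print(f"kept {len(subset_it)} utterances")
```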
|
|
|
### Training Procedure |
|
|
|
* The model was trained using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) with `torchrun`.
* Only the linear projector was trained; a sketch of its likely architecture follows this list.
* The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.
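
The card does not spell out the projector internals, but a concatenation-based linear projector in the style of SLAM-ASR, instantiated with the dimensions from the hyperparameter table below (`encoder_dim` 1280, `encoder_projector_ds_rate` 5, `llm_dim` 2048) and an assumed 2048-unit hidden layer, reproduces the stated 17.31M parameter count:

```python
# Sketch of a SLAM-ASR-style "concat" linear projector. Dimensions follow the
# hyperparameter table; the 2048-unit hidden layer is an assumption that
# matches the reported 17.31M parameters.
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim * ds_rate, hidden),  # 6400 -> 2048
            nn.ReLU(),
            nn.Linear(hidden, llm_dim),                # 2048 -> 2048
        )

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t = t - t % self.ds_rate  # drop frames that do not fill a full group
        # Stack ds_rate adjacent frames, downsampling the sequence 5x.
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.proj(x)  # (batch, t // ds_rate, llm_dim)

projector = LinearProjector()
print(sum(p.numel() for p in projector.parameters()))  # 17,305,600 ≈ 17.31M
```

The projected frames are fed to the frozen LLM together with its text embeddings, which is why training this module alone is sufficient to adapt the pipeline to ASR.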
|
|
|
#### Training Hyperparameters |
|
|
|
| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |
|
|
|
## Evaluation |
|
|
|
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library. Before computing the WER, ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and its `BasicTextNormalizer` for all other languages. Beam search decoding was used with a beam size of 4.
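
A minimal sketch of this scoring recipe, using the normalizers shipped with the `openai-whisper` package and the `evaluate` library; the example transcripts are placeholders, not model output:

```python
# Sketch: normalise transcripts as described above, then compute WER in percent.
import evaluate
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

wer_metric = evaluate.load("wer")

def score(predictions, references, lang):
    # EnglishTextNormalizer for English, BasicTextNormalizer for the rest.
    norm = EnglishTextNormalizer() if lang == "en" else BasicTextNormalizer()
    predictions = [norm(p) for p in predictions]
    references = [norm(r) for r in references]
    return 100 * wer_metric.compute(predictions=predictions, references=references)

print(score(["hello world"], ["Hello, world!"], lang="en"))  # 0.0
```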
|
|
|
### Results |
|
|
|
| Dataset | Language | WER (%) ↓ |
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |
|
|
|
## Acknowledgements |
|
<img src="images/eloquence_eu.png" align="center" width="30%"> |
|
This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558). |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
Please cite the associated Interspeech 2025 paper when using this model: |
|
|
|
```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```