|
---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---
|
## Model Details |
|
|
|
### Model Description |
|
|
|
A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework. Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
|
|
|
- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler |
|
- **Funded by:** Partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558)
|
- **Model type:** Linear projector in a speechLLM framework |
|
- **Supported Language(s):** English, Italian, Spanish, German, French |
|
- **License:** CC-BY-4.0 |
|
|
|
## Uses |
|
|
|
This model is trained for automatic speech recognition (ASR) in its five supported languages: English, Italian, Spanish, German, and French.
|
|
|
## How to Get Started with the Model |
|
|
|
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase. Please refer to the instructions there for further details.
|
|
|
Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
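
For example, both frozen components can be fetched ahead of time with the `huggingface_hub` library. This is a minimal sketch; the resulting cache paths are simply passed to whichever SLAM-ASR script you run:

```python
# Sketch: pre-download the frozen speech encoder and LLM from the Hugging Face Hub.
# snapshot_download returns the local path of each repository snapshot; point the
# SLAM-ASR fine-tuning/decoding scripts at these paths.
from huggingface_hub import snapshot_download

encoder_path = snapshot_download("openai/whisper-large-v3-turbo")
llm_path = snapshot_download("utter-project/EuroLLM-1.7B")
print(encoder_path, llm_path)
```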
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The linear projector was trained with a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering five languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data plus 7.5 hours of Fleurs data per language (100 hours per language), while the validation set consisted of 47 minutes of Common Voice data plus 47 minutes of Fleurs data per language.
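
As an illustration of how a fixed-hour budget can be carved out of a speech corpus, here is a minimal sketch using the `datasets` library on Fleurs. The 7.5-hour budget comes from the description above; the language config, split, and greedy selection strategy are illustrative assumptions, not the procedure used for this model:

```python
# Sketch: select roughly 7.5 hours of Fleurs training data for one language.
from datasets import load_dataset

TARGET_SECONDS = 7.5 * 3600

def take_hours(dataset, target_seconds):
    """Greedily keep utterances until the duration budget is reached."""
    total, keep = 0.0, []
    for i, example in enumerate(dataset):
        audio = example["audio"]
        total += len(audio["array"]) / audio["sampling_rate"]
        keep.append(i)
        if total >= target_seconds:
            break
    return dataset.select(keep)

fleurs_it = load_dataset("google/fleurs", "it_it", split="train")
subset_it = take_hours(fleurs_it, TARGET_SECONDS)
print(f"kept {len(subset_it)} utterances")
```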
|
|
|
### Training Procedure |
|
|
|
* The model was trained using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) with `torchrun`.
* Only the linear projector was trained; a sketch of its likely architecture follows this list.
* The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.
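
The card does not spell out the projector internals, but a concatenation-based linear projector in the style of SLAM-ASR, instantiated with the dimensions from the hyperparameter table below (`encoder_dim` 1280, `encoder_projector_ds_rate` 5, `llm_dim` 2048) and an assumed 2048-unit hidden layer, reproduces the stated 17.31M parameter count:

```python
# Sketch of a SLAM-ASR-style "concat" linear projector. Dimensions follow the
# hyperparameter table; the 2048-unit hidden layer is an assumption that
# matches the reported 17.31M parameters.
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden=2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim * ds_rate, hidden),  # 6400 -> 2048
            nn.ReLU(),
            nn.Linear(hidden, llm_dim),                # 2048 -> 2048
        )

    def forward(self, x):  # x: (batch, frames, encoder_dim)
        b, t, d = x.shape
        t = t - t % self.ds_rate  # drop frames that do not fill a full group
        # Stack ds_rate adjacent frames, downsampling the sequence 5x.
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.proj(x)  # (batch, t // ds_rate, llm_dim)

projector = LinearProjector()
print(sum(p.numel() for p in projector.parameters()))  # 17,305,600 ≈ 17.31M
```

The projected frames are fed to the frozen LLM together with its text embeddings, which is why training this module alone is sufficient to adapt the pipeline to ASR.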
|
|
|
#### Training Hyperparameters |
|
|
|
| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |
|
|
|
## Evaluation |
|
|
|
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library. Before computing the WER, ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and its `BasicTextNormalizer` for all other languages. Beam search decoding was used with a beam size of 4.
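
A minimal sketch of this scoring recipe, using the normalizers shipped with the `openai-whisper` package and the `evaluate` library; the example transcripts are placeholders, not model output:

```python
# Sketch: normalise transcripts as described above, then compute WER in percent.
import evaluate
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

wer_metric = evaluate.load("wer")

def score(predictions, references, lang):
    # EnglishTextNormalizer for English, BasicTextNormalizer for the rest.
    norm = EnglishTextNormalizer() if lang == "en" else BasicTextNormalizer()
    predictions = [norm(p) for p in predictions]
    references = [norm(r) for r in references]
    return 100 * wer_metric.compute(predictions=predictions, references=references)

print(score(["hello world"], ["Hello, world!"], lang="en"))  # 0.0
```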
|
|
|
### Results |
|
|
|
| Dataset | Language | WER (%) ↓ |
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |
|
|
|
## Acknowledgements |
|
<img src="images/eloquence_eu.png" align="center" width="30%"> |
|
This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558). |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
Please cite the associated Interspeech 2025 paper when using this model: |
|
|
|
```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```