---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---
## Model Details
### Model Description
A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speech LLM framework.
Within this framework, only the linear projector was trained, while the speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
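For orientation, the "linear" projector in the SLAM-ASR codebase is a small concat-and-project module: it stacks `encoder_projector_ds_rate = 5` consecutive encoder frames and maps them into the LLM embedding space through two linear layers with a ReLU in between. The following PyTorch sketch is illustrative (the 2048 hidden width follows the SLAM-ASR default), not the exact checkpoint code:

```python
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of SLAM-ASR's "linear" projector: concatenate k consecutive
    encoder frames, then map them into the LLM embedding space."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden=2048):
        super().__init__()
        self.k = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden)  # 6400 -> 2048
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden, llm_dim)                # 2048 -> 2048

    def forward(self, x):                  # x: (batch, frames, encoder_dim)
        b, t, d = x.size()
        x = x[:, : t - t % self.k, :]      # drop frames that don't fill a group
        x = x.reshape(b, -1, d * self.k)   # stack k frames -> (batch, t//k, d*k)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
print(f"{sum(p.numel() for p in proj.parameters()) / 1e6:.2f}M")  # 17.31M
```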
- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- **Model type:** Linear projector in a speech LLM framework
- **Supported Language(s):** English, Italian, Spanish, German, French
- **License:** CC-BY-4.0
## Uses
This model is trained for Automatic Speech Recognition (ASR).
## How to Get Started with the Model
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding via the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase; refer to the instructions there for details.
Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector, for example as sketched below.
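One way to fetch the frozen base models locally is with `huggingface_hub` (the two base-model repo IDs are the official ones linked above; the projector repo ID is a placeholder to replace with this repository's ID):

```python
from huggingface_hub import snapshot_download

# Frozen base models this projector was trained against.
encoder_path = snapshot_download("openai/whisper-large-v3-turbo")
llm_path = snapshot_download("utter-project/EuroLLM-1.7B")

# Placeholder: substitute this repository's ID to fetch the projector checkpoint.
projector_path = snapshot_download("<this-repo-id>")

# Point the SLAM-ASR fine-tuning/decoding shell scripts at these local paths.
print(encoder_path, llm_path, projector_path)
```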
## Training Details
### Training Data
The linear projector was trained with a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering 5 languages (English, Italian, Spanish, German, and French).
Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.
### Training Procedure
* The model was trained using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech), launched with `torchrun`.
* Only the linear projector was trained; the speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.
#### Training Hyperparameters
| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_batch_size | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |
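To see how these settings fit together, here is the rough speech-token budget they imply (a back-of-the-envelope sketch: Whisper processes fixed 30-second windows at a 10 ms hop, and its convolutional front end halves the frame rate before the projector downsamples by 5):

```python
# Back-of-the-envelope token budget implied by the hyperparameters above.
SECONDS = 30                      # Whisper operates on fixed 30 s windows
mel_frames = SECONDS * 100        # 10 ms hop -> 3000 mel frames (mel_size = 128 bins each)
encoder_frames = mel_frames // 2  # conv front end halves the rate -> 1500 states of dim 1280
llm_tokens = encoder_frames // 5  # encoder_projector_ds_rate = 5 -> 300 speech embeddings
print(llm_tokens)                 # 300, comfortably within context_length = 4096
```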
## Evaluation
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
Before computing the WER, ground-truth and predicted transcripts were normalized using Whisper's `EnglishTextNormalizer` for English and the `BasicTextNormalizer` for all other languages.
Beam search decoding was used with a beam size of 4.
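A minimal sketch of this scoring setup, assuming the normalizers from the `openai-whisper` package and toy transcripts for illustration:

```python
import evaluate
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

wer_metric = evaluate.load("wer")

def score(references, predictions, language="en"):
    # English uses Whisper's EnglishTextNormalizer; all other languages
    # use the BasicTextNormalizer, as described above.
    norm = EnglishTextNormalizer() if language == "en" else BasicTextNormalizer()
    refs = [norm(r) for r in references]
    hyps = [norm(p) for p in predictions]
    return 100 * wer_metric.compute(references=refs, predictions=hyps)

print(score(["hello world"], ["hello word"]))  # toy example -> 50.0
```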
### Results
| Dataset | Language | WER (%) ↓|
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |
## Acknowledgements
<img src="images/eloquence_eu.png" align="center" width="30%">
This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
## Citation
**BibTeX:**
Please cite the associated Interspeech 2025 paper when using this model:
```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```