---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---
## Model Details
### Model Description
A 17.31M parameter multilingual linear projector trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework.
Within this framework, only the linear projector was trained alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
and frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- **Model type:** Linear projector in a speechLLM framework
- **Supported Language(s):** English, Italian, Spanish, German, French
- **License:** CC-BY-4.0
## Uses
This model is trained for Automatic Speech Recognition (ASR).
## How to Get Started with the Model
This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding via the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase. Refer to the instructions there for details.
Whisper-large-v3-turbo and EuroLLM 1.7B must be downloaded before using this linear projector.
## Training Details
### Training Data
The linear projector was trained with a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering 5 languages (English, Italian, Spanish, German, and French).
Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.
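As a quick sanity check, the per-language figures above add up to the stated 500-hour total:

```python
# Data budget from the description above: 92.5 h Common Voice + 7.5 h Fleurs
# per language, across 5 languages (en, it, es, de, fr).
cv_hours, fleurs_hours, n_languages = 92.5, 7.5, 5

per_language = cv_hours + fleurs_hours        # 100.0 hours per language
total = per_language * n_languages            # 500.0 hours in total
print(f"{total:.1f} training hours")
```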
### Training Procedure
* The model was trained using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) with `torchrun`.
* Only the linear projector was trained.
* The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted with one NVIDIA Ada Lovelace L40S GPU.
#### Training Hyperparameters
| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |
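The 17.31M parameter count can be reconstructed from the table above, assuming the SLAM-ASR "linear" projector structure from the SLAM-LLM repository (two linear layers with a 2048-dim hidden layer and a ReLU in between):

```python
# Sketch: deriving the projector's parameter count from the hyperparameters
# above. The two-layer structure (linear -> ReLU -> linear, hidden width 2048)
# is the SLAM-ASR linear projector as implemented in the SLAM-LLM repo.
encoder_dim = 1280   # Whisper-large-v3-turbo output dimension
ds_rate = 5          # encoder_projector_ds_rate: 5 frames stacked together
llm_dim = 2048       # EuroLLM-1.7B hidden dimension
hidden = 2048        # projector hidden width

in_dim = encoder_dim * ds_rate               # 6400 after frame stacking
linear1 = in_dim * hidden + hidden           # weights + bias
linear2 = hidden * llm_dim + llm_dim         # weights + bias
total = linear1 + linear2
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")  # 17,305,600 (~17.31M)
```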
## Evaluation
The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
Prior to computing the WER, ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and `BasicTextNormalizer` for all other languages.
Beam search decoding was used with a beam size of 4.
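A minimal sketch of the metric itself: WER is the word-level Levenshtein edit distance divided by the number of reference words, which is what `evaluate.load("wer")` computes. The lowercasing and punctuation stripping below only stands in for the normalizers mentioned above, not their exact logic:

```python
import re

def normalize(text: str) -> str:
    """Rough basic normalization: lowercase and strip punctuation
    (a simplified stand-in for Whisper's text normalizers)."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```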
### Results
| Dataset | Language | WER (%) ↓|
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |
## Acknowledgements
<img src="images/eloquence_eu.png" align="center" width="30%">
This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
## Citation
**BibTeX:**
Please cite the associated Interspeech 2025 paper when using this model:
```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```