---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---

## Model Details

### Model Description

A 17.31M parameter multilingual linear projector trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework. Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).

- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- **Model type:** Linear projector in a speechLLM framework
- **Supported Language(s):** English, Italian, Spanish, German, French
- **License:** CC-BY-4.0

## Uses

This model is trained for Automatic Speech Recognition (ASR).

## How to Get Started with the Model

This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase; please refer to the instructions there for details. Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.

## Training Details

### Training Data

The linear projector was trained on a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering 5 languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.
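The 17.31M parameter figure can be sanity-checked from the dimensions involved. This assumes SLAM-ASR's default "linear" projector design (concatenate every 5 encoder frames, then apply two linear layers with a 2048-dim hidden layer and a ReLU in between); the hidden size and layer structure are taken from the SLAM-ASR codebase, not stated in this card:

```python
# Sanity check: reproduce the ~17.31M parameter count of the projector.
# Assumes SLAM-ASR's "linear" projector: every 5 encoder frames are
# concatenated, then passed through Linear -> ReLU -> Linear with a
# 2048-dim hidden layer (SLAM-ASR default, an assumption here).
encoder_dim = 1280   # Whisper-large-v3-turbo hidden size
ds_rate = 5          # encoder_projector_ds_rate
hidden = 2048        # projector hidden size (SLAM-ASR default)
llm_dim = 2048       # EuroLLM-1.7B hidden size

in_dim = encoder_dim * ds_rate  # 6400 features after frame concatenation
params = (in_dim * hidden + hidden) + (hidden * llm_dim + llm_dim)
print(f"{params / 1e6:.2f}M parameters")  # -> 17.31M
```

The weights dominate the count (6400×2048 + 2048×2048 ≈ 17.3M), matching the size stated above.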
### Training Procedure

* The model was trained using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) with `torchrun`.
* Only the linear projector was trained.
* The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted on one NVIDIA Ada Lovelace L40S GPU.

#### Training Hyperparameters

| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |

## Evaluation

The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library. Prior to computing the WER, the ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and `BasicTextNormalizer` for all other languages. Beam search decoding was used with a beam size of 4.
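In practice WER is computed with the `evaluate` library as described above. For reference, a minimal self-contained sketch of the metric itself (with a simplified lowercase/strip-punctuation normalizer standing in for Whisper's normalizers) looks like:

```python
import re

def normalize(text):
    # Rough stand-in for Whisper's BasicTextNormalizer: lowercase and
    # strip punctuation. The actual evaluation uses the real normalizers.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = normalize(reference), normalize(hypothesis)
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                # deletion
                         cur[j - 1] + 1,             # insertion
                         prev[j - 1] + (rw != hw))   # substitution
        prev = cur
    return prev[-1] / len(r)

# One substitution ("quik") over 4 reference words -> 0.25
print(wer("The quick, brown fox!", "the quik brown fox"))  # 0.25
```

Note that normalization happens before scoring, so casing and punctuation differences are not counted as errors.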
### Results

| Dataset | Language | WER (%) ↓ |
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |

## Acknowledgements

This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).

## Citation

**BibTeX:** Please cite the associated Interspeech 2025 paper when using this model:

```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```