---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---
## Model Details

### Model Description

A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) with the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework.
Within this framework, only the linear projector was trained: it maps the output of a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
into the embedding space of a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
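
In SLAM-ASR, the linear projector downsamples the encoder output by concatenating every `encoder_projector_ds_rate = 5` consecutive frames and then applies two linear layers. The PyTorch sketch below is illustrative rather than the released code; the dimensions come from the hyperparameter table further down, and the 2048-unit hidden layer is the SLAM-ASR default, which reproduces the stated 17.31M parameter count.

```python
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of a SLAM-ASR-style linear projector: concatenate k consecutive
    encoder frames, then map them to the LLM embedding dimension."""

    def __init__(self, encoder_dim=1280, llm_dim=2048, ds_rate=5, hidden_dim=2048):
        super().__init__()
        self.k = ds_rate
        self.linear1 = nn.Linear(encoder_dim * ds_rate, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x):                      # x: (batch, frames, encoder_dim)
        b, t, d = x.size()
        t = t - t % self.k                     # drop frames that do not fill a group of k
        x = x[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.linear2(self.relu(self.linear1(x)))

proj = LinearProjector()
print(f"{sum(p.numel() for p in proj.parameters()) / 1e6:.2f}M")  # 17.31M
```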

- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- **Model type:** Linear projector in a speechLLM framework
- **Supported Language(s):** English, Italian, Spanish, German, French
- **License:** CC-BY-4.0

## Uses

This model is intended for automatic speech recognition (ASR) in English, Italian, Spanish, German, and French.

## How to Get Started with the Model

This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding via the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase; refer to the instructions there for details.

Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
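
As a minimal sketch, both frozen components can be fetched with `huggingface_hub` (the projector repository id below is a placeholder; substitute this model's actual id):

```python
from huggingface_hub import snapshot_download

# Frozen speech encoder and LLM referenced by this card
encoder_dir = snapshot_download("openai/whisper-large-v3-turbo")
llm_dir = snapshot_download("utter-project/EuroLLM-1.7B")

# Linear projector checkpoint -- placeholder id, replace with this repository's id
projector_dir = snapshot_download("<this-repo-id>")

# Point the SLAM-ASR fine-tuning/decoding shell scripts at these local paths.
print(encoder_dir, llm_dir, projector_dir)
```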

## Training Details

### Training Data

The linear projector was trained on a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering five languages (English, Italian, Spanish, German, and French).
Specifically, the training set comprised 92.5 hours of Common Voice and 7.5 hours of Fleurs per language (100 hours per language), while the validation set comprised 47 minutes of Common Voice and 47 minutes of Fleurs per language.

### Training Procedure

* The model was trained using the code provided in the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech), launched with `torchrun`.
* Only the linear projector was trained.
* The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted on a single NVIDIA L40S (Ada Lovelace) GPU.

#### Training Hyperparameters

| Hyperparameter | Value |
| -------- | ------- |
| llm_name  | eurollm-1.7b |
| llm_dim | 2048     |
| context_length | 4096 |
| encoder_name    | whisper    |
| encoder_projector_ds_rate    | 5    |
| encoder_dim    | 1280    |
| encoder_projector    | linear    |
| input_type   | mel    |
| mel_size  | 128    |
| epochs  | 6    |
| freeze_encoder  | true    |
| freeze_llm | true    |
| warmup_steps | 1000    |
| total_steps | 100000    |
| lr | 1e-4    |
| validation_interval | 1000    |
| batch_size_training | 4    |
| val_batch_size | 4    |
| num_workers_dataloader | 2   |
| optimizer | AdamW   |
| enable_fsdp | false   |
| enable_ddp | true   |
| use_fp16 | true   |

## Evaluation

The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library.
Before computing the WER, ground-truth and predicted transcripts were normalized with Whisper's `EnglishTextNormalizer` for English and `BasicTextNormalizer` for all other languages.
Decoding used beam search with a beam size of 4.
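
A minimal sketch of this scoring pipeline, assuming the normalizers from the `openai-whisper` package and illustrative transcript lists:

```python
import evaluate
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

wer_metric = evaluate.load("wer")

def wer_percent(references, predictions, language):
    # English uses Whisper's EnglishTextNormalizer; other languages the basic one.
    normalizer = EnglishTextNormalizer() if language == "en" else BasicTextNormalizer()
    refs = [normalizer(r) for r in references]
    hyps = [normalizer(h) for h in predictions]
    return 100.0 * wer_metric.compute(references=refs, predictions=hyps)

# Illustrative example (not from the evaluation sets)
print(wer_percent(["hello world"], ["hello word"], language="en"))  # 50.0
```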

### Results

| Dataset  | Language  | WER (%) ↓|
| -------- | ------- | ------- |
| Common Voice 20.0  | English | 13.5 |
| Fleurs  | English | 5.5 |
| Common Voice 20.0  | Italian | 6.4 |
| Fleurs  | Italian | 5.8 |
| Common Voice 20.0  | Spanish | 6.0 |
| Fleurs  | Spanish | 4.3 |
| Common Voice 20.0  | German | 8.8 |
| Fleurs  | German | 10.3 |
| Common Voice 20.0  | French | 11.5 |
| Fleurs  | French | 8.1 |

## Acknowledgements
<img src="images/eloquence_eu.png" align="center" width="30%">
This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).

## Citation 

Please cite the associated Interspeech 2025 paper when using this model:

**BibTeX:**

```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```