---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
---

## Model Details

### Model Description

A 17.31M parameter multilingual linear projector trained for automatic speech recognition (ASR) using the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) speechLLM framework. Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).

- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- **Model type:** Linear projector in a speechLLM framework
- **Supported Language(s):** English, Italian, Spanish, German, French
- **License:** CC-BY-4.0

## Uses

This model is trained for Automatic Speech Recognition (ASR).

## How to Get Started with the Model

This linear projector checkpoint can be downloaded and used for further fine-tuning or decoding with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase; please refer to the instructions there for details. Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.

## Training Details

### Training Data

The linear projector was trained on a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering 5 languages (English, Italian, Spanish, German, and French). Specifically, the training set consisted of 92.5 hours of Common Voice data + 7.5 hours of Fleurs data per language, while the validation set consisted of 47 minutes of Common Voice data + 47 minutes of Fleurs data per language.
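The 17.31M parameter figure can be sanity-checked from the dimensions involved. This assumes SLAM-ASR's default "linear" projector design (concatenate every 5 encoder frames, then apply two linear layers with a 2048-dim hidden layer and a ReLU in between); the hidden size and layer structure are taken from the SLAM-ASR codebase, not stated in this card:

```python
# Sanity check: reproduce the ~17.31M parameter count of the projector.
# Assumes SLAM-ASR's "linear" projector: every 5 encoder frames are
# concatenated, then passed through Linear -> ReLU -> Linear with a
# 2048-dim hidden layer (SLAM-ASR default, an assumption here).
encoder_dim = 1280   # Whisper-large-v3-turbo hidden size
ds_rate = 5          # encoder_projector_ds_rate
hidden = 2048        # projector hidden size (SLAM-ASR default)
llm_dim = 2048       # EuroLLM-1.7B hidden size

in_dim = encoder_dim * ds_rate  # 6400 features after frame concatenation
params = (in_dim * hidden + hidden) + (hidden * llm_dim + llm_dim)
print(f"{params / 1e6:.2f}M parameters")  # -> 17.31M
```

The weights dominate the count (6400×2048 + 2048×2048 ≈ 17.3M), matching the size stated above.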
### Training Procedure

* The model was trained using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) with `torchrun`.
* Only the linear projector was trained.
* The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)) and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
* No prompt was used during training or inference.
* Training was conducted on one NVIDIA Ada Lovelace L40S GPU.

#### Training Hyperparameters

| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |

## Evaluation

The model was evaluated using the Word Error Rate (WER) metric from the `evaluate` library. Prior to computing the WER, the ground-truth and predicted transcripts were preprocessed with Whisper's `EnglishTextNormalizer` for English and `BasicTextNormalizer` for all other languages. Beam search decoding was used with a beam size of 4.
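In practice WER is computed with the `evaluate` library as described above. For reference, a minimal self-contained sketch of the metric itself (with a simplified lowercase/strip-punctuation normalizer standing in for Whisper's normalizers) looks like:

```python
import re

def normalize(text):
    # Rough stand-in for Whisper's BasicTextNormalizer: lowercase and
    # strip punctuation. The actual evaluation uses the real normalizers.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = normalize(reference), normalize(hypothesis)
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                # deletion
                         cur[j - 1] + 1,             # insertion
                         prev[j - 1] + (rw != hw))   # substitution
        prev = cur
    return prev[-1] / len(r)

# One substitution ("quik") over 4 reference words -> 0.25
print(wer("The quick, brown fox!", "the quik brown fox"))  # 0.25
```

Note that normalization happens before scoring, so casing and punctuation differences are not counted as errors.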
### Results

| Dataset | Language | WER (%) ↓ |
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |

## Acknowledgements

This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).

## Citation

**BibTeX:** Please cite the associated Interspeech 2025 paper when using this model:

```
@inproceedings{fong25_interspeech,
  title     = {{Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages}},
  author    = {Seraphina Fong and Marco Matassoni and Alessio Brutti},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2003--2007},
  doi       = {10.21437/Interspeech.2025-764},
}
```