SpeechTek/mEUltilingual_speechllm_linear_projector_v1

Automatic Speech Recognition

seraphina committed on Jun 11

Commit 7924637 (verified) · 1 Parent(s): bc9be71

Create README.md

README.md +124 -0

	@@ -0,0 +1,124 @@

+---
+language:
+- en
+- it
+- es
+- de
+- fr
+pipeline_tag: automatic-speech-recognition
+---
+## Model Details
+### Model Description
+A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework.
+Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
+and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).
+- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
+- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
+- **Model type:** Linear projector in a speechLLM framework
+- **Supported Language(s):** English, Italian, Spanish, German, French
+- **License:** [More Information Needed]
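The 17.31M figure can be reproduced from the values listed under Training Hyperparameters. The sketch below assumes the projector concatenates `encoder_projector_ds_rate` consecutive encoder frames and passes them through two linear layers with a ReLU in between, as in the concatenation-based linear projector of the SLAM-LLM codebase. The two-layer layout and the hidden width of 2048 are assumptions taken from that codebase; this card does not state them.

```python
# Parameter count of the projector under the assumed two-layer layout.
ENCODER_DIM = 1280   # encoder_dim from the hyperparameter table
DS_RATE = 5          # encoder_projector_ds_rate from the hyperparameter table
LLM_DIM = 2048       # llm_dim from the hyperparameter table
HIDDEN_DIM = 2048    # assumption: hidden width of the projector


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


first = linear_params(ENCODER_DIM * DS_RATE, HIDDEN_DIM)   # 6400 -> 2048
second = linear_params(HIDDEN_DIM, LLM_DIM)                # 2048 -> 2048
total = first + second
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# prints: 17,305,600 parameters (17.31M)
```

Under these assumptions the count matches the size quoted in the description.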
+## Uses
+This model is trained for Automatic Speech Recognition (ASR).
+## How to Get Started with the Model
+This linear projector can be used with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase. Refer to the instructions there for data preparation and decoding.
+Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector.
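One possible way to fetch the two frozen components is with `snapshot_download` from the `huggingface_hub` package, using the repository IDs linked in this card. This is a minimal sketch, not part of the SLAM-ASR instructions: the local directory names are hypothetical, and the card does not state which on-disk checkpoint format the SLAM-ASR scripts expect, so check that codebase before pointing its scripts at these folders.

```python
# Sketch: download the frozen encoder and LLM from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; the target directories are hypothetical.
REPOS = {
    "openai/whisper-large-v3-turbo": "models/whisper-large-v3-turbo",
    "utter-project/EuroLLM-1.7B": "models/EuroLLM-1.7B",
}


def download_all(repos: dict = REPOS) -> list:
    """Download each repository snapshot and return the local paths."""
    # Imported lazily so the module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in repos.items()
    ]
```

Calling `download_all()` needs network access and returns the two local paths.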
+## Training Details
+### Training Data
+The linear projector was trained with a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering 5 languages (English, Italian, Spanish, German, and French).
+Specifically, the training set consisted of 92.5 hours of Common Voice data and 7.5 hours of Fleurs data per language (100 hours per language), while the validation set consisted of 47 minutes of Common Voice data and 47 minutes of Fleurs data per language.
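As a quick consistency check, the per-language figures above add up to the stated 500 hours of training data. The snippet only restates numbers from this section; the variable names are illustrative.

```python
# Arithmetic check of the data split described in this section.
LANGUAGES = ["English", "Italian", "Spanish", "German", "French"]
TRAIN_HOURS = {"Common Voice": 92.5, "Fleurs": 7.5}   # per language
VAL_MINUTES = {"Common Voice": 47, "Fleurs": 47}      # per language

train_per_language = sum(TRAIN_HOURS.values())             # hours
train_total = train_per_language * len(LANGUAGES)          # hours
val_per_language = sum(VAL_MINUTES.values())               # minutes
val_total_hours = val_per_language * len(LANGUAGES) / 60   # hours

print(f"train: {train_per_language} h per language, {train_total} h in total")
print(f"validation: {val_per_language} min per language, {val_total_hours:.2f} h in total")
```

The validation set therefore amounts to 94 minutes per language, or roughly 7.8 hours across the five languages.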
+### Training Procedure
+The linear projector was trained with `torchrun` using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech).
+Only the linear projector was trained. The speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
+and LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen.
+Training was conducted with one NVIDIA Ada Lovelace L40S GPU.
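To illustrate what the trainable projector does to the encoder output, the sketch below tracks tensor shapes only. It assumes that `encoder_projector_ds_rate` consecutive encoder frames are concatenated before projection and that leftover frames which do not fill a full group are dropped; both are assumptions based on the SLAM-LLM concatenation projector rather than statements from this card. The function name is hypothetical.

```python
# Shape bookkeeping for the projector between the frozen encoder and the frozen LLM.
def projector_output_shape(num_frames, encoder_dim=1280, ds_rate=5, llm_dim=2048):
    """Return (shape after frame concatenation, shape fed to the LLM)."""
    usable = num_frames - num_frames % ds_rate      # assumption: drop leftover frames
    num_tokens = usable // ds_rate                  # one LLM input embedding per group
    concat_shape = (num_tokens, encoder_dim * ds_rate)
    output_shape = (num_tokens, llm_dim)
    return concat_shape, output_shape


# A Whisper encoder emits 1500 frames for a full 30-second window.
print(projector_output_shape(1500))
# prints: ((300, 6400), (300, 2048))
```

With a downsampling rate of 5, a 30-second utterance is therefore presented to the LLM as 300 embeddings of dimension `llm_dim`.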
+#### Training Hyperparameters
+| Hyperparameter | Value |
+| -------- | ------- |
+| llm_name  | eurollm-1.7b |
+| llm_dim | 2048     |
+| context_length | 4096 |
+| encoder_name    | whisper    |
+| encoder_projector_ds_rate    | 5    |
+| encoder_dim    | 1280    |
+| encoder_projector    | linear    |
+| input_type   | mel    |
+| mel_size  | 128    |
+| epochs  | 6    |
+| freeze_encoder  | true    |
+| freeze_llm | true    |
+| warmup_steps | 1000    |
+| total_steps | 100000    |
+| lr | 1e-4    |
+| validation_interval | 1000    |
+| batch_size_training | 4    |
+| val_size_training | 4    |
+| num_workers_dataloader | 2   |
+| optimizer | AdamW   |
+| enable_fsdp | false   |
+| enable_ddp | true   |
+| use_fp16 | true   |
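The table gives `lr`, `warmup_steps` and `total_steps` but not the shape of the learning-rate schedule. The sketch below assumes linear warmup to the peak rate followed by linear decay to zero at `total_steps`; this is an assumption for illustration, so consult the SLAM-ASR training code for the schedule actually used. The function name is hypothetical.

```python
# Assumed schedule: linear warmup, then linear decay to zero.
def learning_rate(step, peak_lr=1e-4, warmup_steps=1000, total_steps=100_000):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


for step in (0, 500, 1000, 50_500, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```

Under this assumption the rate peaks at 1e-4 after 1000 steps and is halved again by step 50,500.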
+## Evaluation
+### Results
+| Dataset  | Language  | WER (%) &#8595;|
+| -------- | ------- | ------- |
+| Common Voice 20.0  | English | 13.5 |
+| Fleurs  | English | 5.5 |
+| Common Voice 20.0  | Italian | 6.4 |
+| Fleurs  | Italian | 5.8 |
+| Common Voice 20.0  | Spanish | 6.0 |
+| Fleurs  | Spanish | 4.3 |
+| Common Voice 20.0  | German | 8.8 |
+| Fleurs  | German | 10.3 |
+| Common Voice 20.0  | French | 11.5 |
+| Fleurs  | French | 8.1 |
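For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The function below is a minimal sketch of that definition. It is not the scoring script used to produce the table, and the text normalization applied before scoring (casing, punctuation) is not stated in this card.

```python
# Minimal WER sketch: word-level Levenshtein distance over reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against a six-word reference.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
# prints: 33.33
```

Lower values are better, and a value above 100 is possible when the hypothesis contains many insertions.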
+## Citation [optional]
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]