---
language:
- en
- it
- es
- de
- fr
pipeline_tag: automatic-speech-recognition
---
## Model Details

### Model Description

A 17.31M-parameter multilingual linear projector trained for automatic speech recognition (ASR) using the SLAM-ASR speechLLM framework.
Within this framework, only the linear projector was trained, alongside a frozen speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
and a frozen LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)).

- **Developed by:** SpeechTek Unit at Fondazione Bruno Kessler
- **Funded by:** This work was partially funded by the European Union’s Horizon 2020 project ELOQUENCE (grant 101070558).
- **Model type:** Linear projector in a speechLLM framework
- **Supported Language(s):** English, Italian, Spanish, German, French
- **License:** [More Information Needed]
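
The projector maps frozen Whisper encoder states into the EuroLLM embedding space. The reported 17.31M parameter count is consistent with the concatenation-style linear projector used in SLAM-ASR, where every 5 consecutive 1280-dim encoder frames are concatenated and passed through two linear layers into the 2048-dim LLM space. The PyTorch sketch below is an illustration under that assumption, not code taken from the SLAM-ASR repository; the class and argument names are made up:

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Illustrative sketch of a SLAM-ASR-style linear projector.

    With encoder_dim=1280, ds_rate=5, llm_dim=2048 and a 2048-dim hidden layer,
    this module has (6400*2048 + 2048) + (2048*2048 + 2048) ~= 17.31M parameters,
    matching the size reported above.
    """

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048,
                 ds_rate: int = 5, hidden_dim: int = 2048):
        super().__init__()
        self.ds_rate = ds_rate
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim * ds_rate, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, encoder_dim) produced by the frozen speech encoder
        b, t, d = x.shape
        t = t - (t % self.ds_rate)                  # drop trailing frames
        x = x[:, :t, :].reshape(b, t // self.ds_rate, d * self.ds_rate)
        return self.proj(x)                         # (batch, t // ds_rate, llm_dim)
```

The downsampling rate of 5 and the 1280/2048 dimensions correspond to the `encoder_projector_ds_rate`, `encoder_dim`, and `llm_dim` entries in the hyperparameter table below.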

## Uses

This model is trained for Automatic Speech Recognition (ASR).

## How to Get Started with the Model

This linear projector can be used with the shell scripts provided in the [SLAM-ASR](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech) codebase; refer to the instructions there for data preparation and decoding.

Whisper-large-v3-turbo and EuroLLM-1.7B must be downloaded before using this linear projector, for example as shown below.
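
One way to fetch the two frozen models locally is via the `huggingface_hub` client; the `local_dir` paths below are placeholders, so point the SLAM-ASR scripts at whatever directories you actually use:

```python
from huggingface_hub import snapshot_download

# Download the frozen speech encoder and LLM used with this projector.
# The local_dir paths are illustrative; adjust them to your setup.
snapshot_download("openai/whisper-large-v3-turbo", local_dir="models/whisper-large-v3-turbo")
snapshot_download("utter-project/EuroLLM-1.7B", local_dir="models/EuroLLM-1.7B")
```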

## Training Details

### Training Data

The linear projector was trained on a total of 500 hours of data from [Common Voice 20.0](https://commonvoice.mozilla.org/) and [Fleurs](https://huggingface.co/datasets/google/fleurs), covering 5 languages (English, Italian, Spanish, German, and French).
Specifically, the training set consisted of 92.5 hours of Common Voice data plus 7.5 hours of Fleurs data per language (100 hours per language, 500 hours in total), while the validation set consisted of 47 minutes of Common Voice data plus 47 minutes of Fleurs data per language.

### Training Procedure

The linear projector was trained with `torchrun`, using the codebase provided by the official [SLAM-ASR GitHub repository](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/asr_librispeech).
Only the linear projector was trained; the speech encoder ([Whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo))
and the LLM ([EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)) were kept frozen, as sketched below.

Training was conducted on one NVIDIA Ada Lovelace L40S GPU.
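
Here, "frozen" has the usual PyTorch meaning: the encoder and LLM parameters receive no gradient updates, and only the projector's parameters are handed to the optimizer. The snippet below is a minimal sketch of that setup, not the SLAM-ASR training loop itself; the model classes and the plain `nn.Sequential` stand-in for the projector are illustrative, while the AdamW optimizer and `lr = 1e-4` come from the hyperparameter table below:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel

# Frozen components: no gradients, kept in eval mode.
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3-turbo").encoder
llm = AutoModelForCausalLM.from_pretrained("utter-project/EuroLLM-1.7B")
for frozen in (encoder, llm):
    frozen.requires_grad_(False)
    frozen.eval()

# The only trainable part: the linear projector (see the sketch in
# "Model Description"; shown here as a plain Sequential for brevity).
projector = nn.Sequential(
    nn.Linear(1280 * 5, 2048),  # 5 concatenated 1280-dim encoder frames
    nn.ReLU(),
    nn.Linear(2048, 2048),      # into the EuroLLM embedding space
)

# Optimizer settings from the hyperparameter table below.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```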

#### Training Hyperparameters

| Hyperparameter | Value |
| -------- | ------- |
| llm_name | eurollm-1.7b |
| llm_dim | 2048 |
| context_length | 4096 |
| encoder_name | whisper |
| encoder_projector_ds_rate | 5 |
| encoder_dim | 1280 |
| encoder_projector | linear |
| input_type | mel |
| mel_size | 128 |
| epochs | 6 |
| freeze_encoder | true |
| freeze_llm | true |
| warmup_steps | 1000 |
| total_steps | 100000 |
| lr | 1e-4 |
| validation_interval | 1000 |
| batch_size_training | 4 |
| val_size_training | 4 |
| num_workers_dataloader | 2 |
| optimizer | AdamW |
| enable_fsdp | false |
| enable_ddp | true |
| use_fp16 | true |

## Evaluation

### Results

Word error rate (WER) by dataset and language:

| Dataset | Language | WER (%) ↓ |
| -------- | ------- | ------- |
| Common Voice 20.0 | English | 13.5 |
| Fleurs | English | 5.5 |
| Common Voice 20.0 | Italian | 6.4 |
| Fleurs | Italian | 5.8 |
| Common Voice 20.0 | Spanish | 6.0 |
| Fleurs | Spanish | 4.3 |
| Common Voice 20.0 | German | 8.8 |
| Fleurs | German | 10.3 |
| Common Voice 20.0 | French | 11.5 |
| Fleurs | French | 8.1 |

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]