---
license: cc-by-4.0
language:
- en
- it
datasets:
- FBK-MT/mosel
- facebook/covost2
- openslr/librispeech_asr
- facebook/voxpopuli
metrics:
- comet
- wer
tags:
- speech
- speech recognition
- speech translation
- ASR
- ST
---

# FAMA-small
<div>
<img src="FAMA.png" width="100%" alt="FAMA" />
</div>

## Table of Contents
1. [Overview](#overview)
2. [Usage](#usage)
3. [Results](#results)
4. [License](#license)
5. [Citation](#citation)

## Overview

FAMA is the first family of large-scale open-science speech foundation models (SFMs) for English and
Italian, trained on [over 150k hours of exclusively open-source (OS)-compliant speech data](https://huggingface.co/datasets/FBK-MT/fama-data).

FAMA models achieve [remarkable results](#results), with average ASR and ST improvements across languages
compared to OWSM, and are competitive in ASR performance with the Whisper model family while being up to 8 times faster.

All the artifacts used to build the FAMA models, including the codebase, datasets, and the models
themselves, are [released under OS-compliant licenses](#license), promoting a more
responsible creation of models in our community.

FAMA is available in 2 sizes, each with an additional ASR-only variant:

- [FAMA-small](https://huggingface.co/FBK-MT/fama-small) - 475 million parameters
- [FAMA-medium](https://huggingface.co/FBK-MT/fama-medium) - 878 million parameters
- [FAMA-small-asr](https://huggingface.co/FBK-MT/fama-small-asr) - 475 million parameters
- [FAMA-medium-asr](https://huggingface.co/FBK-MT/fama-medium-asr) - 878 million parameters

For more information about FAMA, please check our [blog post](https://huggingface.co/blog/FAMA/release) and the [arXiv](https://arxiv.org/) preprint.

## Usage

FAMA models are supported in Hugging Face 🤗 Transformers.
To run the model, first install the Transformers and Datasets libraries:

```sh
pip install transformers==4.48.1 datasets
```

To perform a single inference on a sample audio file using the
[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class, run:

```python
import torch
from transformers import AutoProcessor, pipeline
from datasets import load_dataset

model_id = "FBK-MT/fama-small"
processor = AutoProcessor.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tgt_lang = "en"

# Force the model to start with the language tag
lang_tag = "<lang:{}>".format(tgt_lang)
lang_tag_id = processor.tokenizer.convert_tokens_to_ids(lang_tag)

generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device=device,
    return_timestamps=False,
    generate_kwargs=generate_kwargs,
)

dataset = load_dataset("distil-whisper/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

Where `tgt_lang` is the target language (either `en` or `it`); the source language does not need to be specified.
To run the inference on a local audio file `audio.wav`, call the pipeline with:

```python
result = pipe("audio.wav")
```
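
To obtain Italian output instead, only the forced language tag changes. As a minimal sketch, re-using `processor`, `pipe`, and `generate_kwargs` from the example above (and assuming the pipeline accepts `generate_kwargs` at call time):

```python
# Switch the target language to Italian by rebuilding the forced
# language tag that the decoder is constrained to start with.
tgt_lang = "it"
lang_tag_id = processor.tokenizer.convert_tokens_to_ids("<lang:{}>".format(tgt_lang))
generate_kwargs["forced_bos_token_id"] = lang_tag_id

result = pipe("audio.wav", generate_kwargs=generate_kwargs)
print(result["text"])
```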

To perform batch inference with size `batch_size`, run:

```python
result = pipe(["audio_1.wav", "audio_2.wav"], batch_size=2)
```

For inference, we suggest converting the audio files to WAV format with a 16 kHz sampling rate and a single channel.
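
One possible way to do this conversion in Python, assuming `torchaudio` is installed (any resampling tool, e.g. ffmpeg or SoX, works equally well; the input file name is a placeholder):

```python
import torchaudio

# Load the source audio, downmix to a single channel, resample to
# 16 kHz, and save the result as a WAV file for the pipeline.
waveform, sample_rate = torchaudio.load("input.mp3")
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
torchaudio.save("audio.wav", waveform, 16000)
```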

## Results

We evaluate FAMA on ASR and ST tasks using popular open-source datasets such as CommonVoice, Multilingual LibriSpeech (MLS), VoxPopuli, CoVoST2, and FLEURS.
The metrics used are WER (↓) for ASR and COMET (↑) for ST.
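
WER scores like those reported below can be computed with the Hugging Face `evaluate` library; a minimal sketch with a hypothetical prediction/reference pair:

```python
import evaluate

# WER = (substitutions + deletions + insertions) / reference words
wer_metric = evaluate.load("wer")
score = wer_metric.compute(
    predictions=["transcript produced by the model"],
    references=["reference transcript of the audio"],
)
print(f"WER: {score:.3f}")
```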

We also benchmark FAMA against the Whisper and SeamlessM4T models in terms of computational time and maximum batch size supported on Hugging Face. The metric used is the inverse real time factor (xRTF), i.e., the ratio between the duration of the processed audio and the processing time, so higher is better.
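
A minimal sketch of how xRTF can be measured, re-using `pipe` from the Usage section on a hypothetical `audio.wav` of known duration:

```python
import time

audio_duration = 60.0  # length of audio.wav in seconds (hypothetical)

start = time.perf_counter()
pipe("audio.wav")
elapsed = time.perf_counter() - start

print(f"xRTF: {audio_duration / elapsed:.1f}")  # higher means faster
```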

**Key highlights:**
- FAMA achieves up to 4.2 WER and 0.152 COMET improvement on average across languages compared to OWSM v3.1
- FAMA is up to 8 times faster than Whisper large-v3 while achieving comparable ASR performance

### Automatic Speech Recognition (ASR)

| ***Model/Dataset WER (↓)*** | **CommonVoice**-*en* | **CommonVoice**-*it* | **MLS**-*en* | **MLS**-*it* | **VoxPopuli**-*en* | **VoxPopuli**-*it* | **AVG**-*en* | **AVG**-*it* |
|-----------------------------|---------|---------|---------|---------|---------|----------|---------|----------|
| Whisper *medium*            | 14.5    | 10.4    | 14.2    | 15.9    | 8.1     | 26.8     | 12.3    | 17.7     |
| Whisper *large-v3*          | 11.2    | 6.5     | **5.0** | 8.8     | 7.1     | 18.8     | 7.8     | 11.4     |
| OWSM v3.1 *medium*          | 11.9    | 12.5    | 6.6     | 19.3    | 8.4     | 24.0     | 9.0     | 18.6     |
| SeamlessM4T *medium*        | 10.7    | 7.8     | 8.8     | 11.3    | 10.2    | 18.2     | 9.9     | 12.4     |
| SeamlessM4T *v2-large*      | **7.7** | **5.0** | 6.4     | **8.5** | **6.9** | 16.6     | **7.0** | **10.0** |
| FAMA-ASR *small*            | 13.8    | 8.9     | 5.8     | 12.6    | 7.2     | 15.7     | 8.9     | 12.4     |
| FAMA-ASR *medium*           | 11.7    | 7.1     | 5.1     | 12.2    | 7.0     | 15.9     | 7.9     | 11.7     |
| FAMA *small*                | 13.7    | 8.6     | 5.8     | 12.8    | 7.3     | **15.6** | 8.9     | 12.3     |
| FAMA *medium*               | 11.5    | 7.0     | 5.2     | 13.9    | 7.2     | 15.9     | 8.0     | 12.3     |

### Speech Translation (ST)

| ***Model/Dataset COMET (↑)*** | **CoVoST2**-*it→en* | **FLEURS**-*en→it* |
|-------------------------------|---------------------|--------------------|
| Whisper *medium*              | 0.801               | -                  |
| Whisper *large-v3*            | 0.825               | -                  |
| OWSM v3.1 *medium*            | 0.636               | 0.337              |
| SeamlessM4T *medium*          | 0.831               | 0.820              |
| SeamlessM4T *v2-large*        | **0.852**           | **0.855**          |
| FAMA *small*                  | 0.774               | 0.807              |
| FAMA *medium*                 | 0.787               | 0.821              |

### Computational Time and Maximum Batch Size

| ***Model*** | ***Batch Size*** | ***xRTF en (↑)*** | ***xRTF it (↑)*** | ***xRTF AVG (↑)*** |
|------------------------|----|----------|----------|----------|
| Whisper *medium*       | 8  | 13.3     | 10.9     | 12.1     |
| Whisper *large-v3*     | 4  | 7.9      | 6.5      | 7.2      |
| SeamlessM4T *medium*   | 2  | 28.5     | 26.2     | 27.4     |
| SeamlessM4T *v2-large* | 2  | 13.7     | 13.3     | 13.5     |
| FAMA *small*           | 16 | **57.4** | **56.0** | **56.7** |
| FAMA *medium*          | 8  | 39.5     | 41.2     | 40.4     |

## License

We release the FAMA model weights and training data under the CC-BY 4.0 license.
The training data can be found in [FAMA Training Data](https://huggingface.co/datasets/FBK-MT/fama-data).
The [original FBK-fairseq codebase](https://github.com/hlt-mt/FBK-fairseq) used to train the model is released under the Apache 2.0 license.

## Citation

If you use FAMA in your work, please cite:

```bibtex
@misc{papi2025fama,
      title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian},
      author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
      year={2025}
}
```