---
license: cc-by-4.0
language:
- en
- it
datasets:
- FBK-MT/mosel
- facebook/covost2
- openslr/librispeech_asr
- facebook/voxpopuli
metrics:
- comet
- wer
tags:
- speech
- speech recognition
- speech translation
- ASR
- ST
---

# FAMA-small
<div>
<img src="FAMA.png" width="100%" alt="FAMA" />
</div>

## Table of Contents
1. [Overview](#overview)
2. [Usage](#usage)
3. [Results](#results)
4. [License](#license)
5. [Citation](#citation)

## Overview

FAMA is the first family of large-scale open-science speech foundation models (SFMs) for English and
Italian, trained on [over 150k hours of exclusively open-source (OS)-compliant speech data](https://huggingface.co/datasets/FBK-MT/fama-data).

FAMA models achieve [remarkable results](#results), with average ASR and ST improvements across languages
compared to OWSM, and are competitive in ASR performance with the Whisper model family while being up to 8 times faster.

All the artifacts used to build the FAMA models, including the codebase, datasets, and the models
themselves, are [released under OS-compliant licenses](#license), promoting a more
responsible creation of models in our community.

FAMA is available in 2 sizes, each with an additional ASR-only variant:

- [FAMA-small](https://huggingface.co/FBK-MT/fama-small) - 475 million parameters
- [FAMA-medium](https://huggingface.co/FBK-MT/fama-medium) - 878 million parameters
- [FAMA-small-asr](https://huggingface.co/FBK-MT/fama-small-asr) - 475 million parameters
- [FAMA-medium-asr](https://huggingface.co/FBK-MT/fama-medium-asr) - 878 million parameters

For more information about FAMA, please check our [blog post](https://huggingface.co/blog/FAMA/release) and the [arXiv](https://arxiv.org/) preprint.

## Usage

FAMA models are supported in Hugging Face 🤗 Transformers.
To run the model, first install the Transformers and Datasets libraries:

```sh
pip install transformers==4.48.1 datasets
```

To perform a single inference on a sample audio file using the
[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class, run:

```python
import torch
from transformers import AutoProcessor, pipeline
from datasets import load_dataset

model_id = "FBK-MT/fama-small"
processor = AutoProcessor.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tgt_lang = "en"

# Force the model to start with the language tag
lang_tag = "<lang:{}>".format(tgt_lang)
lang_tag_id = processor.tokenizer.convert_tokens_to_ids(lang_tag)

generate_kwargs = {"num_beams": 5, "no_repeat_ngram_size": 5, "forced_bos_token_id": lang_tag_id}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device=device,
    return_timestamps=False,
    generate_kwargs=generate_kwargs,
)

dataset = load_dataset("distil-whisper/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

Where `tgt_lang` is the target language (either `en` or `it`); the source language does not need to be specified.
To run the inference on a local audio file `audio.wav`, call the pipeline with:

```python
result = pipe("audio.wav")
```
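
To obtain Italian output instead, only the forced language tag changes. As a minimal sketch, re-using `processor`, `pipe`, and `generate_kwargs` from the example above (and assuming the pipeline accepts `generate_kwargs` at call time):

```python
# Switch the target language to Italian by rebuilding the forced
# language tag that the decoder is constrained to start with.
tgt_lang = "it"
lang_tag_id = processor.tokenizer.convert_tokens_to_ids("<lang:{}>".format(tgt_lang))
generate_kwargs["forced_bos_token_id"] = lang_tag_id

result = pipe("audio.wav", generate_kwargs=generate_kwargs)
print(result["text"])
```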

To perform batch inference with size `batch_size`, run:

```python
result = pipe(["audio_1.wav", "audio_2.wav"], batch_size=2)
```

For inference, we suggest converting the audio files to WAV format with a 16 kHz sampling rate and a single channel.
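
One possible way to do this conversion in Python, assuming `torchaudio` is installed (any resampling tool, e.g. ffmpeg or SoX, works equally well; the input file name is a placeholder):

```python
import torchaudio

# Load the source audio, downmix to a single channel, resample to
# 16 kHz, and save the result as a WAV file for the pipeline.
waveform, sample_rate = torchaudio.load("input.mp3")
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
torchaudio.save("audio.wav", waveform, 16000)
```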

## Results

We evaluate FAMA on ASR and ST tasks using popular open-source datasets such as CommonVoice, Multilingual LibriSpeech (MLS), VoxPopuli, CoVoST2, and FLEURS.
The metrics used are WER (↓) for ASR and COMET (↑) for ST.
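
WER scores like those reported below can be computed with the Hugging Face `evaluate` library; a minimal sketch with a hypothetical prediction/reference pair:

```python
import evaluate

# WER = (substitutions + deletions + insertions) / reference words
wer_metric = evaluate.load("wer")
score = wer_metric.compute(
    predictions=["transcript produced by the model"],
    references=["reference transcript of the audio"],
)
print(f"WER: {score:.3f}")
```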

We also benchmark FAMA against the Whisper and SeamlessM4T models in terms of computational time and maximum batch size supported on Hugging Face. The metric used is the inverse real time factor (xRTF), i.e., the ratio between the duration of the processed audio and the processing time, so higher is better.
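
A minimal sketch of how xRTF can be measured, re-using `pipe` from the Usage section on a hypothetical `audio.wav` of known duration:

```python
import time

audio_duration = 60.0  # length of audio.wav in seconds (hypothetical)

start = time.perf_counter()
pipe("audio.wav")
elapsed = time.perf_counter() - start

print(f"xRTF: {audio_duration / elapsed:.1f}")  # higher means faster
```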

**Key highlights:**
- FAMA achieves up to 4.2 WER and 0.152 COMET improvement on average across languages compared to OWSM v3.1
- FAMA is up to 8 times faster than Whisper large-v3 while achieving comparable ASR performance

### Automatic Speech Recognition (ASR)

| ***Model/Dataset WER (↓)*** | **CommonVoice**-*en* | **CommonVoice**-*it* | **MLS**-*en* | **MLS**-*it* | **VoxPopuli**-*en* | **VoxPopuli**-*it* | **AVG**-*en* | **AVG**-*it* |
|-----------------------------|---------|---------|---------|---------|---------|----------|---------|----------|
| Whisper *medium*            | 14.5    | 10.4    | 14.2    | 15.9    | 8.1     | 26.8     | 12.3    | 17.7     |
| Whisper *large-v3*          | 11.2    | 6.5     | **5.0** | 8.8     | 7.1     | 18.8     | 7.8     | 11.4     |
| OWSM v3.1 *medium*          | 11.9    | 12.5    | 6.6     | 19.3    | 8.4     | 24.0     | 9.0     | 18.6     |
| SeamlessM4T *medium*        | 10.7    | 7.8     | 8.8     | 11.3    | 10.2    | 18.2     | 9.9     | 12.4     |
| SeamlessM4T *v2-large*      | **7.7** | **5.0** | 6.4     | **8.5** | **6.9** | 16.6     | **7.0** | **10.0** |
| FAMA-ASR *small*            | 13.8    | 8.9     | 5.8     | 12.6    | 7.2     | 15.7     | 8.9     | 12.4     |
| FAMA-ASR *medium*           | 11.7    | 7.1     | 5.1     | 12.2    | 7.0     | 15.9     | 7.9     | 11.7     |
| FAMA *small*                | 13.7    | 8.6     | 5.8     | 12.8    | 7.3     | **15.6** | 8.9     | 12.3     |
| FAMA *medium*               | 11.5    | 7.0     | 5.2     | 13.9    | 7.2     | 15.9     | 8.0     | 12.3     |

### Speech Translation (ST)

| ***Model/Dataset COMET (↑)*** | **CoVoST2**-*it→en* | **FLEURS**-*en→it* |
|-------------------------------|---------------------|--------------------|
| Whisper *medium*              | 0.801               | -                  |
| Whisper *large-v3*            | 0.825               | -                  |
| OWSM v3.1 *medium*            | 0.636               | 0.337              |
| SeamlessM4T *medium*          | 0.831               | 0.820              |
| SeamlessM4T *v2-large*        | **0.852**           | **0.855**          |
| FAMA *small*                  | 0.774               | 0.807              |
| FAMA *medium*                 | 0.787               | 0.821              |

### Computational Time and Maximum Batch Size

| ***Model*** | ***Batch Size*** | ***xRTF en (↑)*** | ***xRTF it (↑)*** | ***xRTF AVG (↑)*** |
|------------------------|----|----------|----------|----------|
| Whisper *medium*       | 8  | 13.3     | 10.9     | 12.1     |
| Whisper *large-v3*     | 4  | 7.9      | 6.5      | 7.2      |
| SeamlessM4T *medium*   | 2  | 28.5     | 26.2     | 27.4     |
| SeamlessM4T *v2-large* | 2  | 13.7     | 13.3     | 13.5     |
| FAMA *small*           | 16 | **57.4** | **56.0** | **56.7** |
| FAMA *medium*          | 8  | 39.5     | 41.2     | 40.4     |

## License

We release the FAMA model weights and training data under the CC-BY 4.0 license.
The training data can be found in [FAMA Training Data](https://huggingface.co/datasets/FBK-MT/fama-data).
The [original FBK-fairseq codebase](https://github.com/hlt-mt/FBK-fairseq) used to train the model is released under the Apache 2.0 license.

## Citation

If you use FAMA in your work, please cite:

```bibtex
@misc{papi2025fama,
      title={FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian},
      author={Sara Papi and Marco Gaido and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
      year={2025}
}
```