eustlb (HF Staff) committed 4c59fff (verified) · 1 parent: 35e79e4

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,151 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ - fr
+ library_name: moshi
+ tags:
+ - audio
+ - automatic-speech-recognition
+ ---
+ # Model Card for Kyutai STT
+
+ This repository packages the model for use with [Transformers](https://github.com/huggingface/transformers) 🤗
+
+ Install Transformers from source:
+ ```bash
+ pip install git+https://github.com/huggingface/transformers
+ ```
+
+ Inference:
+ ```python
+ import torch
+ from datasets import load_dataset, Audio
+ from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
+
+ # 1. load the model and the processor
+ torch_device = "cuda" if torch.cuda.is_available() else "cpu"
+ model_id = "kyutai/stt-2.6b-en_fr-trfs"
+
+ processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
+ model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")
+
+ # 2. load audio samples
+ ds = load_dataset(
+     "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
+ )
+ ds = ds.cast_column("audio", Audio(sampling_rate=24000))
+
+ # 3. prepare the model inputs
+ inputs = processor(
+     ds[0]["audio"]["array"],
+ )
+ inputs = inputs.to(torch_device)
+
+ # 4. infer the model
+ output_tokens = model.generate(**inputs)
+
+ # 5. decode the generated tokens
+ print(processor.batch_decode(output_tokens, skip_special_tokens=True))
+ ```
+
+ Batched inference:
+ ```python
+ import torch
+ from datasets import load_dataset, Audio
+ from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
+
+ # 1. load the model and the processor
+ torch_device = "cuda" if torch.cuda.is_available() else "cpu"
+ model_id = "kyutai/stt-2.6b-en_fr-trfs"
+
+ processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
+ model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")
+
+ # 2. load audio samples
+ ds = load_dataset(
+     "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
+ )
+ ds = ds.cast_column("audio", Audio(sampling_rate=24000))
+
+ # 3. prepare the model inputs
+ audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
+ inputs = processor(audio_arrays, return_tensors="pt", padding=True)
+ inputs = inputs.to(torch_device)
+
+ # 4. infer the model
+ output_tokens = model.generate(**inputs)
+
+ # 5. decode the generated tokens
+ decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
+ for output in decoded_outputs:
+     print(output)
+ ```
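Both snippets rely on `datasets` to deliver audio already resampled to 24 kHz, the rate the feature extractor expects (see `preprocessor_config.json` below). For local files the resampling has to be done explicitly; the following is a minimal sketch using `torchaudio`, not part of the model card itself, and the file path is a placeholder:

```python
import torch
import torchaudio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en_fr-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
    model_id, device_map=torch_device, torch_dtype="auto"
)

# load a local recording ("speech.wav" is a placeholder) and bring it to 24 kHz mono
waveform, sampling_rate = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, orig_freq=sampling_rate, new_freq=24000)
waveform = waveform.mean(dim=0)  # downmix to a single channel

inputs = processor([waveform.numpy()], return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

output_tokens = model.generate(**inputs)
print(processor.batch_decode(output_tokens, skip_special_tokens=True)[0])
```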
+
+ See also the [project page](https://kyutai.org/next/stt)
+ and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
+
+ This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR).
+ Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript,
+ our model starts to output the transcript as soon as a few seconds of audio become available.
+
+ ## Model Details
+
+ The model architecture is a Transformer that consumes audio tokenized by Mimi (see [the Moshi paper](https://arxiv.org/abs/2410.00037)) and outputs text tokens.
+ The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
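To make those figures concrete, here is a back-of-the-envelope sketch (an illustration added here, not a statement from the card): at 12.5 Hz with 32 codebooks per frame, one second of audio costs 400 audio tokens, and a 30-second clip spans 375 frames, which is also the `sliding_window` / `max_position_embeddings` value in `config.json` below (that correspondence is an inference, not something the card states).

```python
# Back-of-the-envelope token bookkeeping implied by the figures above (illustrative only).
frame_rate_hz = 12.5        # audio frames (and text-stream steps) per second
codebooks_per_frame = 32    # audio tokens per frame ("num_codebooks" in config.json)

audio_tokens_per_second = frame_rate_hz * codebooks_per_frame  # 400.0

clip_seconds = 30
print(f"{clip_seconds} s of audio -> {int(clip_seconds * frame_rate_hz)} frames, "
      f"{int(clip_seconds * audio_tokens_per_second)} audio tokens")
# 30 s of audio -> 375 frames, 12000 audio tokens
```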
+
+ We release two models:
+ - `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
+ - `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.
+
+ ## Model Description
+
+ Kyutai STT is a decoder-only model for streaming speech-to-text.
+ It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the text stream based on the speech stream.
+ The text stream is shifted with respect to the audio stream so that the model can predict text tokens based on the input audio.
+
+ * Developed by: Kyutai
+ * Model type: Streaming Speech-to-Text transcription.
+ * Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`
+ * License: Model weights are licensed under CC-BY 4.0
+ * Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/)
+
+ ## Uses
+
+ ### Direct Use
+
+ The model can be used for streaming speech-to-text.
+ It is robust to noisy conditions and was found to perform well on audio up to 2 hours long with no additional changes.
+ The model produces transcripts with capitalization and punctuation.
+ The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset.
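A minimal sketch of that bookkeeping, assuming the 12.5 Hz frame rate given in Model Details; the helper below is illustrative and not an API exposed by the model:

```python
# Illustrative helper, not part of the moshi or transformers API:
# map the frame at which a text token was emitted back to a position in the audio.
FRAME_RATE_HZ = 12.5

def token_timestamp_seconds(frame_index: int, text_stream_delay_s: float) -> float:
    """text_stream_delay_s is 0.5 for kyutai/stt-1b-en_fr and 2.5 for kyutai/stt-2.6b-en."""
    frame_offset_s = frame_index / FRAME_RATE_HZ
    return frame_offset_s - text_stream_delay_s

# a token emitted at frame 50 by the 2.6B model refers to audio around t = 1.5 s
print(token_timestamp_seconds(50, 2.5))
```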
+
+ ## How to Get Started with the Model
+
+ See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
+
+ ## Training Details
+
+ ### Training Data
+
+ Pretraining stage: For both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content.
+ For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).
+
+ For `kyutai/stt-2.6b-en`:
+
+ - Finetuning stage: We then finetune the model on a collection of public datasets with ground-truth transcripts. This dataset contains 24,000 hours of audio.
+
+ - Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio. The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1,000 hours), (b) synthesizing dialogs (22,000 hours).
+
+ For `kyutai/stt-1b-en_fr`:
+
+ - Finetuning stage: We finetune on the Fisher dataset of 2,000 hours of English audio, plus proprietary data (1,000 hours in English, 600 hours in French).
+
+ ### Compute Infrastructure
+
+ Pretraining and finetuning were done with 48 and 16 Nvidia H100 GPUs, respectively.
+
+ ## Model Card Authors
+
+ Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez
config.json ADDED
@@ -0,0 +1,77 @@
+ {
+   "architectures": [
+     "KyutaiSpeechToTextForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "audio_bos_token_id": 2048,
+   "audio_pad_token_id": 69569,
+   "bos_token_id": 48000,
+   "codebook_vocab_size": 2049,
+   "codec_config": {
+     "_frame_rate": null,
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "audio_channels": 1,
+     "codebook_dim": 256,
+     "codebook_size": 2048,
+     "compress": 2,
+     "dilation_growth_rate": 2,
+     "head_dim": 64,
+     "hidden_act": "gelu",
+     "hidden_size": 512,
+     "initializer_range": 0.02,
+     "intermediate_size": 2048,
+     "kernel_size": 7,
+     "last_kernel_size": 3,
+     "layer_scale_initial_scale": 0.01,
+     "max_position_embeddings": 8000,
+     "model_type": "mimi",
+     "norm_eps": 1e-05,
+     "num_attention_heads": 8,
+     "num_filters": 64,
+     "num_hidden_layers": 8,
+     "num_key_value_heads": 8,
+     "num_quantizers": 32,
+     "num_residual_layers": 1,
+     "num_semantic_quantizers": 1,
+     "pad_mode": "constant",
+     "residual_kernel_size": 3,
+     "rope_theta": 10000.0,
+     "sampling_rate": 24000,
+     "sliding_window": 250,
+     "trim_right_ratio": 1.0,
+     "upsample_groups": 512,
+     "upsampling_ratios": [
+       8,
+       6,
+       5,
+       4
+     ],
+     "use_cache": false,
+     "use_causal_conv": true,
+     "use_conv_shortcut": false,
+     "use_streaming": false,
+     "vector_quantization_hidden_dimension": 256
+   },
+   "ffn_dim": 11264,
+   "frame_size": 1920,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 375,
+   "model_type": "kyutai_speech_to_text",
+   "num_attention_heads": 16,
+   "num_codebooks": 32,
+   "num_hidden_layers": 16,
+   "num_key_value_heads": 16,
+   "pad_token_id": 3,
+   "rms_norm_eps": 1e-08,
+   "rope_theta": 100000.0,
+   "sliding_window": 375,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.53.0.dev0",
+   "use_cache": true,
+   "vocab_size": 8001
+ }
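To read these values programmatically rather than from the raw JSON, something along these lines should work. This is a sketch: the attribute names mirror the keys above, and it assumes `AutoConfig` resolves the `kyutai_speech_to_text` model type in your Transformers install.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("kyutai/stt-2.6b-en_fr-trfs")

print(config.model_type)                  # kyutai_speech_to_text
print(config.num_codebooks)               # 32 audio codebooks per frame
print(config.sliding_window)              # 375-position attention window
print(config.codec_config.sampling_rate)  # 24000 Hz Mimi codec input rate
```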
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "audio_window_size": 1,
+   "bos_token_id": 48000,
+   "cache_implementation": "sliding_window",
+   "codec_cache_implementation": "sliding_window",
+   "codec_use_cache": true,
+   "pad_token_id": 3,
+   "transformers_version": "4.53.0.dev0"
+ }
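These defaults are picked up automatically when calling `model.generate()`. To inspect or override them explicitly, a sketch using the standard `GenerationConfig` API (an assumption of this note, not something stated in the repo):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("kyutai/stt-2.6b-en_fr-trfs")
print(gen_config.cache_implementation)  # "sliding_window"
print(gen_config.pad_token_id)          # 3
```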
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:59a6da960020ea4e1436118c63d7cd73f013e0c62466de12d7ad4b64454fd035
+ size 2697201444
preprocessor_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "audio_delay_seconds": 0.5,
+   "audio_silence_prefix_seconds": 0.0,
+   "chunk_length_s": null,
+   "feature_extractor_type": "KyutaiSpeechToTextFeatureExtractor",
+   "feature_size": 1,
+   "overlap": null,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "processor_class": "KyutaiSpeechToTextProcessor",
+   "return_attention_mask": true,
+   "sampling_rate": 24000
+ }
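The processor applies these settings when preparing audio: input must be (re)sampled to 24 kHz, and the delay/silence values presumably relate to the text-stream offset discussed in the model card (that reading is an inference, not stated in the file). A quick way to check what the installed processor will enforce, sketched under the assumption that the feature extractor exposes its config entries as attributes:

```python
from transformers import KyutaiSpeechToTextProcessor

processor = KyutaiSpeechToTextProcessor.from_pretrained("kyutai/stt-2.6b-en_fr-trfs")
fe = processor.feature_extractor

print(fe.sampling_rate)        # 24000
print(fe.audio_delay_seconds)  # 0.5
```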
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token_id": null,
+   "chat_template": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token_id": null,
+   "extra_special_tokens": {},
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token_id": null,
+   "processor_class": "KyutaiSpeechToTextProcessor",
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": "<unk>"
+ }