Automatic Speech Recognition
Transformers
Safetensors
English
Japanese
whisper
audio
hf-asr-leaderboard
asahi417 committed
Commit b9721ec · verified · 1 Parent(s): db8843c

Update README.md

Files changed (1)
  1. README.md +45 -169
README.md CHANGED
@@ -32,10 +32,18 @@ and [Multilingual LibriSpeech](https://huggingface.co/datasets/japanese-asr/en_a
32
  Kotoba-whisper-bilingual's loss objective consists of cross-entropy on both the ASR and translation tasks, while the KL divergence loss is applied only to the ASR task.
33
  The student model consists of the full encoder of the teacher large-v3 model and a decoder with two layers initialized from the first and last layers of the large-v3 model.
34
 
35
- Kotoba-Whisper is **6.3x faster than large-v3**, while retaining as low error rate as the large-v3.
 
 
 
36
 
37
  ## Evaluation

39
  ### Speech2Text Translation (Japanese->English)
40
 
41
  | model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation)| [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
@@ -105,20 +113,24 @@ Kotoba-Whisper is **6.3x faster than large-v3**, while retaining as low error ra
105
  | [japanese-asr/distil-whisper-bilingual-v1.0](https://huggingface.co/japanese-asr/distil-whisper-bilingual-v1.0) | 20.7 | 18.6 | 2.4 | 6.4 | 10 |
106
 
107
 
 
 
 
108
 
109
-
110
-
111
-
112
-
113
- - ***Latency***: As kotoba-whisper uses the same architecture as [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3),
114
- it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
115
- (**6.3x faster than large-v3**, see the table below taken from [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)).
116
-
117
- | Model | Params / M | Rel. Latency |
118
- |----------------------------------------------------------------------------------------------|------------|--------------|
119
- | **[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)**| **756** | **6.3** |
120
- | **[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756** | **6.3** |
121
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
 
122
 
123
 
124
  ## Transformers Usage
@@ -136,7 +148,8 @@ class to transcribe short-form audio files (< 30-seconds) as follows:
136
 
137
  Download sample audio.
138
  ```shell
139
- wget
 
140
  ```
141
 
142
  ```python
@@ -148,32 +161,35 @@ from datasets import load_dataset
148
  torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
149
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
150
  model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
151
-
152
- # load model
153
  pipe = pipeline(
154
  "automatic-speech-recognition",
155
  model="kotoba-tech/kotoba-whisper-bilingual-v1.0",
156
  torch_dtype=torch_dtype,
157
  device=device,
158
- model_kwargs=model_kwargs
 
 
159
  )
160
 
161
-
162
  generate_kwargs = {"language": "ja", "task": "transcribe"}
 
 
163
 
164
- # load sample audio
165
- dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
166
- sample = dataset[0]["audio"]
 
167
 
168
- # run inference
169
- result = pipe(sample, generate_kwargs=generate_kwargs)
 
170
  print(result["text"])
171
- ```
172
 
173
- - To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline (make sure the audio is sampled in 16kHz):
174
- ```diff
175
- - result = pipe(sample, generate_kwargs=generate_kwargs)
176
- + result = pipe("audio.mp3", generate_kwargs=generate_kwargs)
177
  ```
178
 
179
  - For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
@@ -182,152 +198,12 @@ result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
182
  print(result["chunks"])
183
  ```
184
 
185
- ***Sequential Long-Form:*** Kotoba-whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered
186
- inference of long audio files (> 30-seconds), and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).
187
- As default, if long audio files are passed to the model, it will transcribes with the sequential long-form transcription.
188
- The sequential long-form algorithm should be used in either of the following scenarios:
189
-
190
- 1. Transcription accuracy is the most important factor, and latency is less of a consideration
191
- 2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
192
-
193
- If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm
194
- described [below](#chunked-long-form). For a detailed explanation of the different algorithms, refer to Sections 5 of
195
- the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf). The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
196
- class can be used to transcribe long audio files with the sequential algorithm as follows:
197
-
198
-
199
- ### Chunked Long-Form
200
- This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
201
- the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).
202
- To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
203
- is optimal. To activate batching over long audio files, pass the argument `batch_size`:
204
-
205
- ```python
206
- import torch
207
- from transformers import pipeline
208
- from datasets import load_dataset
209
-
210
- # config
211
- model_id = "kotoba-tech/kotoba-whisper-v2.0"
212
- torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
213
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
214
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
215
- generate_kwargs = {"language": "ja", "task": "transcribe"}
216
-
217
- # load model
218
- pipe = pipeline(
219
- "automatic-speech-recognition",
220
- model=model_id,
221
- torch_dtype=torch_dtype,
222
- device=device,
223
- model_kwargs=model_kwargs,
224
- chunk_length_s=15,
225
- batch_size=16
226
- )
227
-
228
- # load sample audio (concatenate instances to create a long audio)
229
- dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
230
- sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate']}
231
-
232
- # run inference
233
- result = pipe(sample, generate_kwargs=generate_kwargs)
234
- print(result["text"])
235
- ```
236
-
237
-
238
- ### Additional Speed & Memory Improvements
239
- You can apply additional speed and memory improvements to further reduce the inference speed and VRAM
240
- requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
241
- more efficient flash attention version.
242
-
243
- #### Flash Attention 2
244
-
245
- We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
246
- if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
247
-
248
- ```
249
- pip install flash-attn --no-build-isolation
250
- ```
251
-
252
- Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
253
-
254
- ```diff
255
- - model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
256
- + model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
257
- ```
258
-
259
-
260
- ## Model Details
261
- See [https://huggingface.co/distil-whisper/distil-large-v3#model-details](https://huggingface.co/distil-whisper/distil-large-v3#model-details).
262
-
263
 
264
  ## Training
265
  Please refer to [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper) for the model training details.
266
  Datasets used in distillation, along with the full set of model variations, can be found at [https://huggingface.co/japanese-asr](https://huggingface.co/japanese-asr).
267
 
268
 
269
- ## Evaluation
270
- The following code-snippets demonstrates how to evaluate the kotoba-whisper model on the Japanese subset of the CommonVoice 8.0.
271
- First, we need to install the required packages, including 🤗 Datasets to load the audio data, and 🤗 Evaluate to
272
- perform the WER calculation:
273
-
274
- ```bash
275
- pip install --upgrade pip
276
- pip install --upgrade transformers datasets[audio] evaluate jiwer
277
- ```
278
-
279
- Evaluation can then be run end-to-end with the following example:
280
-
281
- ```python
282
- import torch
283
- from transformers import pipeline
284
- from datasets import load_dataset
285
- from evaluate import load
286
- from transformers.models.whisper.english_normalizer import BasicTextNormalizer
287
-
288
- # model config
289
- model_id = "kotoba-tech/kotoba-whisper-v2.0"
290
- torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
291
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
292
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
293
- generate_kwargs = {"language": "japanese", "task": "transcribe"}
294
- normalizer = BasicTextNormalizer()
295
-
296
- # data config
297
- dataset_name = "japanese-asr/ja_asr.reazonspeech_test"
298
- audio_column = 'audio'
299
- text_column = 'transcription'
300
-
301
- # load model
302
- pipe = pipeline(
303
- "automatic-speech-recognition",
304
- model=model_id,
305
- torch_dtype=torch_dtype,
306
- device=device,
307
- model_kwargs=model_kwargs,
308
- batch_size=16
309
- )
310
-
311
- # load the dataset and sample the audio with 16kHz
312
- dataset = load_dataset(dataset_name, split="test")
313
- transcriptions = pipe(dataset['audio'])
314
- transcriptions = [normalizer(i['text']).replace(" ", "") for i in transcriptions]
315
- references = [normalizer(i).replace(" ", "") for i in dataset['transcription']]
316
-
317
- # compute the CER metric
318
- cer_metric = load("cer")
319
- cer = 100 * cer_metric.compute(predictions=transcriptions, references=references)
320
- print(cer)
321
- ```
322
-
323
- The huggingface links to the major Japanese ASR datasets for evaluation are summarized at [here](https://huggingface.co/collections/japanese-asr/japanese-asr-evaluation-dataset-66051a03d6ca494d40baaa26).
324
- For example, to evaluate the model on JSUT Basic5000, change the `dataset_name`:
325
-
326
- ```diff
327
- - dataset_name = "japanese-asr/ja_asr.reazonspeech_test"
328
- + dataset_name = "japanese-asr/ja_asr.jsut_basic5000"
329
- ```
330
-
331
  ## Acknowledgements
332
  * [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
333
  * Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
 
32
  Kotoba-whisper-bilingual's loss objective consists of cross-entropy on both the ASR and translation tasks, while the KL divergence loss is applied only to the ASR task.
33
  The student model consists of the full encoder of the teacher large-v3 model and a decoder with two layers initialized from the first and last layers of the large-v3 model.
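Written out, the objective described above combines three terms. The unweighted sum below is an illustrative assumption, since the README does not state how the terms are balanced:

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}^{\mathrm{ASR}} \;+\; \mathcal{L}_{\mathrm{CE}}^{\mathrm{translation}} \;+\; \mathcal{L}_{\mathrm{KL}}^{\mathrm{ASR}}
$$

where the cross-entropy terms are computed on both tasks and the KL divergence term between the student and teacher distributions is applied to the ASR task only.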
34
 
35
+ As kotoba-whisper uses the same architecture as [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3),
36
+ it inherits the improved latency of that architecture compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
37
+ (**6.3x faster than large-v3**, as reported by [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3); see also the inference speed comparison below).
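As a rough, illustrative check of the size gap behind that claim (this is not the authors' latency benchmark, and it downloads both checkpoints), the parameter counts of the two models can be compared directly:

```python
# Compare parameter counts of the distilled bilingual model and whisper-large-v3.
# Illustrative only: this reflects model size, not measured latency.
from transformers import AutoModelForSpeechSeq2Seq

student = AutoModelForSpeechSeq2Seq.from_pretrained("kotoba-tech/kotoba-whisper-bilingual-v1.0")
teacher = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

print(f"kotoba-whisper-bilingual-v1.0: {sum(p.numel() for p in student.parameters()) / 1e6:.0f}M parameters")
print(f"whisper-large-v3:              {sum(p.numel() for p in teacher.parameters()) / 1e6:.0f}M parameters")
```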
38
+
39
 
40
  ## Evaluation
41
 
42
+ We compare our kotoba-whisper-bilingual with OpenAI Whisper models, kotoba-whisper models, and cascaded models for translation.
43
+ Notably, kotoba-whisper-bilingual is the only model that can perform Japanese and English ASR as well as translation in both directions between Japanese and English:
44
+ OpenAI Whisper is not trained for English-to-Japanese translation, and the other models are task-specific (e.g., kotoba-whisper covers Japanese ASR only and
45
+ distil-whisper covers English ASR only).
46
+
47
  ### Speech2Text Translation (Japanese->English)
48
 
49
  | model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation)| [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
 
113
  | [japanese-asr/distil-whisper-bilingual-v1.0](https://huggingface.co/japanese-asr/distil-whisper-bilingual-v1.0) | 20.7 | 18.6 | 2.4 | 6.4 | 10 |
114
 
115
 
116
+ ### Inference Speed
117
+ Although the cascaded approach performs better on the translation task, it achieves that accuracy at the cost of additional pipeline complexity compared to a single end-to-end model.
118
+ The following table shows the mean inference time in seconds, averaged over 10 trials, for audio samples of different durations; a minimal sketch of how such a measurement could be reproduced follows the table.
119
 
120
+ | model | 10 s | 30 s | 60 s | 300 s |
121
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|------:|------:|------:|
122
+ | [kotoba-tech/kotoba-whisper-bilingual-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 0.041 | 0.111 | 0.214 | 1.077 |
123
+ | [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 0.173 | 0.247 | 0.352 | 1.772 |
124
+ | [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 0.173 | 0.24 | 0.348 | 1.515 |
125
+ | [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.17 | 0.245 | 0.348 | 1.882 |
126
+ | [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 0.108 | 0.179 | 0.283 | 1.33 |
127
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 0.061 | 0.184 | 0.372 | 1.804 |
128
+ | [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 0.062 | 0.199 | 0.415 | 1.854 |
129
+ | [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 0.062 | 0.183 | 0.363 | 1.899 |
130
+ | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 0.045 | 0.132 | 0.266 | 1.368 |
131
+ | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 0.135 | 0.376 | 0.631 | 3.495 |
132
+ | [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 0.054 | 0.108 | 0.231 | 1.019 |
133
+ | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 0.045 | 0.124 | 0.208 | 0.838 |
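The sketch below shows one way such a measurement could be set up; it is illustrative only. The hardware, generation settings, and audio content used for the numbers above are not specified here, so synthetic silence stands in for real speech, and the pipeline arguments mirror the usage example later in this README.

```python
# Minimal latency-measurement sketch (illustrative only): times the bilingual model
# on synthetic silence of the durations used in the table above, averaged over 10 trials.
import time
import numpy as np
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-bilingual-v1.0",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
    chunk_length_s=15,
    batch_size=16,
)

sampling_rate = 16_000
for duration in (10, 30, 60, 300):  # seconds, matching the table columns
    audio = {"array": np.zeros(duration * sampling_rate, dtype=np.float32),
             "sampling_rate": sampling_rate}
    timings = []
    for _ in range(10):  # 10 trials, as described above
        start = time.time()
        pipe(audio.copy(), generate_kwargs={"language": "ja", "task": "transcribe"})
        timings.append(time.time() - start)
    print(f"{duration:>4}s audio: {sum(timings) / len(timings):.3f}s mean inference time")
```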
134
 
135
 
136
  ## Transformers Usage
 
148
 
149
  Download sample audio.
150
  ```shell
151
+ wget https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval/resolve/main/sample.wav -O sample_en.wav
152
+ wget https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac -O sample_ja.flac
153
  ```
154
 
155
  ```python
 
161
  torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
162
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
163
  model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
 
 
164
  pipe = pipeline(
165
  "automatic-speech-recognition",
166
  model="kotoba-tech/kotoba-whisper-bilingual-v1.0",
167
  torch_dtype=torch_dtype,
168
  device=device,
169
+ model_kwargs=model_kwargs,
170
+ chunk_length_s=15,
171
+ batch_size=16
172
  )
173
 
174
+ # Japanese ASR
175
  generate_kwargs = {"language": "ja", "task": "transcribe"}
176
+ result = pipe("sample_ja.flac", generate_kwargs=generate_kwargs)
177
+ print(result["text"])
178
 
179
+ # English ASR
180
+ generate_kwargs = {"language": "en", "task": "transcribe"}
181
+ result = pipe("sample_en.wav", generate_kwargs=generate_kwargs)
182
+ print(result["text"])
183
 
184
+ # Translate Japanese speech to English text
185
+ generate_kwargs = {"language": "en", "task": "translate"}
186
+ result = pipe("sample_ja.flac", generate_kwargs=generate_kwargs)
187
  print(result["text"])
 
188
 
189
+ # Translate English speech to Japanese text
190
+ generate_kwargs = {"language": "ja", "task": "translate"}
191
+ result = pipe("sample_en.wav", generate_kwargs=generate_kwargs)
192
+ print(result["text"])
193
  ```
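The same pipeline also accepts audio loaded with 🤗 Datasets instead of local files. Below is a minimal sketch reusing the `pipe` object created above with one sample from a public Japanese ASR test split (any 16kHz audio dataset can be substituted):

```python
from datasets import load_dataset

# take one 16kHz sample from a public Japanese ASR test set and transcribe it
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]  # dict with "array" and "sampling_rate"
result = pipe(sample, generate_kwargs={"language": "ja", "task": "transcribe"})
print(result["text"])
```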
194
 
195
  - For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
 
198
  print(result["chunks"])
199
  ```
200

201
 
202
  ## Training
203
  Please refer to [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper) for the model training details.
204
  Datasets used in distillation, along with the full set of model variations, can be found at [https://huggingface.co/japanese-asr](https://huggingface.co/japanese-asr).
205
 
206

207
  ## Acknowledgements
208
  * [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
209
  * Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.