Update README.md (#6)
- Update README.md (cb8e9db32f0d568ec167b39d8283a7d790de2f7d)
Co-authored-by: He Huang <[email protected]>
README.md
CHANGED
@@ -304,7 +304,7 @@ canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# update dcode params
decode_cfg = canary_model.cfg.decoding
- decode_cfg.beam.beam_size =
+ decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)
```

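The hunk above switches to beam size 1, i.e. greedy decoding. A self-contained sketch of the same flow is below; the `transcribe(paths2audio_files=..., batch_size=...)` call and the file names are assumptions based on NeMo's typical API, not part of this diff.

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the pretrained model, as in the README context line above.
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Update decoding parameters; beam_size = 1 corresponds to greedy decoding.
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# Run inference on a couple of placeholder audio files.
predicted_text = canary_model.transcribe(
    paths2audio_files=['sample1.wav', 'sample2.wav'],
    batch_size=16,
)
print(predicted_text)
```
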
@@ -332,10 +332,10 @@ Another recommended option is to use a json manifest as input, where each line i
{
"audio_filepath": "/path/to/audio.wav", # path to the audio file
"duration": 10000.0, # duration of the audio
- "taskname": "asr", # use "
- "source_lang": "en", # Set `source_lang
- "target_lang": "
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "taskname": "asr", # use "ast" for speech-to-text translation
+ "source_lang": "en", # Set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
+ "target_lang": "en", # Language of the text output, choices=['en','de','es','fr']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
}
```

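To connect the manifest fields above to runnable code, here is a minimal sketch that writes one entry per line with the standard json module and hands the manifest to the model. Passing a manifest path to `transcribe` is an assumption based on NeMo's common usage; the paths are placeholders.

```python
import json

from nemo.collections.asr.models import EncDecMultiTaskModel

# One JSON object per line, with the fields documented above.
entry = {
    "audio_filepath": "/path/to/audio.wav",
    "duration": 10000.0,
    "taskname": "asr",     # "ast" for speech-to-text translation
    "source_lang": "en",   # must equal target_lang for ASR
    "target_lang": "en",
    "pnc": "yes",          # note the quotes: "yes"/"no" as strings
}

with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Assumed manifest-path form of transcribe(), mirroring NeMo's usual API.
predicted_text = canary_model.transcribe("input_manifest.json", batch_size=16)
```
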
@@ -367,7 +367,7 @@ An example manifest for transcribing English audios can be:
"taskname": "asr",
"source_lang": "en",
"target_lang": "en",
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
}
```

@@ -381,10 +381,10 @@ An example manifest for transcribing English audios into German text can be:
{
"audio_filepath": "/path/to/audio.wav", # path to the audio file
"duration": 10000.0, # duration of the audio
- "taskname": "
+ "taskname": "ast",
"source_lang": "en",
"target_lang": "de",
- "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+ "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
}
```

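Since the ASR and AST manifests differ only in `taskname` and `target_lang`, a tiny hypothetical helper (not from the README) makes the relationship explicit:

```python
def make_entry(audio_filepath: str, duration: float,
               source_lang: str, target_lang: str, pnc: str = "yes") -> dict:
    """Build one manifest entry; ASR when the languages match, AST otherwise."""
    return {
        "audio_filepath": audio_filepath,
        "duration": duration,
        "taskname": "asr" if source_lang == target_lang else "ast",
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": pnc,
    }

asr_entry = make_entry("/path/to/audio.wav", 10000.0, "en", "en")  # English ASR
ast_entry = make_entry("/path/to/audio.wav", 10000.0, "en", "de")  # En->De AST
```
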
@@ -401,7 +401,8 @@ The model outputs the transcribed/translated text corresponding to the input aud

## Training

- Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs
+ Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs.
+ The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

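For a rough sense of what those training numbers imply (my arithmetic, not a claim from the README; it assumes every batch is filled to the full 360 s):

```python
# Audio seen per optimizer step and over the full run, assuming the
# quoted figures: 360 s of audio per GPU, 128 GPUs, 150k steps.
batch_seconds_per_gpu = 360
num_gpus = 128
steps = 150_000

hours_per_step = batch_seconds_per_gpu * num_gpus / 3600   # 12.8 h of audio per step
total_hours = hours_per_step * steps                       # ~1.92M audio-hours overall
effective_epochs = total_hours / 85_000                    # ~22.6 passes over the 85k-hr corpus

print(f"{hours_per_step:.1f} h/step, {total_hours / 1e6:.2f}M audio-hours, "
      f"~{effective_epochs:.1f} effective epochs")
```
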
@@ -410,6 +411,38 @@ The tokenizers for these models were built using the text transcripts of the tra

The Canary-1B model is trained on a total of 85k hrs of speech data. It consists of 31k hrs of public data, 20k hrs collected by [Suno](https://suno.ai/), and 34k hrs of in-house data.

+ The constituents of public data are as follows.
+
+ #### English (25.5k hours)
+ - Librispeech 960 hours
+ - Fisher Corpus
+ - Switchboard-1 Dataset
+ - WSJ-0 and WSJ-1
+ - National Speech Corpus (Part 1, Part 6)
+ - VCTK
+ - VoxPopuli (EN)
+ - Europarl-ASR (EN)
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
+ - Mozilla Common Voice (v7.0)
+ - People's Speech - 12,000 hour subset
+ - Mozilla Common Voice (v11.0) - 1,474 hour subset
+
+ #### German (2.5k hours)
+ - Mozilla Common Voice (v12.0) - 800 hour subset
+ - Multilingual Librispeech (MLS DE) - 1,500 hour subset
+ - VoxPopuli (DE) - 200 hr subset
+
+ #### Spanish (1.4k hours)
+ - Mozilla Common Voice (v12.0) - 395 hour subset
+ - Multilingual Librispeech (MLS ES) - 780 hour subset
+ - VoxPopuli (ES) - 108 hour subset
+ - Fisher - 141 hour subset
+
+ #### French (1.8k hours)
+ - Mozilla Common Voice (v12.0) - 708 hour subset
+ - Multilingual Librispeech (MLS FR) - 926 hour subset
+ - VoxPopuli (FR) - 165 hour subset
+

## Performance

@@ -417,23 +450,47 @@ In both ASR and AST experiments, predictions were generated using beam search wi

### ASR Performance (w/o PnC)

- The ASR performance is measured with word error rate (WER)
+ The ASR performance is measured with word error rate (WER), and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).

+ WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:

| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|
| 1.23.0 | canary-1b | 7.97 | 4.61 | 3.99 | 6.53 |


+ WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
+
+ | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
+ |:---------:|:-----------:|:------:|:------:|:------:|:------:|
+ | 1.23.0 | canary-1b | 3.06 | 4.19 | 3.15 | 4.12 |
+
+
More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

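As a concrete sketch of the scoring recipe added above (normalize both sides with whisper-normalizer, then compute WER), assuming `jiwer` for the WER computation, which the README does not itself specify:

```python
import jiwer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

refs = ["Mr. Brown paid $5 on Tuesday."]              # ground-truth transcript
hyps = ["mister brown paid five dollars on tuesday"]  # hypothetical model output

# Normalize ground truth and predictions before scoring, as described above.
refs_norm = [normalizer(t) for t in refs]
hyps_norm = [normalizer(t) for t in hyps]

print(f"WER: {jiwer.wer(refs_norm, hyps_norm):.2%}")
```

For the German, Spanish, and French test sets, the package's `BasicTextNormalizer` would presumably be the analogous choice, since the English normalizer is English-specific.
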
### AST Performance

- We evaluate AST performance with BLEU score
+ We evaluate AST performance with BLEU score and use the datasets' native annotations with punctuation and capitalization.
+
+ BLEU score on [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:

| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
- | 1.23.0 | canary-1b | 22.66
+ | 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+
+
+ BLEU score on [COVOST-v2](https://github.com/facebookresearch/covost) test set:
+
+ | **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 37.67 | 40.7 | 40.42 |
+
+ BLEU score on [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
+
+ | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 23.84 | 35.74 | 28.29 |
+
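A corresponding sketch for the BLEU numbers, using sacrebleu (a common choice; the README names only the metric and the test sets). The hypotheses and references here are toy strings; punctuation and capitalization are kept, matching the evaluation described above.

```python
import sacrebleu

hyps = ["Das ist ein Test.", "Hallo Welt."]     # hypothetical En->De outputs
refs = [["Das ist ein Test.", "Hallo, Welt!"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU: {bleu.score:.2f}")
```
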


## NVIDIA Riva: Deployment