erastorgueva-nv committed on
Commit
022b742
·
1 Parent(s): 2e74e84

Update README.md

Files changed (1)
  1. README.md +14 -24
README.md CHANGED
@@ -36,7 +36,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
-      value: 6.3
+      value: 5.0
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -50,7 +50,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 6.9
+      value: 5.5
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -64,7 +64,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
-      value: 4.3
+      value: 3.6
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -78,7 +78,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 4.2
+      value: 3.6
 
 ---
 # NVIDIA Conformer-CTC Large (es)
@@ -91,12 +91,11 @@ img {
 
 | [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--CTC-lightgrey#model-badge)](#model-architecture)
 | [![Model size](https://img.shields.io/badge/Params-120M-lightgrey#model-badge)](#model-architecture)
-| [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)
+| [![Language](https://img.shields.io/badge/Language-es-lightgrey#model-badge)](#datasets)
 | [![Riva Compatible](https://img.shields.io/badge/NVIDIA%20Riva-compatible-brightgreen#model-badge)](#deployment-with-nvidia-riva) |
 
 
-This model transcribes speech in lowercase English alphabet including spaces and apostrophes, and is trained on several thousand hours of English speech data.
-It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters.
+This model transcribes speech in lowercase Spanish alphabet including spaces, and was trained on a composite dataset comprising of 1340 hours of Spanish speech. It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc) for complete architecture details.
 It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).
@@ -115,7 +114,7 @@ pip install nemo_toolkit['all']
 
 ```python
 import nemo.collections.asr as nemo_asr
-asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/stt_en_conformer_ctc_large")
+asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/stt_es_conformer_ctc_large")
 ```
 
 ### Transcribing using Python
@@ -132,7 +131,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-pretrained_name="nvidia/stt_en_conformer_ctc_large"
+pretrained_name="nvidia/stt_es_conformer_ctc_large"
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
@@ -154,24 +153,15 @@ The NeMo toolkit [3] was used for training the models for over several hundred epochs
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
+The checkpoint of the language model used as the neural rescorer can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_ctc_large/files). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
 
 ### Datasets
 
-All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of several thousand hours of English speech:
-
-- Librispeech 960 hours of English speech
-- Fisher Corpus
-- Switchboard-1 Dataset
-- WSJ-0 and WSJ-1
-- National Speech Corpus (Part 1, Part 6)
-- VCTK
-- VoxPopuli (EN)
-- Europarl-ASR (EN)
-- Multilingual Librispeech (MLS EN) - 2,000 hours subset
-- Mozilla Common Voice (v7.0)
-
-Note: older versions of the model may have trained on smaller set of datasets.
+All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of 1340 hours of Spanish speech:
+- Mozilla Common Voice 7.0 (Spanish) - 289 hours after data cleaning
+- Multilingual LibriSpeech (Spanish) - 801 hours after data cleaning
+- Voxpopuli transcribed subset (Spanish) - 110 hours after data cleaning
+- Fisher dataset (Spanish) - 140 hours after data cleaning
 
 ## Performance
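The metric updated throughout this commit is word error rate (WER): the word-level edit distance between a hypothesis transcript and the reference, divided by the number of reference words. As a rough illustration only (not part of the model card, and not how NeMo computes it internally), a minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution plus one insertion against a 3-word reference -> 2/3
print(wer("hola como estas", "hola que tal estas"))
```

A drop such as Dev WER 6.3 to 5.0 in the metadata above means fewer of these word-level errors per reference word on the evaluation set.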