Commit 022b742 · Update README.md
Parent: 2e74e84

README.md (changed)
Before this commit (removed lines marked `-`; some removed lines were cut off in this view and are reproduced as shown):

````diff
@@ -36,7 +36,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
-      value:
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -50,7 +50,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value:
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -64,7 +64,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
-      value:
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -78,7 +78,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value:
 
 ---
 # NVIDIA Conformer-CTC Large (es)
@@ -91,12 +91,11 @@ img {
 
 | [](#model-architecture)
 | [](#model-architecture)
-| [](#deployment-with-nvidia-riva) |
 
 
-This model transcribes speech in lowercase
-It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc) for complete architecture details.
 It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).
@@ -115,7 +114,7 @@ pip install nemo_toolkit['all']
 
 ```python
 import nemo.collections.asr as nemo_asr
-asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/
 ```
 
 ### Transcribing using Python
@@ -132,7 +131,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-pretrained_name="nvidia/
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
@@ -154,24 +153,15 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/
 
 ### Datasets
 
-All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of
-  - WSJ-0 and WSJ-1
-  - National Speech Corpus (Part 1, Part 6)
-  - VCTK
-  - VoxPopuli (EN)
-  - Europarl-ASR (EN)
-  - Multilingual Librispeech (MLS EN) - 2,000 hours subset
-  - Mozilla Common Voice (v7.0)
-
-Note: older versions of the model may have trained on smaller set of datasets.
 
 ## Performance
````
After this commit (added lines marked `+`):

````diff
@@ -36,7 +36,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
+      value: 5.0
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -50,7 +50,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
+      value: 5.5
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -64,7 +64,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
+      value: 3.6
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -78,7 +78,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
+      value: 3.6
 
 ---
 # NVIDIA Conformer-CTC Large (es)
@@ -91,12 +91,11 @@ img {
 
 | [](#model-architecture)
 | [](#model-architecture)
+| [](#datasets)
 | [](#deployment-with-nvidia-riva) |
 
 
+This model transcribes speech in lowercase Spanish alphabet including spaces, and was trained on a composite dataset comprising 1340 hours of Spanish speech. It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc) for complete architecture details.
 It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).
@@ -115,7 +114,7 @@ pip install nemo_toolkit['all']
 
 ```python
 import nemo.collections.asr as nemo_asr
+asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/stt_es_conformer_ctc_large")
 ```
 
 ### Transcribing using Python
@@ -132,7 +131,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
+pretrained_name="nvidia/stt_es_conformer_ctc_large"
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
````
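Besides `audio_dir`, NeMo's `transcribe_speech.py` can typically read a JSON-lines manifest via a `dataset_manifest=` argument (parameter name per NeMo's ASR examples; verify against the script in your NeMo version). A minimal sketch of writing such a manifest, where the file names are illustrative:

```python
import json

def write_manifest(path, wav_paths):
    """Write a NeMo-style JSON-lines inference manifest: one JSON object
    per line with the audio path; text is left empty for pure inference."""
    with open(path, "w", encoding="utf-8") as f:
        for wav in wav_paths:
            entry = {"audio_filepath": wav, "duration": 0.0, "text": ""}
            f.write(json.dumps(entry) + "\n")

# Illustrative file names; replace with real Spanish audio files.
write_manifest("inference_manifest.json", ["audio_1.wav", "audio_2.wav"])
```

The resulting file would then be passed as `dataset_manifest="inference_manifest.json"` in place of `audio_dir=...`, under the assumption above.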
````diff
@@ -154,24 +153,15 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
+The checkpoint of the language model used as the neural rescorer can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_ctc_large/files). You may find more information on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
 
 ### Datasets
 
+All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising 1340 hours of Spanish speech:
+
+  - Mozilla Common Voice 7.0 (Spanish) - 289 hours after data cleaning
+  - Multilingual LibriSpeech (Spanish) - 801 hours after data cleaning
+  - Voxpopuli transcribed subset (Spanish) - 110 hours after data cleaning
+  - Fisher dataset (Spanish) - 140 hours after data cleaning
 
 ## Performance
````
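The Dev/Test WER values added in this commit are word error rates: word-level Levenshtein edit distance divided by the number of reference words. A minimal self-contained sketch of that computation (an illustration only, not NVIDIA's actual scoring script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over three reference words.
print(wer("hola como estas", "hola come estas"))
```

Multiplied by 100, this gives percentages on the same scale as the metric values above (e.g. a Dev WER of 5.0 means roughly one word error per twenty reference words).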