Commit 022b742 · Update README.md
Parent: 2e74e84

README.md (changed)
Before this commit (removed lines marked `-`; some removed lines were cut off in this view and are reproduced as shown):

````diff
@@ -36,7 +36,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
-      value:
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -50,7 +50,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value:
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -64,7 +64,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
-      value:
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -78,7 +78,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value:
 
 ---
 # NVIDIA Conformer-CTC Large (es)
@@ -91,12 +91,11 @@ img {
 
 | [](#model-architecture)
 | [](#model-architecture)
-| [](#deployment-with-nvidia-riva) |
 
 
-This model transcribes speech in lowercase
-It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc) for complete architecture details.
 It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).
@@ -115,7 +114,7 @@ pip install nemo_toolkit['all']
 
 ```python
 import nemo.collections.asr as nemo_asr
-asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/
 ```
 
 ### Transcribing using Python
@@ -132,7 +131,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-pretrained_name="nvidia/
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
@@ -154,24 +153,15 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/
 
 ### Datasets
 
-All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of
-  - WSJ-0 and WSJ-1
-  - National Speech Corpus (Part 1, Part 6)
-  - VCTK
-  - VoxPopuli (EN)
-  - Europarl-ASR (EN)
-  - Multilingual Librispeech (MLS EN) - 2,000 hours subset
-  - Mozilla Common Voice (v7.0)
-
-Note: older versions of the model may have trained on smaller set of datasets.
 
 ## Performance
````
After this commit (added lines marked `+`):

````diff
@@ -36,7 +36,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
+      value: 5.0
   - task:
       type: Automatic Speech Recognition
       name: speech-recognition
@@ -50,7 +50,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
+      value: 5.5
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -64,7 +64,7 @@ model-index:
     metrics:
     - name: Dev WER
       type: wer
+      value: 3.6
   - task:
       type: Automatic Speech Recognition
       name: automatic-speech-recognition
@@ -78,7 +78,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
+      value: 3.6
 
 ---
 # NVIDIA Conformer-CTC Large (es)
@@ -91,12 +91,11 @@ img {
 
 | [](#model-architecture)
 | [](#model-architecture)
+| [](#datasets)
 | [](#deployment-with-nvidia-riva) |
 
 
+This model transcribes speech in lowercase Spanish alphabet including spaces, and was trained on a composite dataset comprising 1340 hours of Spanish speech. It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc) for complete architecture details.
 It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).
@@ -115,7 +114,7 @@ pip install nemo_toolkit['all']
 
 ```python
 import nemo.collections.asr as nemo_asr
+asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/stt_es_conformer_ctc_large")
 ```
 
 ### Transcribing using Python
@@ -132,7 +131,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
+pretrained_name="nvidia/stt_es_conformer_ctc_large"
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
````
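Besides `audio_dir`, NeMo's `transcribe_speech.py` can typically read a JSON-lines manifest via a `dataset_manifest=` argument (parameter name per NeMo's ASR examples; verify against the script in your NeMo version). A minimal sketch of writing such a manifest, where the file names are illustrative:

```python
import json

def write_manifest(path, wav_paths):
    """Write a NeMo-style JSON-lines inference manifest: one JSON object
    per line with the audio path; text is left empty for pure inference."""
    with open(path, "w", encoding="utf-8") as f:
        for wav in wav_paths:
            entry = {"audio_filepath": wav, "duration": 0.0, "text": ""}
            f.write(json.dumps(entry) + "\n")

# Illustrative file names; replace with real Spanish audio files.
write_manifest("inference_manifest.json", ["audio_1.wav", "audio_2.wav"])
```

The resulting file would then be passed as `dataset_manifest="inference_manifest.json"` in place of `audio_dir=...`, under the assumption above.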
````diff
@@ -154,24 +153,15 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
+The checkpoint of the language model used as the neural rescorer can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_ctc_large/files). You may find more information on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
 
 ### Datasets
 
+All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising 1340 hours of Spanish speech:
+
+  - Mozilla Common Voice 7.0 (Spanish) - 289 hours after data cleaning
+  - Multilingual LibriSpeech (Spanish) - 801 hours after data cleaning
+  - Voxpopuli transcribed subset (Spanish) - 110 hours after data cleaning
+  - Fisher dataset (Spanish) - 140 hours after data cleaning
 
 ## Performance
````
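The Dev/Test WER values added in this commit are word error rates: word-level Levenshtein edit distance divided by the number of reference words. A minimal self-contained sketch of that computation (an illustration only, not NVIDIA's actual scoring script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over three reference words.
print(wer("hola como estas", "hola come estas"))
```

Multiplied by 100, this gives percentages on the same scale as the metric values above (e.g. a Dev WER of 5.0 means roughly one word error per twenty reference words).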