Update README.md
README.md CHANGED
@@ -103,12 +103,14 @@ Conformer-CTC model is a non-autoregressive variant of Conformer model [1] for A

The NeMo toolkit [3] was used for training the models for over several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml).

-The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
-
The vocabulary we use contains 28 characters:
```python
[' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
```
+Rare symbols with diacritics were replaced during preprocessing.
+
+The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
+For the vocabulary of size 128, we restrict the maximum subtoken length to 2 symbols to avoid populating the vocabulary with frequent dataset-specific words. This does not affect model performance and potentially helps the model adapt to other domains without retraining the tokenizer.

Full config can be found inside the .nemo files.
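
The replacement of rare symbols with diacritics is not spelled out in the card. Below is a minimal sketch of one way such a preprocessing step could look, restricted to the 28-character vocabulary listed above; it uses only Python's standard library, and the `normalize_transcript` helper and sample sentence are illustrative, not taken from the NeMo recipe.
```python
import re
import unicodedata

# The 28-character vocabulary from the card: space, apostrophe, and a-z.
VOCAB = set(" '" + "abcdefghijklmnopqrstuvwxyz")

def normalize_transcript(text: str) -> str:
    """Lowercase, fold diacritics to base letters, and drop other symbols."""
    text = text.lower()
    # NFD decomposition splits 'é' into 'e' plus a combining mark; drop the marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep only characters that belong to the 28-character vocabulary.
    text = "".join(ch for ch in text if ch in VOCAB)
    # Collapse whitespace left behind by removed symbols.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Café naïve, déjà vu!"))  # -> cafe naive deja vu
```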
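
The released tokenizers come from the linked NeMo script; as a rough illustration of the "maximum subtoken length of 2 symbols" constraint for the 128-entry vocabulary, the analogous knob in the SentencePiece trainer is `max_sentencepiece_length`. The snippet below is a sketch under that assumption, with a placeholder transcript file, and is not the exact command used to build the released tokenizers.
```python
import sentencepiece as spm

# "transcripts.txt" is a placeholder: normalized training transcripts, one per line.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="tokenizer_spe_bpe_v128",
    vocab_size=128,
    model_type="bpe",
    max_sentencepiece_length=2,  # subtokens of at most 2 symbols
    character_coverage=1.0,      # the character set is small and clean, so cover all of it
)

# Quick sanity check of the resulting subword segmentation.
sp = spm.SentencePieceProcessor(model_file="tokenizer_spe_bpe_v128.model")
print(sp.encode("speech recognition", out_type=str))
```
Keeping subtokens this short prevents whole frequent words from entering the vocabulary, which is why the card notes the tokenizer can potentially carry over to other domains without retraining.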
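
One way to inspect that config is to restore the checkpoint with the NeMo toolkit and dump `model.cfg`. The sketch below assumes a local file named `conformer_ctc.nemo` (a placeholder) and the `EncDecCTCModelBPE` class normally used for Conformer-CTC BPE checkpoints; verify the class against the files in this repo.
```python
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# "conformer_ctc.nemo" is a placeholder path to the downloaded checkpoint.
model = nemo_asr.models.EncDecCTCModelBPE.restore_from("conformer_ctc.nemo")

# model.cfg holds the full configuration packaged inside the .nemo file.
print(OmegaConf.to_yaml(model.cfg))
```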