Update README.md
README.md CHANGED
@@ -103,12 +103,14 @@ Conformer-CTC model is a non-autoregressive variant of Conformer model [1] for A

The NeMo toolkit [3] was used for training the models for over several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml).

-The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
-
The vocabulary we use contains 28 characters:
```python
[' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
```
+Rare symbols with diacritics were replaced during preprocessing.
+
+The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
+For the vocabulary of size 128, we restrict the maximum subtoken length to 2 symbols to avoid populating the vocabulary with frequent dataset-specific words. This does not affect model performance and potentially helps the model adapt to other domains without retraining the tokenizer.

Full config can be found inside the .nemo files.
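
The replacement of rare symbols with diacritics is not spelled out in the card. Below is a minimal sketch of one way such a preprocessing step could look, restricted to the 28-character vocabulary listed above; it uses only Python's standard library, and the `normalize_transcript` helper and sample sentence are illustrative, not taken from the NeMo recipe.
```python
import re
import unicodedata

# The 28-character vocabulary from the card: space, apostrophe, and a-z.
VOCAB = set(" '" + "abcdefghijklmnopqrstuvwxyz")

def normalize_transcript(text: str) -> str:
    """Lowercase, fold diacritics to base letters, and drop other symbols."""
    text = text.lower()
    # NFD decomposition splits 'é' into 'e' plus a combining mark; drop the marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep only characters that belong to the 28-character vocabulary.
    text = "".join(ch for ch in text if ch in VOCAB)
    # Collapse whitespace left behind by removed symbols.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Café naïve, déjà vu!"))  # -> cafe naive deja vu
```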
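
The released tokenizers come from the linked NeMo script; as a rough illustration of the "maximum subtoken length of 2 symbols" constraint for the 128-entry vocabulary, the analogous knob in the SentencePiece trainer is `max_sentencepiece_length`. The snippet below is a sketch under that assumption, with a placeholder transcript file, and is not the exact command used to build the released tokenizers.
```python
import sentencepiece as spm

# "transcripts.txt" is a placeholder: normalized training transcripts, one per line.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="tokenizer_spe_bpe_v128",
    vocab_size=128,
    model_type="bpe",
    max_sentencepiece_length=2,  # subtokens of at most 2 symbols
    character_coverage=1.0,      # the character set is small and clean, so cover all of it
)

# Quick sanity check of the resulting subword segmentation.
sp = spm.SentencePieceProcessor(model_file="tokenizer_spe_bpe_v128.model")
print(sp.encode("speech recognition", out_type=str))
```
Keeping subtokens this short prevents whole frequent words from entering the vocabulary, which is why the card notes the tokenizer can potentially carry over to other domains without retraining.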
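
One way to inspect that config is to restore the checkpoint with the NeMo toolkit and dump `model.cfg`. The sketch below assumes a local file named `conformer_ctc.nemo` (a placeholder) and the `EncDecCTCModelBPE` class normally used for Conformer-CTC BPE checkpoints; verify the class against the files in this repo.
```python
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# "conformer_ctc.nemo" is a placeholder path to the downloaded checkpoint.
model = nemo_asr.models.EncDecCTCModelBPE.restore_from("conformer_ctc.nemo")

# model.cfg holds the full configuration packaged inside the .nemo file.
print(OmegaConf.to_yaml(model.cfg))
```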