Add reference to the cache-aware paper. (#2)
Commit 7a4ebdeb9844203deb5ceea1631603a3ed32c949
Co-authored-by: Vahid Noroozi <[email protected]>
README.md CHANGED
@@ -11,9 +11,9 @@ datasets:
 - National-Singapore-Corpus-Part-1
 - National-Singapore-Corpus-Part-6
 - vctk
-- VoxPopuli-
-- Europarl-ASR-
-- Multilingual-LibriSpeech-
+- VoxPopuli-EN
+- Europarl-ASR-EN
+- Multilingual-LibriSpeech-2000hours
 - mozilla-foundation/common_voice_8_0
 - MLCommons/peoples_speech
 thumbnail: null
@@ -66,19 +66,19 @@ img {
 
 This collection contains large-size versions of cache-aware FastConformer-Hybrid (around 114M parameters) with multiple look-ahead support, trained on large-scale English speech data.
 These models are trained for streaming ASR and can be used for streaming applications with a variety of latencies (0ms, 80ms, 480ms, 1040ms).
-These are the worst-case latencies; the average latency of the model for each case would be half of these numbers.
+These are the worst-case latencies; the average latency of the model for each case would be half of these numbers. You may find more details and evaluation results [here](https://arxiv.org/abs/2312.17279) [5].
 
 
 ## Model Architecture
 
-These models are cache-aware versions of Hybrid FastConformer which are trained for streaming ASR. You may find more info on cache-aware models here: [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#cache-aware-streaming-conformer).
+These models are cache-aware versions of Hybrid FastConformer which are trained for streaming ASR. You may find more info on cache-aware models here: [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#cache-aware-streaming-conformer) [5].
 The models are trained with multiple look-aheads, which enables them to support different latencies.
 To learn how to switch between different look-aheads, you may read the documentation on the cache-aware models.
 
 FastConformer [4] is an optimized version of the Conformer model [1], and
 you may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
 
-The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You can find more about Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).
+The model is trained in a multitask setup with joint Transducer and CTC decoder loss [5]. You can find more about Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).
 You may also find more on how to switch between the Transducer and CTC decoders in the documentation.
 
 
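As a quick illustration of the look-ahead switching discussed in the hunk above, here is a minimal sketch using NeMo's cache-aware streaming API. The checkpoint name and the attention-context values are assumptions inferred from this card's description and the latencies listed above, not part of this commit; check the linked Cache-aware Streaming Conformer docs for the exact values this model supports.

```python
# Minimal sketch: select a look-ahead (latency) on a multi-look-ahead
# cache-aware model. Assumes the NeMo toolkit is installed; the checkpoint
# name below is an assumption based on this model card's description.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/stt_en_fastconformer_hybrid_large_streaming_multi"  # assumed name
)

# att_context_size = [left_frames, right_frames]; with ~80ms per encoder
# frame, the right context maps to the latencies above (assumed mapping):
#   [70, 0] -> 0ms, [70, 1] -> 80ms, [70, 6] -> 480ms, [70, 13] -> 1040ms
asr_model.encoder.set_default_att_context_size([70, 6])  # ~480ms look-ahead
```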
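Likewise, for the Transducer/CTC decoder switching mentioned above, a hedged sketch continuing from the `asr_model` loaded in the previous snippet. `change_decoding_strategy(decoder_type=...)` is the entry point described in the linked Hybrid Transducer-CTC docs, but verify it against your installed NeMo version.

```python
# Sketch: switch the Hybrid model between its two decoders (continues
# from the `asr_model` loaded in the previous snippet).
asr_model.change_decoding_strategy(decoder_type="ctc")   # decode with the CTC head
asr_model.change_decoding_strategy(decoder_type="rnnt")  # back to the Transducer head

# Offline transcription works the same way with either decoder; the audio
# path here is a placeholder.
print(asr_model.transcribe(["sample.wav"]))
```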
@@ -226,3 +226,5 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
 [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
 
 [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+
+[5] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279)