pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# nexaml/parakeet-tdt-0.6b-v2-MLX

## Quickstart

Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed. In the nexa-sdk CLI:

```bash
nexaml/parakeet-tdt-0.6b-v2-MLX
```

## Overview

`parakeet-tdt-0.6b-v2` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass. The model achieves an RTFx of 3380 on the Hugging Face Open ASR Leaderboard with a batch size of 128. Note: *RTFx performance may vary depending on dataset audio duration and batch size.*

**Key Features**
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song lyrics transcription

For more information, refer to the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
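
For reference, the upstream checkpoint can also be run through NeMo. Below is a minimal sketch following the API shown on the upstream model card, assuming `nemo_toolkit[asr]` is installed and `audio.wav` is a placeholder for a 16 kHz mono recording (the MLX build on this page is instead meant to be run through nexa-sdk):

```python
# Sketch: transcribe with the upstream NVIDIA checkpoint via NeMo,
# not with the MLX conversion hosted here.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# timestamps=True also returns word/segment-level timing information
output = asr_model.transcribe(["audio.wav"], timestamps=True)
print(output[0].text)
for stamp in output[0].timestamp["word"]:
    print(f"{stamp['start']:.2f}s - {stamp['end']:.2f}s : {stamp['word']}")
```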

This model is ready for commercial/non-commercial use.

## Benchmark Results

### Hugging Face Open ASR Leaderboard Performance

The performance of automatic speech recognition (ASR) models is measured using Word Error Rate (WER): the number of word substitutions, deletions, and insertions in the hypothesis, divided by the number of words in the reference. Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.
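
For illustration, WER can be computed with the [`jiwer`](https://github.com/jitsi/jiwer) package (`pip install jiwer`). This is a minimal sketch with made-up strings, not the leaderboard's exact pipeline, which also applies text normalization before scoring:

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# 2 substitutions, 0 deletions, 0 insertions over 9 reference words -> ~0.222
print(jiwer.wer(reference, hypothesis))
```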

### Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| **Model** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** |
|:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
| parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

### Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| **SNR Level** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
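
Mixing noise at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the target, then adding it to the speech. A minimal NumPy sketch (a hypothetical helper, not the evaluation harness behind the numbers above):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR."""
    noise = np.resize(noise, speech.shape)  # loop or trim noise to match length
    speech_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2) + 1e-12
    # SNR(dB) = 10 * log10(speech_power / (scale**2 * noise_power))
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: degrade clean speech to 5 dB SNR, as in the last row above.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for one second of 16 kHz speech
noise = rng.standard_normal(8000)   # stand-in for a MUSAN noise clip
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```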

### Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| **Audio Format** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the [Hugging Face ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
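
The μ-law round-trip described above can be approximated with the standard-library `audioop` module (deprecated since Python 3.11 and removed in 3.13). A rough sketch, assuming a 16-bit PCM WAV at 16 kHz:

```python
# Rough sketch: simulate the telephony channel (16 kHz -> 8 kHz mu-law -> 16 kHz).
import audioop
import wave

def telephony_roundtrip(in_path: str, out_path: str) -> None:
    with wave.open(in_path, "rb") as f:
        assert f.getframerate() == 16000 and f.getsampwidth() == 2
        channels = f.getnchannels()
        pcm = f.readframes(f.getnframes())
    pcm_8k, _ = audioop.ratecv(pcm, 2, channels, 16000, 8000, None)     # downsample
    pcm_8k = audioop.ulaw2lin(audioop.lin2ulaw(pcm_8k, 2), 2)           # mu-law codec
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, channels, 8000, 16000, None) # upsample
    with wave.open(out_path, "wb") as f:
        f.setnchannels(channels)
        f.setsampwidth(2)
        f.setframerate(16000)
        f.writeframes(pcm_16k)
```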

## Reference
- **Original model card**: [nvidia/parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
- [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
- [Efficient Sequence Transduction by Jointly Predicting Tokens and Durations](https://arxiv.org/abs/2304.06795)
- [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
- [YouTube-Commons: A Massive Open Corpus for Conversational and Multimodal Data](https://huggingface.co/blog/Pclanglais/youtube-commons)
- [YODAS: YouTube-Oriented Dataset for Audio and Speech](https://arxiv.org/abs/2406.00899)
- [Hugging Face ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
- [MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages](https://arxiv.org/abs/2410.01036)
- [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/abs/2505.13404)