nexaml committed on
Commit 2ee1d4a · verified · 1 Parent(s): 6835c06

Update README.md

Files changed (1):
  1. README.md +62 -16
README.md CHANGED
@@ -13,28 +13,74 @@ pipeline_tag: automatic-speech-recognition
  base_model: nvidia/parakeet-tdt-0.6b-v2
  ---

- # mlx-community/parakeet-tdt-0.6b-v2

- This model was converted to MLX format from [nvidia/parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) using [the conversion script](https://gist.github.com/senstella/77178bb5d6ec67bf8c54705a5f490bed). Please refer to the [original model card](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) for more details on the model.

- ## Use with mlx
-
- ### parakeet-mlx

  ```bash
- pip install -U parakeet-mlx
  ```

- ```bash
- parakeet-mlx audio.wav --model mlx-community/parakeet-tdt-0.6b-v2
- ```

- ### mlx-audio

- ```bash
- pip install -U mlx-audio
- ```

- ```bash
- python -m mlx_audio.stt.generate --model mlx-community/parakeet-tdt-0.6b-v2 --audio audio.wav --output somewhere
- ```
  base_model: nvidia/parakeet-tdt-0.6b-v2
  ---

+ # nexaml/parakeet-tdt-0.6b-v2-MLX

+ ## Quickstart

+ Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed. In the nexa-sdk CLI:

  ```bash
+ nexaml/parakeet-tdt-0.6b-v2-MLX
  ```

+ ## Overview

+ `parakeet-tdt-0.6b-v2` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

+ This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass. The model achieves an RTFx of 3380 on the Hugging Face Open-ASR leaderboard with a batch size of 128. Note: *RTFx performance may vary depending on dataset audio duration and batch size.*
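RTFx (inverse real-time factor) is simply audio duration divided by processing time; higher is faster. A minimal sketch of the definition (the `rtfx` helper name is illustrative, not part of NeMo or the leaderboard tooling):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds

# At an RTFx of 3380, a 24-minute (1440 s) recording takes
# roughly 1440 / 3380 ≈ 0.43 s of compute at that throughput.
compute_time = 1440.0 / 3380.0
```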
 
 
+ **Key Features**
+ - Accurate word-level timestamp predictions
+ - Automatic punctuation and capitalization
+ - Robust performance on spoken numbers and song-lyrics transcription
+
+ For more information, refer to the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
+
+ This model is ready for commercial and non-commercial use.
+
+ ## Benchmark Results
+
+ ### Hugging Face Open-ASR-Leaderboard Performance
+ The performance of automatic speech recognition (ASR) models is measured using word error rate (WER). Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.
+
+ ### Base Performance
+ The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):
+
+ | **Model** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** |
+ |:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
+ | parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |
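As a rough illustration of how WER is computed (a minimal sketch using word-level edit distance; the `wer` helper is illustrative, not the leaderboard's evaluation code, which also normalizes text before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sit" for "sat") and one deletion ("the") over 6 reference words
score = wer("the cat sat on the mat", "the cat sit on mat")
```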
+
+ ### Noise Robustness
+ Performance across different signal-to-noise ratios (SNR), using MUSAN music and noise samples:
+
+ | **SNR Level** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
+ |:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
+ | Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
+ | SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
+ | SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
+ | SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
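Mixing noise into speech at a target SNR amounts to scaling the noise so the power ratio matches the requested level. A minimal sketch under that assumption (the `mix_at_snr` helper is illustrative, not the evaluation harness actually used):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    noise = np.resize(noise, speech.shape)  # loop or trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```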
+
+ ### Telephony Audio Performance
+ Performance comparison between standard 16 kHz audio and telephony-style audio (μ-law encoding with a 16 kHz → 8 kHz → 16 kHz conversion):
+
+ | **Audio Format** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
+ |:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
+ | Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
+ | μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
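The μ-law telephony simulation relies on the companding curve used by 8 kHz telephone codecs (G.711, μ = 255). A minimal sketch of that curve (function names are illustrative, and the 16 kHz → 8 kHz → 16 kHz resampling step is omitted):

```python
import numpy as np

MU = 255.0  # companding constant used by G.711 mu-law telephony codecs

def mulaw_compress(x: np.ndarray) -> np.ndarray:
    """Map linear samples in [-1, 1] through the mu-law companding curve."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y: np.ndarray) -> np.ndarray:
    """Invert the companding curve back to linear samples."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1.0, 1.0, 11)
quantized = np.round(mulaw_compress(x) * 127) / 127  # 8-bit quantization step
recovered = mulaw_expand(quantized)                  # close to x, but lossy
```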
+
+ These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the [Hugging Face ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
+
+ ## References
+ - **Original model card**: [nvidia/parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
+ - [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+ - [Efficient Sequence Transduction by Jointly Predicting Tokens and Durations](https://arxiv.org/abs/2304.06795)
+ - [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+ - [YouTube-Commons: A Massive Open Corpus for Conversational and Multimodal Data](https://huggingface.co/blog/Pclanglais/youtube-commons)
+ - [YODAS: YouTube-Oriented Dataset for Audio and Speech](https://arxiv.org/abs/2406.00899)
+ - [Hugging Face ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
+ - [MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages](https://arxiv.org/abs/2410.01036)
+ - [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/pdf/2505.13404)