	---
	library_name: mlx
	tags:
	- mlx
	- automatic-speech-recognition
	- speech
	- audio
	- FastConformer
	- Conformer
	- Parakeet
	license: cc-by-4.0
	pipeline_tag: automatic-speech-recognition
	base_model: nvidia/parakeet-tdt-0.6b-v2
	---

	# NexaAI/parakeet-tdt-0.6b-v2-MLX

	## Quickstart

	Run the model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.
	In the nexa-sdk CLI:

	```bash
	NexaAI/parakeet-tdt-0.6b-v2-MLX
	```

	## Overview

	`parakeet-tdt-0.6b-v2` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, with support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

	This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass. The model achieves an RTFx of 3380 on the HF-Open-ASR leaderboard with a batch size of 128. Note: RTFx performance may vary depending on dataset audio duration and batch size.
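
As a rough illustration of what the RTFx figure means: RTFx is the inverse real-time factor, i.e. seconds of audio processed per second of wall-clock compute. The sketch below is illustrative arithmetic only. The helper names are hypothetical, and the 3380 figure is a batched throughput number from the leaderboard setup, so it should not be read as single-file latency or as the speed of this MLX conversion.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Compute time implied by a given RTFx (hypothetical helper)."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# A 24-minute segment at the quoted leaderboard RTFx of 3380.
implied_seconds = processing_time(24 * 60, 3380)
print(f"{implied_seconds:.2f} s of compute for 24 min of audio")
```

Measured on your own hardware, `rtfx(total_audio_seconds, elapsed_seconds)` gives a number that is directly comparable in definition, though not in conditions.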

	Key Features
	- Accurate word-level timestamp predictions
	- Automatic punctuation and capitalization
	- Robust performance on spoken numbers and song lyrics

	For more information on the architecture, refer to the original model card listed under [Reference](#reference) and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

	This model is ready for commercial/non-commercial use.


	## Benchmark Results

	### Hugging Face Open ASR Leaderboard Performance
	The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
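
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition, not the leaderboard's scoring code; evaluation pipelines typically normalize text (case, punctuation, number formats) before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer * 100:.2f}%")
```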

	### Base Performance
	The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

	| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
	|:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
	| parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |
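
The Avg WER column is consistent with an unweighted mean over the eight test sets. That reading is an assumption (the source does not state the averaging method), but the arithmetic below reproduces the published figure from the per-dataset values in the table.

```python
# Per-dataset WER (%) copied from the table above.
per_dataset_wer = {
    "AMI": 11.16,
    "Earnings-22": 11.15,
    "GigaSpeech": 9.74,
    "LS test-clean": 1.69,
    "LS test-other": 3.19,
    "SPGI Speech": 2.17,
    "TEDLIUM-v3": 3.38,
    "VoxPopuli": 5.95,
}

# Unweighted (macro) average: every dataset counts equally, regardless of size.
avg_wer = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"Avg WER = {avg_wer:.2f}%")
```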

	### Noise Robustness
	Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

	| SNR Level | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli | Relative Change |
	|:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
	| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
	| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
	| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
	| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
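
The source does not describe how the noise was mixed in. A common approach, sketched here as an assumption, scales the noise so that the ratio of speech power to noise power matches the target SNR in decibels. The signals below are synthetic stand-ins (a sine tone and Gaussian noise), not MUSAN data, and all function names are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that 10*log10(P_speech / P_noise) == snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def achieved_snr_db(speech, mixed):
    """Recover the SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]  # 0.1 s tone
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

for snr in (50, 25, 5):
    mixed = mix_at_snr(speech, noise, snr)
    print(f"target {snr} dB, achieved {achieved_snr_db(speech, mixed):.2f} dB")
```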

	### Telephony Audio Performance
	Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

	| Audio Format | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli | Relative Change |
	|:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
	| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
	| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
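
To make the telephony condition concrete, here is a minimal sketch of a 16kHz→8kHz→16kHz round trip with μ-law companding. The source only names the steps, so the details are assumptions: μ = 255 with 8-bit codes (as in G.711 telephony), the continuous companding formula rather than G.711's piecewise-linear tables, and deliberately naive resampling (sample dropping and linear interpolation) where a real pipeline would use a proper low-pass resampler.

```python
import math

MU = 255  # assumed companding parameter, as used in North American telephony


def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def quantize_8bit(y):
    """Round a companded value in [-1, 1] to one of 256 integer codes."""
    return int(round((y + 1) / 2 * 255))


def dequantize_8bit(code):
    """Map an integer code in 0..255 back to [-1, 1]."""
    return code / 255 * 2 - 1


def telephony_roundtrip(samples_16k):
    """Naive 16 kHz -> 8 kHz mu-law -> 16 kHz simulation (illustrative only)."""
    narrowband = samples_16k[::2]  # crude decimation; no anti-aliasing filter
    decoded = [
        mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(s))))
        for s in narrowband
    ]
    upsampled = []
    for i, value in enumerate(decoded):
        following = decoded[i + 1] if i + 1 < len(decoded) else value
        upsampled.append(value)
        upsampled.append((value + following) / 2)  # linear interpolation
    return upsampled[: len(samples_16k)]


tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
degraded = telephony_roundtrip(tone)
max_error = max(abs(a - b) for a, b in zip(tone, degraded))
print(f"{len(degraded)} samples, max abs error {max_error:.4f}")
```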

	These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the [Hugging Face ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
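
The source does not state how the Relative Change column in the tables above is computed. One plausible reading, assumed here, is the change in Avg WER relative to the clean baseline, signed so that a negative value means degradation. Recomputing from the rounded Avg WER values gives figures close to, but not identical to, the published column, which may have been derived from unrounded scores or a different aggregation.

```python
def relative_change(baseline_wer: float, condition_wer: float) -> float:
    """Percent change versus baseline; negative means WER got worse (assumed convention)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - condition_wer) / baseline_wer * 100


BASELINE_AVG_WER = 6.05  # clean / standard 16kHz row

# Avg WER (%) per condition, copied from the tables above.
conditions = {
    "SNR 50": 6.04,
    "SNR 25": 6.50,
    "SNR 5": 8.39,
    "μ-law 8kHz": 6.32,
}

recomputed = {
    name: relative_change(BASELINE_AVG_WER, wer) for name, wer in conditions.items()
}
for name, change in recomputed.items():
    print(f"{name}: {change:+.2f}%")
```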



	## Reference
	- Original model card: [nvidia/parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
	- [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
	- [Efficient Sequence Transduction by Jointly Predicting Tokens and Durations](https://arxiv.org/abs/2304.06795)
	- [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
	- [Youtube-commons: A massive open corpus for conversational and multimodal data](https://huggingface.co/blog/Pclanglais/youtube-commons)
	- [Yodas: Youtube-oriented dataset for audio and speech](https://arxiv.org/abs/2406.00899)
	- [Hugging Face ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
	- [MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages](https://arxiv.org/abs/2410.01036)
	- [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/pdf/2505.13404)