---
library_name: mlx
tags:
  - mlx
  - automatic-speech-recognition
  - speech
  - audio
  - FastConformer
  - Conformer
  - Parakeet
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# NexaAI/parakeet-tdt-0.6b-v2-MLX

## Quickstart

Run the model directly with nexa-sdk installed. In the nexa-sdk CLI:

```
NexaAI/parakeet-tdt-0.6b-v2-MLX
```
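For Python-based inference on Apple silicon, the community `parakeet-mlx` package is one route; the sketch below is an assumption about that package's API, not part of nexa-sdk:

```python
# Sketch: Python inference via the community `parakeet-mlx` package
# (an assumption -- not part of nexa-sdk; install with `pip install parakeet-mlx`).
from parakeet_mlx import from_pretrained

# Assumption: this repo id is loadable by parakeet-mlx.
model = from_pretrained("NexaAI/parakeet-tdt-0.6b-v2-MLX")

result = model.transcribe("sample.wav")  # path to a local audio file
print(result.text)
```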

## Overview

parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, with support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes long in a single pass. The model achieves an RTFx of 3380 on the HF Open-ASR leaderboard with a batch size of 128. Note: RTFx performance may vary depending on dataset audio duration and batch size.
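For context, RTFx (inverse real-time factor) is the ratio of audio duration to wall-clock transcription time, so an RTFx of 3380 means roughly 56 minutes of audio processed per second of compute. A minimal measurement sketch (the `transcribe_fn` callable and file lists are placeholders):

```python
# Sketch: measuring RTFx (inverse real-time factor).
# RTFx = seconds of audio transcribed / seconds of wall-clock compute.
import time

def rtfx(transcribe_fn, audio_paths, durations_sec):
    """durations_sec: per-file audio lengths in seconds, known beforehand."""
    start = time.perf_counter()
    for path in audio_paths:
        transcribe_fn(path)  # any transcription callable
    elapsed = time.perf_counter() - start
    return sum(durations_sec) / elapsed
```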

## Key Features

- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

For more information, refer to the Model Architecture section and the NeMo documentation.
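As an illustration of the timestamp feature, the base NeMo checkpoint accepts a `timestamps=True` flag at transcription time. This sketch targets the original nvidia/parakeet-tdt-0.6b-v2 model via the standard NeMo ASR API (an assumption; it does not apply to this MLX conversion):

```python
# Sketch: word-level timestamps with the base NeMo checkpoint
# (assumes `pip install nemo_toolkit[asr]`; targets the original
# nvidia/parakeet-tdt-0.6b-v2 model, not this MLX conversion).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
output = asr_model.transcribe(["audio.wav"], timestamps=True)

# Each word entry carries start/end offsets in seconds.
for word in output[0].timestamp["word"]:
    print(f"{word['start']:.2f}s - {word['end']:.2f}s : {word['word']}")
```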

This model is ready for commercial/non-commercial use.

## Benchmark Results

### Huggingface Open-ASR-Leaderboard Performance

The performance of automatic speech recognition (ASR) models is measured by Word Error Rate (WER), where lower is better. Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.
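For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words; a minimal sketch:

```python
# Sketch: word error rate via Levenshtein distance over words.
# WER = (substitutions + deletions + insertions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sit"))  # 0.333...
```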

#### Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|
| parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

#### Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
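For readers reproducing this setup, mixing noise at a target SNR amounts to scaling the noise so the speech-to-noise power ratio matches the requested level. A minimal numpy sketch (the exact evaluation pipeline is not specified in this card, so this is an assumption):

```python
# Sketch: mix a noise clip into speech at a target SNR (dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)  # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```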

#### Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
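The telephony condition described above can be approximated by downsampling to 8 kHz, applying a μ-law compand/expand round trip, and upsampling back to 16 kHz. A scipy sketch (the 8-bit quantization step and the G.711 μ=255 constant are assumptions about the exact setup):

```python
# Sketch: approximate the telephony condition
# (16 kHz -> 8 kHz -> mu-law round trip -> 16 kHz).
import numpy as np
from scipy.signal import resample_poly

MU = 255.0  # ITU-T G.711 mu-law constant (assumed)

def mu_law_roundtrip(x: np.ndarray) -> np.ndarray:
    """Compand and expand a signal in [-1, 1] with mu-law."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    # Quantize to 8 bits, as a phone codec would (assumed step).
    quantized = np.round(compressed * 127) / 127
    return np.sign(quantized) * ((1 + MU) ** np.abs(quantized) - 1) / MU

def telephony(speech_16k: np.ndarray) -> np.ndarray:
    narrow = resample_poly(speech_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    coded = mu_law_roundtrip(narrow)
    return resample_poly(coded, up=2, down=1)         # 8 kHz -> 16 kHz
```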

These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the Hugging Face ASR Leaderboard.
