---
library_name: mlx
tags:
  - mlx
  - automatic-speech-recognition
  - speech
  - audio
  - FastConformer
  - Conformer
  - Parakeet
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v2
---

# NexaAI/parakeet-tdt-0.6b-v2-MLX

## Quickstart

Run the model directly with nexa-sdk installed. In the nexa-sdk CLI:

```
NexaAI/parakeet-tdt-0.6b-v2-MLX
```
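For Python-based inference on Apple silicon, the community `parakeet-mlx` package is one route; the sketch below is an assumption about that package's API, not part of nexa-sdk:

```python
# Sketch: Python inference via the community `parakeet-mlx` package
# (an assumption -- not part of nexa-sdk; install with `pip install parakeet-mlx`).
from parakeet_mlx import from_pretrained

# Assumption: this repo id is loadable by parakeet-mlx.
model = from_pretrained("NexaAI/parakeet-tdt-0.6b-v2-MLX")

result = model.transcribe("sample.wav")  # path to a local audio file
print(result.text)
```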

## Overview

parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, with support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes long in a single pass. The model achieves an RTFx of 3380 on the HF Open-ASR leaderboard with a batch size of 128. Note: RTFx performance may vary depending on dataset audio duration and batch size.
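For context, RTFx (inverse real-time factor) is the ratio of audio duration to wall-clock transcription time, so an RTFx of 3380 means roughly 56 minutes of audio processed per second of compute. A minimal measurement sketch (the `transcribe_fn` callable and file lists are placeholders):

```python
# Sketch: measuring RTFx (inverse real-time factor).
# RTFx = seconds of audio transcribed / seconds of wall-clock compute.
import time

def rtfx(transcribe_fn, audio_paths, durations_sec):
    """durations_sec: per-file audio lengths in seconds, known beforehand."""
    start = time.perf_counter()
    for path in audio_paths:
        transcribe_fn(path)  # any transcription callable
    elapsed = time.perf_counter() - start
    return sum(durations_sec) / elapsed
```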

## Key Features

- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

For more information, refer to the Model Architecture section and the NeMo documentation.
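As an illustration of the timestamp feature, the base NeMo checkpoint accepts a `timestamps=True` flag at transcription time. This sketch targets the original nvidia/parakeet-tdt-0.6b-v2 model via the standard NeMo ASR API (an assumption; it does not apply to this MLX conversion):

```python
# Sketch: word-level timestamps with the base NeMo checkpoint
# (assumes `pip install nemo_toolkit[asr]`; targets the original
# nvidia/parakeet-tdt-0.6b-v2 model, not this MLX conversion).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
output = asr_model.transcribe(["audio.wav"], timestamps=True)

# Each word entry carries start/end offsets in seconds.
for word in output[0].timestamp["word"]:
    print(f"{word['start']:.2f}s - {word['end']:.2f}s : {word['word']}")
```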

This model is ready for commercial/non-commercial use.

## Benchmark Results

### Huggingface Open-ASR-Leaderboard Performance

The performance of automatic speech recognition (ASR) models is measured by Word Error Rate (WER), where lower is better. Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.
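For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words; a minimal sketch:

```python
# Sketch: word error rate via Levenshtein distance over words.
# WER = (substitutions + deletions + insertions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sit"))  # 0.333...
```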

#### Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|
| parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

#### Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
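For readers reproducing this setup, mixing noise at a target SNR amounts to scaling the noise so the speech-to-noise power ratio matches the requested level. A minimal numpy sketch (the exact evaluation pipeline is not specified in this card, so this is an assumption):

```python
# Sketch: mix a noise clip into speech at a target SNR (dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)  # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```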

#### Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
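The telephony condition described above can be approximated by downsampling to 8 kHz, applying a μ-law compand/expand round trip, and upsampling back to 16 kHz. A scipy sketch (the 8-bit quantization step and the G.711 μ=255 constant are assumptions about the exact setup):

```python
# Sketch: approximate the telephony condition
# (16 kHz -> 8 kHz -> mu-law round trip -> 16 kHz).
import numpy as np
from scipy.signal import resample_poly

MU = 255.0  # ITU-T G.711 mu-law constant (assumed)

def mu_law_roundtrip(x: np.ndarray) -> np.ndarray:
    """Compand and expand a signal in [-1, 1] with mu-law."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    # Quantize to 8 bits, as a phone codec would (assumed step).
    quantized = np.round(compressed * 127) / 127
    return np.sign(quantized) * ((1 + MU) ** np.abs(quantized) - 1) / MU

def telephony(speech_16k: np.ndarray) -> np.ndarray:
    narrow = resample_poly(speech_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    coded = mu_law_roundtrip(narrow)
    return resample_poly(coded, up=2, down=1)         # 8 kHz -> 16 kHz
```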

These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the Hugging Face ASR Leaderboard.
