---
license: cc-by-4.0
language:
- en
pipeline_tag: automatic-speech-recognition
library_name: nemo
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: Quantum_STT_V2.0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 11.15
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.74
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.69
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.17
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.38
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 5.95
metrics:
- wer
base_model:
- Quantamhash/Quantum_STT
---
<div align="center">
  <img src="https://huggingface.co/datasets/Quantamhash/Assets/resolve/main/images/dark_logo.png"
       alt="Title card"
       style="width: 500px; height: auto; object-position: center top;">
</div>

# **Quantum_STT_V2.0**

<style>
img {
  display: inline;
}
</style>
[Model architecture](#model-architecture) | [Datasets](#datasets)
## <span style="color:#466f00;">Description:</span>

`Quantum_STT_V2.0` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/Quantamhash/Quantum_STT_V2.0

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes long in a single pass.
**Key Features**
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song-lyrics transcription

This model is ready for commercial and non-commercial use.
## <span style="color:#466f00;">License/Terms of Use:</span>

GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.

### <span style="color:#466f00;">Deployment Geography:</span>

Global
### <span style="color:#466f00;">Use Case:</span>

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

### <span style="color:#466f00;">Release Date:</span>

14/05/2025
### <span style="color:#466f00;">Model Architecture:</span>

**Architecture Type**:

FastConformer-TDT

**Network Architecture**:

* This model was developed based on the [FastConformer encoder](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) architecture [1] and the TDT decoder [2].
* This model has 600 million parameters.
### <span style="color:#466f00;">Input:</span>
- **Input Type(s):** 16 kHz audio
- **Input Format(s):** `.wav` and `.flac` audio formats
- **Input Parameters:** 1D (audio signal)
- **Other Properties Related to Input:** Monochannel (single-channel) audio
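Before running inference, it can help to confirm that an input file matches the expected format. Below is a minimal sketch using only the Python standard library; `check_audio` is a hypothetical helper, not part of NeMo, and the stdlib `wave` module only covers `.wav` files (not `.flac`):

```python
import wave

def check_audio(path: str) -> None:
    """Verify a .wav file matches the model's expected input: 16 kHz, mono."""
    with wave.open(path, "rb") as wf:
        rate, channels = wf.getframerate(), wf.getnchannels()
        if rate != 16000:
            raise ValueError(f"expected 16000 Hz sample rate, got {rate}")
        if channels != 1:
            raise ValueError(f"expected mono audio, got {channels} channels")
```

Files that fail this check can be resampled or downmixed with a tool such as `ffmpeg` or `sox` before transcription.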
### <span style="color:#466f00;">Output:</span>
- **Output Type(s):** Text
- **Output Format:** String
- **Output Parameters:** 1D (text)
- **Other Properties Related to Output:** Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
## <span style="color:#466f00;">How to Use this Model:</span>

To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest version of PyTorch.
```bash
pip install -U nemo_toolkit["asr"]
```
The model is available for use in the NeMo toolkit [3] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
#### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="Quantamhash/Quantum_STT_V2.0")
```
#### Transcribing using Python
First, let's get a sample:
```bash
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
#### Transcribing with timestamps

To transcribe with timestamps:
```python
output = asr_model.transcribe(['2086-149220-0033.wav'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word']        # word-level timestamps for the first sample
segment_timestamps = output[0].timestamp['segment']  # segment-level timestamps
char_timestamps = output[0].timestamp['char']        # char-level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
```
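Since subtitle generation is one of the use cases listed above, the segment timestamps can be rendered as SRT with a small helper. This is a sketch: it assumes each segment is a dict with `'start'`/`'end'` in seconds and the text under `'segment'`, as in the example above, and `to_srt` is not a NeMo function:

```python
def to_srt(segments):
    """Render segment timestamps (dicts with 'start', 'end', 'segment') as SRT."""
    def fmt(t: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm
        ms = int(round(t * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines += [str(i), f"{fmt(seg['start'])} --> {fmt(seg['end'])}", seg['segment'], ""]
    return "\n".join(lines)
```

For example, `to_srt(segment_timestamps)` can be written straight to a `.srt` file.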
## <span style="color:#466f00;">Software Integration:</span>

**Runtime Engine(s):**
* NeMo 2.2

**Supported Operating System(s):**

- Linux
**Hardware Specific Requirements:**

At least 2 GB of RAM is required to load the model; more RAM allows longer audio inputs to be processed.

#### Model Version

Current version: Quantum_STT_V2.0. Previous versions can be accessed [here](https://huggingface.co/Quantamhash/Quantum_STT).
## <span style="color:#466f00;">Performance</span>

#### Huggingface Open-ASR-Leaderboard Performance
The performance of automatic speech recognition (ASR) models is measured using word error rate (WER). Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.

### Base Performance
The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):
| **Model** | **Avg WER** | **AMI** | **Earnings-22** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI Speech** | **TEDLIUM-v3** | **VoxPopuli** |
|:-------------|:-------------:|:---------:|:------------------:|:----------------:|:-----------------:|:-----------------:|:------------------:|:----------------:|:---------------:|
| Quantum_STT_V2.0 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |
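For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal implementation is sketched below; note the Open-ASR-Leaderboard applies additional text normalization before scoring, which this sketch omits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming table for edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference gives a WER of 1/3 (about 33%).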
### Noise Robustness
Performance across different signal-to-noise ratios (SNR) using MUSAN music and noise samples:

| **SNR Level** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:---------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
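Noise conditions like these can be reproduced approximately by scaling a noise signal to a target SNR before mixing. A minimal sketch (not the exact evaluation script; signals are assumed to be equal-length lists of float samples):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then mix."""
    p_speech = sum(x * x for x in speech) / len(speech)  # mean speech power
    p_noise = sum(x * x for x in noise) / len(noise)     # mean noise power
    target_p_noise = p_speech / (10 ** (snr_db / 10))    # power for desired SNR
    scale = math.sqrt(target_p_noise / p_noise)
    return [s + scale * n for s, n in zip(speech, noise)]
```

In practice, evaluation pipelines also trim or loop the noise to match the utterance length and may clip the mixed signal to the valid sample range.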
### Telephony Audio Performance
Performance comparison between standard 16 kHz audio and telephony-style audio (using μ-law encoding with 16 kHz → 8 kHz → 16 kHz conversion):

| **Audio Format** | **Avg WER** | **AMI** | **Earnings** | **GigaSpeech** | **LS test-clean** | **LS test-other** | **SPGI** | **Tedlium** | **VoxPopuli** | **Relative Change** |
|:-----------------|:-------------:|:----------:|:------------:|:----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|:---------------:|:-----------------:|
| Standard 16 kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8 kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
These WER scores were obtained using greedy decoding without an external language model.
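The μ-law telephony degradation above can be approximated in a few lines. This is a rough sketch, not the evaluation pipeline: it uses the continuous G.711-style μ-law companding curve (μ = 255) with 8-bit quantization, and the 16 kHz → 8 kHz → 16 kHz step is naive decimation and sample-and-hold upsampling with no anti-aliasing filter:

```python
import math

MU = 255.0  # μ-law parameter used in G.711 telephony

def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an 8-bit μ-law code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255.0))

def mulaw_decode(code: int) -> float:
    """Expand an 8-bit μ-law code back to a float sample in [-1, 1]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def simulate_telephony(samples):
    """Rough 16 kHz -> 8 kHz -> 16 kHz μ-law round trip (no proper resampling)."""
    companded = [mulaw_decode(mulaw_encode(s)) for s in samples]
    down = companded[::2]                   # 16 kHz -> 8 kHz by decimation
    up = [v for v in down for _ in (0, 1)]  # 8 kHz -> 16 kHz by sample-and-hold
    return up
```

A production pipeline would instead use proper polyphase resampling (e.g., via `torchaudio` or `scipy`) and the integer G.711 codec tables.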