Streaming?

#3
by pscar - opened

Thank you NVIDIA team for releasing yet another excellent ASR model!

Is there a guide on how to achieve streaming transcription using the latest parakeet-tdt-0.6b-v2 model?

NVIDIA org

You could do chunked streaming by following this script: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py — directions on how to use it are inside the script.
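Roughly, the idea behind that script is to slide a fixed-size buffer over the audio so each decoded chunk gets extra left/right context, then keep only the central chunk's output when merging. A minimal pure-Python sketch of just the chunking arithmetic (the helper name `make_chunks` is mine, not NeMo's):

```python
# Sketch of the buffered-chunking idea behind the script (illustrative only;
# the helper name and exact padding scheme are mine, not NeMo's).

def make_chunks(num_samples, sample_rate=16000, chunk_len_in_secs=1.0, total_buffer_in_secs=4.0):
    """Yield (buffer_start, buffer_end, chunk_start, chunk_end) sample ranges.

    Each chunk of `chunk_len_in_secs` is decoded inside a larger buffer of
    `total_buffer_in_secs`, so the model sees surrounding context; only the
    central chunk's tokens would be kept when merging results.
    """
    chunk = int(chunk_len_in_secs * sample_rate)
    buffer = int(total_buffer_in_secs * sample_rate)
    pad = (buffer - chunk) // 2  # context on each side of the chunk
    for chunk_start in range(0, num_samples, chunk):
        chunk_end = min(chunk_start + chunk, num_samples)
        buf_start = max(0, chunk_start - pad)
        buf_end = min(num_samples, chunk_end + pad)
        yield buf_start, buf_end, chunk_start, chunk_end

# 3 seconds of 16 kHz audio -> three 1 s chunks, each padded with context
spans = list(make_chunks(48000))
```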

We noticed a bug with TDT for chunked streaming inference; we will push the fix to main soon for everyone to try!

We also have a dedicated cache-aware architecture for streaming use cases: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_multi . We are also working on an upgraded, more performant successor to this model.

Hi @nithinraok . Thanks for that link. Waiting eagerly for the new streaming models! About the bug: do you recommend waiting for the bugfix if it's major, or can the version on main be used already?

I second the idea of live transcription. I would love an alternative to Whisper with a decent interface that runs on my laptop and works offline. Press a key, record your voice, release the key, and it transcribes and pastes into a field.

Has it been fixed yet?
Or is there any update on the progress?

BatchedFrameASRTDT: ImportError. Could not import.

NVIDIA org

Yes, the fix is now merged to main. Use this script for performing buffered streaming: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py

Hi @nithinraok , thank you so much for the update! One question out of curiosity: according to the relevant commit, TDT does not currently support greedy_batch decoding strategy, but the .nemo file in this repository defaults to greedy_batch decoding strategy. Is this expected?

NVIDIA org

Yes, that's used by default for offline inference. For streaming it gets changed to greedy for now.

Thanks for your update. I saw that your Hugging Face demo has an interactive interface made with Gradio. Can I deploy the streaming model interface on a server myself and use your Gradio interface for a non-commercial demo?

Hi,

I am working on a real-time mic version, and I have a working one ready to test:

https://huggingface.co/spaces/WJ88/NVIDIA-Parakeet-TDT-0.6B-v2-INT8-Real-Time-Mic-Transcription
*The whole point of this space is to fit the model into 2 vCPUs :) and it works!

The UI may not be pretty, but overall just click RECORD, speak, and watch the transcription. After you finish, refresh the browser tab to free resources (please).
NOTE: the app is currently public, meaning each user's transcriptions accumulate and other users can see them. I am working on isolation, but it is what it is — it works :)

You can use NVIDIA-Parakeet-TDT-0.6B-v2 without an NVIDIA card in REAL TIME. I encourage you to try it and read the code (it's interesting that the model fits into 2 vCPUs),
and finally clone it and build your own version on top! I will stick to optimizations, not fancy features, in my repo.

"I love Pain"

I am in the main branch (commit 259d684e73c45091f0b6144342133e6ceb7e824c)
@nithinraok you mentioned that tdt streaming is fixed. Just checking again.
The script speech_to_text_buffered_infer_rnnt calls BatchedFrameASRTDT for TDT from streaming_utils.py with the argument stateful_decoding, which I pass as true.
But the class BatchedFrameASRTDT in turn calls its parent BatchedFrameASRRNNT like this:
super().__init__(asr_model, frame_len=frame_len, total_buffer=total_buffer, batch_size=batch_size)
without passing stateful_decoding, and thus it remains False, as defined by the default.

Is that how you intended it to be? Stateful decoding always false?
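To make the behavior I'm describing concrete, here is a simplified stand-in (not the real NeMo classes, just minimal stubs with made-up defaults) showing how the flag gets dropped when the child doesn't forward it:

```python
# Simplified stand-ins for the classes discussed above (NOT the real NeMo
# implementations): the child accepts stateful_decoding but never forwards
# it, so the parent's default (False) always wins.

class BatchedFrameASRRNNT:  # stand-in for the parent class
    def __init__(self, frame_len=1.0, total_buffer=4.0, batch_size=4, stateful_decoding=False):
        self.stateful_decoding = stateful_decoding

class BatchedFrameASRTDT(BatchedFrameASRRNNT):  # stand-in for the child class
    def __init__(self, frame_len=1.0, total_buffer=4.0, batch_size=4, stateful_decoding=False):
        # stateful_decoding is accepted here but not passed to super():
        super().__init__(frame_len=frame_len, total_buffer=total_buffer, batch_size=batch_size)

tdt = BatchedFrameASRTDT(stateful_decoding=True)
print(tdt.stateful_decoding)  # False: the caller's flag is silently dropped
```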

I'm not sure if I'm just too inexperienced to get this to work. I used the file referenced above: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py

And I downloaded parakeet using the huggingface-cli
huggingface-cli download nvidia/parakeet-tdt-0.6b-v2 parakeet-tdt-0.6b-v2.nemo

Then I tried to run the file using
python speech_to_text_buffered_infer_rnnt.py model_path=~/.cache/huggingface/hub/models--nvidia--parakeet-tdt-0.6b-v2/snapshots/c4b828d094af2c7238dfe03b58e0c56bc69ea57a/parakeet-tdt-0.6b-v2.nemo +audio_dir=audio_dir chunk_len_in_secs=1.0 total_buffer_in_secs=1.5 cuda=0 batch_size=1

But I keep getting hydra.errors.MissingConfigException: In 'TranscriptionConfig': Could not find 'audio_dir/audio_dir'

I tried using a manifest (as suggested by ChatGPT):
python speech_to_text_buffered_infer_rnnt.py model_path=~/.cache/huggingface/hub/models--nvidia--parakeet-tdt-0.6b-v2/snapshots/c4b828d094af2c7238dfe03b58e0c56bc69ea57a/parakeet-tdt-0.6b-v2.nemo dataset_manifest=my_manifest.jsonl chunk_len_in_secs=1.0 total_buffer_in_secs=1.5 cuda=0 batch_size=1
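For reference, a NeMo-style manifest is one JSON object per line with audio_filepath, duration, and text fields (text can be empty for pure inference). A sketch of how such a file can be built (the file names here are placeholders, not my actual paths):

```python
# Build a minimal NeMo-style JSONL manifest: one JSON object per line.
# File paths below are placeholders for illustration.
import json

entries = [
    {"audio_filepath": "audio_dir/sample1.wav", "duration": 7.2, "text": ""},
    {"audio_filepath": "audio_dir/sample2.wav", "duration": 3.5, "text": ""},
]
with open("my_manifest.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```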

But then I get

[NeMo I 2025-06-10 03:17:06 nemo_logging:393] Inference will be done on device : [0]
Error executing job with overrides: ['model_path=~/.cache/huggingface/hub/models--nvidia--parakeet-tdt-0.6b-v2/snapshots/c4b828d094af2c7238dfe03b58e0c56bc69ea57a/parakeet-tdt-0.6b-v2.nemo', 'dataset_manifest=my_manifest.jsonl', 'chunk_len_in_secs=1.0', 'total_buffer_in_secs=1.5', 'cuda=0', 'batch_size=1']
Traceback (most recent call last):
  File "/home/foo/speech_to_text_buffered_infer_rnnt.py", line 191, in main
    asr_model, model_name = setup_model(cfg, map_location)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/foo/virtual_env/lib/python3.12/site-packages/nemo/collections/asr/parts/utils/transcribe_utils.py", line 268, in setup_model
    model_cfg = ASRModel.restore_from(restore_path=cfg.model_path, return_config=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/foo/virtual_env/lib/python3.12/site-packages/nemo/core/classes/modelPT.py", line 476, in restore_from
    if is_multistorageclient_url(restore_path):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/foo/virtual_env/lib/python3.12/site-packages/nemo/utils/msc_utils.py", line 43, in is_multistorageclient_url
    has_msc_prefix = path and str(path).startswith(msc.types.MSC_PROTOCOL)
                                                   ^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'types'

Using Python 3.12.3, if that makes a difference. I'm also using the latest from main via
pip install git+https://github.com/NVIDIA/NeMo.git@main

Could someone help?
