Long deletions when reducing the right context using streaming_infer
Hi,
Thanks for making this great model available.
I've been running some experiments that reduce the right context length using speech_to_text_streaming_infer_rnnt.py, to explore the WER/delay trade-off for real-time streaming. I've noticed that when I reduce the right context, the number of deletions goes up while the numbers of substitutions and insertions stay very similar. The deletions also get longer as the right context gets smaller, so whole chunks of several words go missing rather than 1 or 2 words, which is quite unfortunate. In one file I tried, a single deletion extended to 21 words, with no obvious acoustic reason for those words being more challenging. Is this to be expected?
Thanks
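For anyone wanting to quantify the effect described above, here is a minimal sketch of how deletion run lengths could be measured from a reference/hypothesis pair. It uses a plain word-level Levenshtein alignment written from scratch; the function names align_ops and deletion_runs are hypothetical and are not part of NeMo or of the streaming script, and the alignment may differ slightly from what a given WER tool reports when several minimum-cost alignments exist.

```python
def align_ops(ref, hyp):
    """Word-level Levenshtein alignment of two token lists.

    Returns one op per alignment step: 'ok', 'sub', 'del' or 'ins'.
    Hypothetical helper for illustration, not a NeMo function.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the end, preferring match/substitution over deletion.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append("ok" if ref[i - 1] == hyp[j - 1] else "sub")
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("del")
            i -= 1
        else:
            ops.append("ins")
            j -= 1
    ops.reverse()
    return ops


def deletion_runs(ref_text, hyp_text):
    """Lengths of consecutive runs of deleted reference words."""
    runs, current = [], 0
    for op in align_ops(ref_text.split(), hyp_text.split()):
        if op == "del":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    ref = "the cat sat on the mat today"
    hyp = "the cat today"
    runs = deletion_runs(ref, hyp)
    print("deletion runs:", runs, "longest:", max(runs, default=0))
```

Taking the maximum of deletion_runs per file at each right-context setting would show whether the longest dropped span grows as the right context shrinks, separately from the overall deletion count.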
What is the right context you are trying?
Could you raise this as an issue on https://github.com/NVIDIA/NeMo?
I was using 10 secs of left context, a 2 sec chunk, and a right context ranging from 2.5 secs down to 0.25 secs, just to tease out the behaviour. I can raise an issue.
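To make the delay side of that sweep concrete, here is a rough sketch of the nominal lookahead for each setting. It assumes the usual chunked-streaming picture in which a chunk is decoded only once the chunk and its right context have both arrived; that is an assumption about the setup, not something confirmed in this thread, and it ignores compute time. Only the 2.5 sec and 0.25 sec endpoints come from the post above; the intermediate right-context values and the function name nominal_delay_range are hypothetical.

```python
CHUNK_SECS = 2.0
LEFT_CONTEXT_SECS = 10.0  # affects accuracy and compute, not the lookahead delay
# Endpoints 2.5 and 0.25 are from the post; values in between are made up.
RIGHT_CONTEXT_SECS = [2.5, 2.0, 1.0, 0.5, 0.25]


def nominal_delay_range(chunk_secs, right_context_secs):
    """(best, worst) wait in seconds before audio in a chunk can be decoded.

    Audio at the end of a chunk waits only for the right context;
    audio at the start waits for the rest of the chunk as well.
    Assumed model of chunked streaming, excluding compute time.
    """
    best = right_context_secs
    worst = chunk_secs + right_context_secs
    return best, worst


if __name__ == "__main__":
    for right in RIGHT_CONTEXT_SECS:
        best, worst = nominal_delay_range(CHUNK_SECS, right)
        print(f"right={right:>4} s  delay between {best:.2f} s and {worst:.2f} s")
```

Under that assumption, with a 2 sec chunk the worst-case wait only falls from 4.5 secs to 2.25 secs across the sweep, so the chunk size dominates the delay at the small right-context end.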