[Bug] Severe Audio Degradation in Long-form Generation (>60s)

#1
by mrwd2005 - opened

Thanks for the amazing effort in releasing Qwen3-Omni and bringing the INT4 AutoRound quantized version to vllm-omni. I have been testing this model and thoroughly benchmarking its streaming E2E capabilities. While performance on short requests is flawless, I've encountered a consistent audio quality degradation issue in longer audio generation.

🖥️ Hardware & Software Environment

  • Platform/OS: Linux (Ubuntu E2E Local Server)
  • GPU: 2 × NVIDIA RTX 4080 SUPER (16 GB × 2, 32 GB total VRAM)
  • Model: Intel/Qwen3-Omni-30B-A3B-Instruct-int4-AutoRound
  • Framework: vllm-omni (Using PR #2670)
  • Strategy: qwen3_omni_moe_async_chunk config (optimized VRAM split: Thinker: 0.85, Talker: 0.35, Code2Wav: 0.05).
  • Mode: stream=True

🐛 Issue: Audio Collapses and Degrades After 60-70 Seconds

Description:
When generating short conversational texts of < 150 words (which typically translates to < 40 seconds of audio), the audio is perfect. Voices like chelsie or ethan render flawlessly, with rich emotional nuance and an impressive RTF of around 0.19.
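For context, RTF (real-time factor) is wall-clock generation time divided by the duration of the audio produced, so an RTF below 1 means faster-than-real-time synthesis. A minimal sketch of the calculation (the example timings are illustrative, not measured values from this report):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent generating / duration of audio produced.

    RTF < 1.0 means the pipeline produces audio faster than real time,
    which is what makes low-latency streaming playback feasible.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds

# Illustrative: ~7.6 s of compute for a 40 s clip corresponds to RTF 0.19,
# i.e. roughly 5x faster than real time.
print(round(real_time_factor(7.6, 40.0), 2))
```

At RTF 0.19 the generator stays well ahead of playback, which is why the first ~40 seconds of streaming feel seamless.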

However, there is a hard degradation threshold when the prompt forces a longer response (> 200 words). Once the generated audio pushes past the ~60-70 second mark, the voice suffers catastrophic decay.

  • Symptoms include: slurred speech, heavy robotic artifacts, repetitive stuttering, and eventually complete gibberish, even though the concurrently generated text deltas remain perfectly coherent and accurate.

Question for the Authors:
Could this early degradation be an inherent limit of the base model's architecture when generating long audio? Alternatively, is it a known context-window or memory-decay issue specific to the AutoRound INT4 quantized weights, or perhaps an attention-sink or context-accumulation bottleneck inside the code2wav / talker streaming stages under vLLM?

Summary:
The latency and streaming experience up to 40 seconds is absolute magic on consumer GPUs. Any insights or planned updates addressing the audio collapse beyond the ~60 s boundary would be extremely appreciated! Thanks again for the incredible work.

Intel org

Thank you so much for your detailed report and for putting the model through such rigorous benchmarking! We really appreciate the thorough documentation of your environment and the clear description of the symptoms.

We will follow up on this issue. Our first step will be to determine whether the degradation originates from the base model itself or is introduced by the INT4 AutoRound quantization process. Once the root cause is identified, we will investigate whether a fundamental fix is feasible — whether that means adjustments on the quantization side, improvements to the streaming pipeline within vllm-omni, or coordination with the upstream model authors.

We'll keep this thread updated as we make progress. Thanks again for your patience and support!

Intel org

@mrwd2005
We've identified and resolved an issue with unexpected talker quantization. The model has been updated. Please try again and let us know if the problem persists. Thanks.

