Having issues with repetition.

by rdsm - opened Feb 23

Feb 23

Is anyone else having issues with the model repeating itself?
After being deployed for some time, the model started repeating itself: "The !!!!!!!!!!!!!!!!! (...continues indefinitely...)"

$ curl -s -X POST "http://[my internal url]/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}],
    "max_tokens": 100,
    "temperature": 0.7
  }' | python -m json.tool
{
    "id": "chatcmpl-960b0d2b9bd89f72",
    "object": "chat.completion",
    "created": 1771856932,
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning": " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 33,
        "total_tokens": 133,
        "completion_tokens": 100,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}

Hardware: B200s
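
For anyone who wants to catch this automatically, here is a minimal sketch of a check that flags responses like the one above. It assumes the response has already been parsed into a dict with the same shape as the JSON shown (`choices[].message.content` and `.reasoning`). The names `has_repeated_run`, `response_is_degenerate`, and the threshold `MIN_RUN` are hypothetical and not part of vLLM.

```python
import re

# Hypothetical threshold: how many identical characters in a row count as degenerate.
MIN_RUN = 20


def has_repeated_run(text, min_run=MIN_RUN):
    """Return True if one non-space character repeats min_run or more times in a row."""
    if not text:
        return False
    # (\S) captures one character; \1{n,} requires at least n more copies of it.
    pattern = r"(\S)\1{%d,}" % (min_run - 1)
    return re.search(pattern, text) is not None


def response_is_degenerate(response):
    """Check the 'content' and 'reasoning' fields of every choice in a parsed response."""
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for field in ("content", "reasoning"):
            if has_repeated_run(message.get(field)):
                return True
    return False
```

A check like this could run against a periodic canary request, so the repetition is noticed soon after it starts rather than from user reports.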

Aryoung

Feb 27

Same issue here.
Any update?

rdsm

Feb 28

No luck. I reverted the deployment back to the Moonshot version.

Xinxinli

29 days ago

Hi @rdsm, can you describe your deployment setup?

rdsm

28 days ago

@Xinxinli I am using 8x B300s. I took the vllm-openai image tagged cu130-nightly-7b6e5289bce66d33e338cdba5ea3e0db174d1f53 and applied the fixes from https://github.com/vllm-project/vllm/pull/33764#issuecomment-3916675391. Apparently a fix has since been merged into vLLM mainline.

g-a-b-y

13 days ago

How are you running vLLM? Works fine for me using v0.17.1

rdsm

13 days ago

> How are you running vLLM? Works fine for me using v0.17.1

@g-a-b-y, I am using two configurations, 4x B300s and 8x B300s, last tested on v0.18.0-cu130. The model is fine initially, but after some load it eventually starts the repetition pattern. I ran heavy benchmarks and noticed no issues, but after it was released to the public the problem started again. It seems to me to be related to some specific type of request that triggers the issue...

Initially I noticed the issue only on the NVFP4 variant, but now I also see it on the regular INT4 moonshotai one when using the most recent vLLM versions.

There are a few theories here: https://github.com/vllm-project/vllm/issues/36763

rdsm

13 days ago

@g-a-b-y, would you mind sharing more information about your deployment? Maybe the startup flags and parameters that you are passing, and the kind of load it is being exposed to?

My last attempt was on the regular moonshotai model. I tried this (enable_flashinfer_autotune was a suggestion from Wei Zhao):

/usr/bin/python3 /usr/local/bin/vllm serve moonshotai/Kimi-K2.5 --tensor-parallel-size 4 --mm-encoder-tp-mode data --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --kernel-config '{"enable_flashinfer_autotune": false}'
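
A side note on the JSON argument: if it is typed into a shell unquoted, the shell splits it at the space and strips the double quotes. A rough sketch of one way to avoid that is to build the argument list in Python and let `json.dumps` and `shlex.join` handle the quoting. The flags are copied from the command above; launching through a Python-built argv is only an illustration, not how the deployment was actually started.

```python
import json
import shlex

# Kernel config from the command above, as a Python dict.
kernel_config = {"enable_flashinfer_autotune": False}

# Each flag and value is its own list element, so no shell quoting is involved.
argv = [
    "vllm", "serve", "moonshotai/Kimi-K2.5",
    "--tensor-parallel-size", "4",
    "--mm-encoder-tp-mode", "data",
    "--trust-remote-code",
    "--tool-call-parser", "kimi_k2",
    "--reasoning-parser", "kimi_k2",
    "--gpu-memory-utilization", "0.95",
    "--enable-auto-tool-choice",
    "--kernel-config", json.dumps(kernel_config),
]

# shlex.join produces a copy-pasteable command line with the JSON safely quoted.
print(shlex.join(argv))
```

The same `argv` list can be passed directly to `subprocess.run(argv)`, which bypasses the shell entirely.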

g-a-b-y

12 days ago

@rdsm I'm using 0.17.1 with CUDA 12.8

Same command as yours, but my --gpu-memory-utilization is 0.93 and I don't set enable_flashinfer_autotune in the kernel config.

0xbe7a

12 days ago

@rdsm We experienced the exact same issue on 4x B300 and only noticed it after some extended load.

rdsm

12 days ago

@g-a-b-y are you running on B300s?

0xbe7a

12 days ago

Yes

g-a-b-y

12 days ago

@rdsm I'm not. I'm using a custom tool parser though. Could it be that?

It is the one from here: https://github.com/vllm-project/vllm/issues/37184#issuecomment-4073230433

rdsm

7 days ago

edited 7 days ago

The issue was found and fixed by the vLLM team; v0.18.1 has the fix.
For v0.18.0, passing --attention-config.use_trtllm_ragged_deepseek_prefill=True fixes the problem.
more details at: https://github.com/vllm-project/vllm/pull/38562
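
For deployments that pin the vLLM version in a launcher script, the rule above could be sketched roughly as follows. The flag string is the one quoted in this post. The helpers `parse_version` and `workaround_args` are hypothetical, and returning no extra arguments for versions older than v0.18.0 is an assumption, since the thread does not say whether the flag exists or helps there.

```python
# Flag quoted in the post above as the workaround for v0.18.0.
WORKAROUND_FLAG = "--attention-config.use_trtllm_ragged_deepseek_prefill=True"


def parse_version(tag):
    """Turn a tag such as 'v0.18.0-cu130' or '0.18.1' into a tuple like (0, 18, 0)."""
    core = tag.lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def workaround_args(vllm_version):
    """Return extra 'vllm serve' arguments for the given version."""
    version = parse_version(vllm_version)
    if version == (0, 18, 0):
        # Per the post: v0.18.0 needs the flag set explicitly.
        return [WORKAROUND_FLAG]
    # v0.18.1 and later ship the fix; older versions are left untouched (assumption).
    return []
```

For example, `workaround_args("v0.18.0-cu130")` returns a one-element list containing the flag, while `workaround_args("0.18.1")` returns an empty list.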
