Inconsistent Output: First API call differs from subsequent identical calls with temperature=0 on Qwen models
We are observing inconsistent output behavior when using vLLM (v0.8.1 via vllm/vllm-openai:v0.8.1 Docker image) to serve Qwen-based models (specifically tested with DeepSeek-R1-Distill-Qwen-32B and Qwen2.5-72B-Instruct). When sending multiple identical API requests with temperature: 0, the response from the very first request differs from the responses of subsequent requests (2nd, 3rd, etc.). The responses from the 2nd request onwards are consistent with each other. This behavior is unexpected, especially with temperature: 0, which should ideally lead to deterministic outputs for identical inputs.
Environment:
vLLM Version: v0.8.1 (from Docker image vllm/vllm-openai:v0.8.1)
Model(s) Affected:
DeepSeek-R1-Distill-Qwen-32B
Qwen2.5-72B-Instruct (exhibits the same issue)
Deployment Method: Docker
GPU Configuration: 2 H100 GPUs (e.g., "device=4,5")
Tensor Parallel Size: 2
Operating System: Ubuntu 22.04
NVIDIA Driver Version: 570.124.06
CUDA Version (inside Docker): 12.8
Steps to Reproduce:
Deploy the model using vLLM.
For example, with DeepSeek-R1-Distill-Qwen-32B:
Bash
sudo docker run -d \
  --runtime nvidia \
  --gpus '"device=4,5"' \
  --ipc=host \
  --name deepseek-distill-32B-container \
  -p 8082:8082 \
  -v /mnt/data1/your_path_to_model/model:/model \
  vllm/vllm-openai:v0.8.1 \
  --model /model/DeepSeek-R1-Distill-Qwen-32B \
  --port 8082 \
  --tensor-parallel-size 2
(Note: Adjust GPU devices and model path as per your setup.)
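(Optional) Before sending the test requests, you can confirm the server is up and check the name the model is registered under. This is only a sanity-check sketch; /v1/models is the standard model-listing endpoint of vLLM's OpenAI-compatible server:
Bash
curl http://YOUR_SERVER_IP:8082/v1/models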
Send an initial API request using curl (or any HTTP client) to the /v1/chat/completions endpoint.
Example request for DeepSeek-R1-Distill-Qwen-32B:
Bash
curl http://YOUR_SERVER_IP:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
      {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
      {"role": "user", "content": "请帮我生成一份去南极洲旅行的计划,可以天马行空,不低于500字"}
    ],
    "temperature": 0,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 16384
  }'
(Note: Replace YOUR_SERVER_IP with the actual IP address of your host and /model/DeepSeek-R1-Distill-Qwen-32B with the correct model path if different from the example deployment.)
Observe the output/response from this first request.
Send the exact same API request (identical payload, headers, and endpoint) multiple more times (e.g., 4 more times for a total of 5 requests).
Compare the outputs.
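A small script like the following can automate the repeated requests and the comparison. This is only a sketch, not part of the original report: it assumes jq is installed on the client and reuses the server address and model path from the example above.
Bash
# Send the same request 5 times and save only the generated text of each response.
URL="http://YOUR_SERVER_IP:8082/v1/chat/completions"
PAYLOAD='{
  "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "请帮我生成一份去南极洲旅行的计划,可以天马行空,不低于500字"}
  ],
  "temperature": 0,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 16384
}'
for i in 1 2 3 4 5; do
  curl -s "$URL" -H "Content-Type: application/json" -d "$PAYLOAD" \
    | jq -r '.choices[0].message.content' > "response_${i}.txt"
done
# Compare every later response against the first; in our runs only request 1 differs.
for i in 2 3 4 5; do
  diff -q response_1.txt "response_${i}.txt" || echo "Request ${i} differs from request 1"
done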
Expected Behavior:
Given that temperature is set to 0, all five (or more) API calls, being completely identical in terms of input parameters and messages, should yield the exact same output string.
Actual Behavior:
The output of the first API call is unique.
The outputs of the second, third, fourth, and fifth (and subsequent) API calls are identical to each other, but different from the output of the first call.
Example Scenario:
Request 1 -> Output A
Request 2 -> Output B
Request 3 -> Output B
Request 4 -> Output B
Request 5 -> Output B
Where Output A is different from Output B.
Additional Context:
The issue has been consistently reproduced with the specified models and vLLM version.
The input prompt is in Chinese (user content: "请帮我生成一份去南极洲旅行的计划,可以天马行空,不低于500字", roughly: "Please generate a travel plan for a trip to Antarctica; feel free to be imaginative; at least 500 characters").
The parameters top_p: 0.8 and repetition_penalty: 1.05 were used, but the key parameter for deterministic behavior is temperature: 0.
This could be an initialization issue, a caching issue, or related to how model state is handled for the very first inference request versus subsequent ones (i.e., before versus after the server is fully "warmed up").
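Two hedged diagnostics that may help narrow this down (sketches based on the hypotheses above, not verified fixes): pass an explicit seed in the request to see whether the first call still differs, and relaunch the server with --enforce-eager appended to the docker run command above to rule out CUDA graph capture during the first request. The seed request field and the --enforce-eager engine flag are standard vLLM options; whether either changes this behavior on v0.8.1 is an open question.
Bash
# Same prompt with temperature 0 plus an explicit seed (diagnostic only).
curl http://YOUR_SERVER_IP:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [
      {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
      {"role": "user", "content": "请帮我生成一份去南极洲旅行的计划,可以天马行空,不低于500字"}
    ],
    "temperature": 0,
    "seed": 42,
    "max_tokens": 16384
  }'
# To test the warm-up hypothesis, relaunch the server with the additional flag
# --enforce-eager (disables CUDA graph capture) and repeat the 5-request test.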
We would appreciate any insights or fixes for this behavior. Thank you!