Using vLLM 0.10.0, reasoning content appears in the `content` field instead of `reasoning_content` with the Qwen3-30B-A3B-Thinking model
I'm encountering an issue with the Qwen3-30B-A3B-Thinking model when using vLLM v0.10.0. The model's reasoning process is being returned in the content field instead of the expected reasoning_content field.
Setup:
- vLLM version: v0.10.0
- Model: Qwen3-30B-A3B-Thinking-2507
- Startup command:
vllm serve <MODEL_PATH> --host 0.0.0.0 --port 8000 --dtype bfloat16 --max-model-len 32768 --gpu-memory-utilization 0.9 --served-model-name Qwen3-30B-A3B-Thinking-2507 --enable-lora --max-loras 8 --max-lora-rank 32 --trust-remote-code --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes --enable-chunked-prefill
Issue:
When making a chat completion request, the model's reasoning content appears in the content field with the following characteristics:
- Missing the opening <think> tag
- Contains the closing </think> tag
- All reasoning content is mixed with the final response in content
- The reasoning_content field is null
Expected behavior:
The reasoning process should be separated into the reasoning_content field, with only the final answer in the content field.
Questions:
- Is this a vLLM startup configuration issue?
- Does the model's chat template need special configuration?
- Are there additional parameters needed for proper reasoning content separation?
Any guidance from the community would be greatly appreciated!
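For reference, a minimal request that reproduces the behavior looks roughly like this (the host and port come from the serve command above; the plain `requests` call is just an illustration, not the exact client I use):

```python
# Minimal reproduction against the server started with the command above.
# (Adjust host/port if your deployment differs.)
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen3-30B-A3B-Thinking-2507",
        "messages": [{"role": "user", "content": "你好"}],
    },
    timeout=120,
)
message = resp.json()["choices"][0]["message"]

print(message.get("reasoning_content"))  # observed: None
print(message.get("content"))            # observed: reasoning text ending with </think>, then the answer
```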
Sample Response (Problematic)
{
"id": "chatcmpl-8f283e39ea9d450593e7b22a2618fbd9",
"object": "chat.completion",
"created": 1754014569,
"model": "Qwen3-30B-A3B-Thinking-2507",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "嗯,用户发来"你好",看起来是个很简单的问候。首先,我需要确认用户的需求。可能他们刚开始和我对话,想测试一下我的反应。也可能是想开始一个新的话题,但还没想好具体要问什么。\n\n接下来,我要考虑如何回应。作为AI助手,应该友好、热情,同时引导用户说出他们的需求。避免太长的回复,保持简洁。可能需要用中文,因为用户的问候是中文的。\n\n然后,检查有没有其他潜在需求。比如,用户可能之前有其他问题,但这里只是开始。或者他们可能想确认我是否在线。所以,回复要包含欢迎语,并主动询问需要什么帮助。\n\n还要注意语气,不要太正式,用"你好!"开头,加个表情符号可能更亲切,比如😊。然后列出几个常见帮助方向,比如解答问题、写故事、编程等,这样用户知道可以怎么用我。\n\n另外,避免使用复杂词汇,确保易懂。比如"有什么问题或需要帮忙的吗?"这样比较自然。可能还要检查有没有错别字,确保回复准确。\n\n最后,确保符合之前的指示,比如不涉及敏感内容,保持中立。现在组织语言:先问候,再表达乐意帮忙,然后举例说明能提供的帮助,最后邀请用户提问。这样结构清晰,用户容易回应。\n</think>\n\n你好!😊 很高兴见到你~有什么问题或需要帮忙的吗?比如解答疑问、写故事、编程,或者随便聊聊都可以哦!等你来告诉我~",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning_content": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 11,
"total_tokens": 326,
"completion_tokens": 315,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"kv_transfer_params": null
}
Expected Response Structure
{
"choices": [
{
"message": {
"role": "assistant",
"content": "你好!😊 很高兴见到你~有什么问题或需要帮忙的吗?比如解答疑问、写故事、编程,或者随便聊聊都可以哦!等你来告诉我~",
"reasoning_content": "嗯,用户发来"你好",看起来是个很简单的问候。首先,我需要确认用户的需求...[reasoning process]"
}
}
]
}
Please read this model's description again.
Additionally, to enforce model thinking, the default chat template automatically includes <think>. Therefore, it is normal for the model's output to contain only </think> without an explicit opening <think> tag.
LMStudio updated their code to support this. No idea about vLLM.
https://lmstudio.ai/beta-releases#release-notes---lm-studio-0321-build-2-beta
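For what it's worth, you can see the pre-filled opening tag by rendering the default chat template yourself (a quick sketch with transformers; the Hub model ID is an assumption, substitute your local model path if needed):

```python
# Render the model's default chat template to inspect the generation prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Thinking-2507")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(prompt[-40:]))  # the default template is expected to end with '<|im_start|>assistant\n<think>\n'
```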
@aweffr the chat_template is wrong; you can use this template on vLLM: https://huggingface.co/jart25/Qwen3-30B-A3B-Thinking-2507-Autoround-Int-8bit-gptq/blob/main/chat_template.jinja
You can use vllm serve <MODEL_PATH> <other_parameters> --chat-template ./chat_template.jinja to prompt the model to generate <think> tags on its own. This works for Open WebUI. I just tested it and found it also works for chat completion requests :)
Save the following code to ./chat_template.jinja:
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}
{%- set content = (content.split('</think>')|last).lstrip('\n') %}
{%- endif %}
{%- endif %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
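After restarting the server with --chat-template ./chat_template.jinja, a quick sanity check could look like the sketch below (the base_url matches the serve command above, the API key is a placeholder since vLLM ignores it unless --api-key is set, and reasoning_content is vLLM's extension field on the message):

```python
# Sanity check after restarting vLLM with the custom chat template.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-30B-A3B-Thinking-2507",
    messages=[{"role": "user", "content": "你好"}],
)
msg = resp.choices[0].message

# With this template plus --reasoning-parser qwen3, the reasoning should now land
# in reasoning_content, and content should hold only the final answer.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```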