Update chat template to support multimodal function calling
Hi!
I noticed that the Qwen2.5-VL blog post mentions support for tool calling, but the current chat template does not enable it: it has no handling for tool definitions, assistant tool calls, or tool messages. Since I needed tool calling for a research project, I rewrote the chat template accordingly.
My revision is based on the official Qwen documentation on function calling, as well as the chat template used for the base LLM on Hugging Face. I tried to adapt the vision-language chat template to match that structure more closely.
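For readers unfamiliar with that structure, here is a minimal Python sketch of how the system turn with tool signatures is laid out in the proposed template further down. The helper name `render_system_turn` and the `search_documents` tool are hypothetical, and `json.dumps` only stands in for Jinja's `tojson` filter, so escaping details may differ from what the real template emits.

```python
import json

# Instruction text copied from the proposed template in this post.
TOOLS_HEADER = (
    "\n\n# Tools\n\nYou may call one or more functions to assist with the user query."
    "\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>"
)
TOOLS_FOOTER = (
    "\n</tools>\n\nFor each function call, return a json object with function name "
    "and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n"
    '{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call>'
)


def render_system_turn(system_text, tools):
    """Build the system turn: system text, then one JSON line per tool."""
    body = system_text
    if tools:
        tool_lines = "".join("\n" + json.dumps(tool) for tool in tools)
        body += TOOLS_HEADER + tool_lines + TOOLS_FOOTER
    return "<|im_start|>system\n" + body + "<|im_end|>\n"


# Hypothetical tool definition, for illustration only.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search a document index.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

prompt = render_system_turn("You are a helpful assistant.", [search_tool])
print(prompt)
```

Without tools, the same helper falls back to a plain system turn, which matches the non-tool branch of the template.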
I've only tested the updated template with images so far. In my case it successfully enabled tool calling with the AWQ version of the 32B model, although the results were unstable for my use case (agentic RAG), which I think is due to the model's training data. I'm sharing it here in case it's helpful for others or can serve as a starting point for further improvement. Hopefully this brings us closer to full tool support for the model.
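Since the model's tool calls can be unstable, it may help to parse completions defensively. The following is a rough sketch, not part of the template itself: it extracts `<tool_call>` blocks in the format the template asks the model to produce and collects malformed ones instead of raising. The function name `parse_tool_calls` and the sample completion are made up for illustration.

```python
import json
import re

# Matches each <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(generated_text):
    """Return (calls, errors) extracted from a model completion."""
    calls, errors = [], []
    for raw in TOOL_CALL_RE.findall(generated_text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        if not isinstance(payload, dict) or "name" not in payload:
            errors.append("missing 'name' field")
            continue
        calls.append(
            {"name": payload["name"], "arguments": payload.get("arguments", {})}
        )
    return calls, errors


# Hypothetical completion with one valid and one truncated call.
completion = (
    'Let me look that up.\n<tool_call>\n'
    '{"name": "search_documents", "arguments": {"query": "invoice total"}}\n'
    '</tool_call>\n<tool_call>\n{"name": "search_documents", "arguments": \n</tool_call>'
)
calls, errors = parse_tool_calls(completion)
print(calls)
print(len(errors))
```

Returning the errors alongside the valid calls lets an agent loop decide whether to retry the generation or continue with the calls that did parse.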
For easier comparison, here is the current chat template:
{% set image_count = namespace(value=0) %}
{% set video_count = namespace(value=0) %}
{% for message in messages %}
{% if loop.first and message['role'] != 'system' %}
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
{% endif %}
<|im_start|>{{ message['role'] }}\n
{% if message['content'] is string %}
{{ message['content'] }}<|im_end|>\n
{% else %}
{% for content in message['content'] %}
{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}
{% set image_count.value = image_count.value + 1 %}
{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}
<|vision_start|><|image_pad|><|vision_end|>
{% elif content['type'] == 'video' or 'video' in content %}
{% set video_count.value = video_count.value + 1 %}
{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}
<|vision_start|><|video_pad|><|vision_end|>
{% elif 'text' in content %}
{{ content['text'] }}
{% endif %}
{% endfor %}
<|im_end|>\n
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
<|im_start|>assistant\n
{% endif %}
And here is a proposed version that supports tool calling:
{%- set image_count = namespace(value=0) -%}
{%- set video_count = namespace(value=0) -%}
{%- macro render_content(message) -%}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for content in message['content'] -%}
{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}
{%- set image_count.value = image_count.value + 1 -%}
{%- if add_vision_id -%}
{{- 'Picture ' ~ image_count.value ~ ': ' -}}
{%- endif -%}
{{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
{%- elif content['type'] == 'video' or 'video' in content -%}
{%- set video_count.value = video_count.value + 1 -%}
{%- if add_vision_id -%}
{{- 'Video ' ~ video_count.value ~ ': ' -}}
{%- endif -%}
{{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
{%- elif 'text' in content -%}
{{- content['text'] -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- endmacro -%}
{%- if tools -%}
{{- '<|im_start|>system\n' -}}
{%- if messages[0]['role'] == 'system' -%}
{{- render_content(messages[0]) -}}
{%- else -%}
{{- 'You are a helpful assistant.' -}}
{%- endif -%}
{{- '\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>' -}}
{%- for tool in tools -%}
{{- '\n' -}}
{{- tool | tojson -}}
{%- endfor -%}
{{- '\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n' -}}
{%- else -%}
{%- if messages[0]['role'] == 'system' -%}
{{- '<|im_start|>system\n' ~ render_content(messages[0]) ~ '<|im_end|>\n' -}}
{%- else -%}
{{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' -}}
{%- endif -%}
{%- endif -%}
{%- for message in messages -%}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) -%}
{{- '<|im_start|>' ~ message.role ~ '\n' ~ render_content(message) ~ '<|im_end|>\n' -}}
{%- elif message.role == "assistant" -%}
{{- '<|im_start|>' ~ message.role -}}
{%- if message.content -%}
{{- '\n' ~ render_content(message) -}}
{%- endif -%}
{%- for tool_call in message.tool_calls -%}
{%- if tool_call.function is defined -%}
{%- set tool_call = tool_call.function -%}
{%- endif -%}
{{- '\n<tool_call>\n{"name": "' ~ tool_call.name ~ '", "arguments": ' ~ (tool_call.arguments | tojson) ~ '}\n</tool_call>' -}}
{%- endfor -%}
{{- '<|im_end|>\n' -}}
{%- elif message.role == "tool" -%}
{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") -%}
{{- '<|im_start|>user' -}}
{%- endif -%}
{{- '\n<tool_response>\n' ~ render_content(message) ~ '\n</tool_response>' -}}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") -%}
{{- '<|im_end|>\n' -}}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant\n' -}}
{%- endif -%}
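To make the message loop easier to follow, here is a plain-Python sketch that mirrors its branching: ordinary turns, assistant turns with tool calls, and consecutive tool messages grouped into a single user turn. It is an illustration under assumptions, not a replacement for the Jinja template: the helper names, the `read_chart` tool, and the file name are hypothetical, `json.dumps` stands in for `tojson`, and the `add_vision_id` labels, the system preamble, and the generation prompt are left out.

```python
import json


def text_of(message):
    """Flatten content to text; image/video parts become placeholder tokens."""
    content = message.get("content") or ""
    if isinstance(content, str):
        return content
    parts = []
    for part in content:
        if part.get("type") == "image" or "image" in part or "image_url" in part:
            parts.append("<|vision_start|><|image_pad|><|vision_end|>")
        elif part.get("type") == "video" or "video" in part:
            parts.append("<|vision_start|><|video_pad|><|vision_end|>")
        elif "text" in part:
            parts.append(part["text"])
    return "".join(parts)


def render_turns(messages):
    """Mirror the template's message loop (a leading system turn is skipped)."""
    out = []
    for i, msg in enumerate(messages):
        role = msg["role"]
        has_calls = bool(msg.get("tool_calls"))
        if (
            role == "user"
            or (role == "system" and i > 0)
            or (role == "assistant" and not has_calls)
        ):
            out.append("<|im_start|>" + role + "\n" + text_of(msg) + "<|im_end|>\n")
        elif role == "assistant":
            out.append("<|im_start|>assistant")
            if msg.get("content"):
                out.append("\n" + text_of(msg))
            for call in msg["tool_calls"]:
                call = call.get("function", call)  # accept both call layouts
                out.append(
                    '\n<tool_call>\n{"name": "' + call["name"] + '", "arguments": '
                    + json.dumps(call["arguments"]) + "}\n</tool_call>"
                )
            out.append("<|im_end|>\n")
        elif role == "tool":
            # Open one user turn for a run of consecutive tool messages.
            if i == 0 or messages[i - 1]["role"] != "tool":
                out.append("<|im_start|>user")
            out.append("\n<tool_response>\n" + text_of(msg) + "\n</tool_response>")
            # Close the turn after the last tool message in the run.
            if i == len(messages) - 1 or messages[i + 1]["role"] != "tool":
                out.append("<|im_end|>\n")
    return "".join(out)


# Hypothetical conversation: one image question, one tool call, two tool results.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "chart.png"},
        {"type": "text", "text": "What is the peak value?"},
    ]},
    {"role": "assistant", "content": "", "tool_calls": [
        {"type": "function",
         "function": {"name": "read_chart", "arguments": {"axis": "y"}}},
    ]},
    {"role": "tool", "content": "peak=42"},
    {"role": "tool", "content": "unit=ms"},
]
rendered = render_turns(messages)
print(rendered)
```

To try the actual Jinja template, one option is to pass it through the `chat_template` argument of `apply_chat_template` in Transformers together with `tools=...`; whether the processor or the tokenizer is the right entry point may depend on the installed version.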
(P.S. In the one-line version of the template I made a typo: I accidentally added an extra substring at the starting point for generation, and it should be removed.)
Thanks for your work on this project, and all the best!
Nicola