'thinking' in Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x-i1-GGUF [modification when using ollama]

#1306
by dakerholdings - opened

This does not appear to have a discrete implementation in the template the way 'ollama run' expects: --think=false and /set nothink do not turn off the thinking section as they apparently do in other Qwen3 models, and even the --hidethinking flag doesn't seem to work.
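
For reference, the sort of invocations that have no effect here look roughly like this (the hf.co pull syntax is just one way to fetch the quant, and flag support depends on the ollama version in use):

ollama run hf.co/mradermacher/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x-i1-GGUF:Q4_K_M --think=false
ollama run hf.co/mradermacher/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x-i1-GGUF:Q4_K_M --hidethinking
# and inside the interactive session:
/set nothink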

You obviously can't turn off thinking in a brainstorming model, and the reason that hiding the thinking doesn't work probably has to do with how the model was fine-tuned to have its unique brainstorming feature. I recommend you give this feedback to the original author under https://huggingface.co/DavidAU/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x - maybe he can adjust the chat template so the thinking can be hidden. I don't think this is an issue with our quants.

I don't intend to download his version to see if it still happens there; this one came with its own chat template, which differs in obvious ways from the thinking versions downloadable on ollama.com.

[FWIW, yours went into an endless multi-line repeat, when answering a complex coding question, so I think you might want to adjust that too].

[I was also attempting to use the Q4_K_M version of this with qwen-coder, and it wasn't responding at all]

We always use the chat template provided by the original model author in the original model we quantize. It is the original model author's responsibility to provide the best possible chat template for his model. If @DavidAU has in the meantime decided to change it, he can let us know and we will requant the model with his latest chat template, but based on the git history I don't think he changed it.

I would also hope that ollama would use the chat template from the model, and not override it with some random other template, by default. I would consider it a bug if ollama ignored it, so I am doubtful that this happens.

@dakerholdings

Please note the template used is the original from the fine tune itself.
You can see this in the "model tree" here:
https://huggingface.co/DavidAU/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x

I have not modified the model, with the exception of the Brainstorm adapter, which does not adjust the core model "code", so to speak.
I have verified the original "chat template" is embedded in the model.
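
For anyone who wants to double-check that, the GGUF metadata can be dumped directly; a minimal sketch using the gguf Python package (the filename is a placeholder for whichever quant you downloaded):

pip install gguf
gguf-dump ./Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x.i1-Q4_K_M.gguf | grep -i chat_template

If the tokenizer.chat_template key shows up in the output, the template really is embedded in the file.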

Issues with Ollama must be addressed by Ollama; or use a different AI app such as LM Studio, Jan, KoboldCpp, Text Gen Web UI, etc.

Ollama surely seems to cause a whole lot of problems like this. What are they cooking...

@mradermacher

They are making a walled garden in the land of open source.
Their users are suffering; something(s) are amiss in their framework and/or approach...

Which is exactly why I never used and will never use ollama. In the end it's just a wrapper around llama.cpp with a ton of bullshit added: converting models into their own stupid format, ignoring the original model's chat template for no reason, being slower to add support for newly released models, and not exposing advanced llama.cpp features like RPC that are required to run huge models. It is worse than directly using llama.cpp's llama-server in every way and only causes issues for their users.
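
For comparison, serving this exact quant straight from llama.cpp is a one-liner; llama-server can pull from Hugging Face itself and uses the embedded chat template (the :Q4_K_M tag and the --jinja flag are what I'd try first; exact flags can vary with the llama.cpp version):

llama-server -hf mradermacher/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x-i1-GGUF:Q4_K_M --jinja -c 16384 --port 8080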

@DavidAU

FWIW, I notice that others of your Qwen3 derivatives do appear to support similar optional 'thinking' (and tools) in the template, so maybe someone stripped that out?
e.g.: https://huggingface.co/DavidAU/Qwen3-53B-A3B-2507-THINKING-TOTAL-RECALL-v2-MASTER-CODER?chat_template=default

I've been getting <= 15 tps with this model and am considering trying MLX instead of ollama; however, the closest I've seen so far appears to be something fairly vanilla like:

https://huggingface.co/mlx-community/Qwen2.5-Coder-7B-Instruct-4bit

[Though I haven't tried it yet, from what I've read of the ANE and Apple optimization, my intuition tells me that something made via mlx-lm from a higher-precision source with --quant-predicate mixed_3_6 and maybe --dtype float16 might perform better; if someone were to upload such a quant, I think I'd be willing to try it out!]
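
A rough sketch of that conversion with mlx-lm, using the flags mentioned above (the source repo and output path are illustrative, and the exact predicate names depend on the mlx-lm version):

pip install mlx-lm
mlx_lm.convert --hf-path DavidAU/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x \
    -q --quant-predicate mixed_3_6 --dtype float16 \
    --mlx-path ./Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x-mlx-mixed_3_6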

{FYI, WRT 'getting <= 15 tps': this is when setting the context length to a value low enough for everything to fit on the GPU; if I set it to 128k, it drops to ~5 tps, not much better than qwen3-coder:30b.}
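
That drop is plausible from KV-cache growth alone. As a back-of-the-envelope estimate, assuming Qwen3-4B-like attention dimensions (36 layers, 8 KV heads, head dim 128; the Brainstorm build adds layers, so the real figure is higher), an fp16 cache at 128k context needs roughly

2 (K and V) x 36 layers x 8 heads x 128 dims x 131072 tokens x 2 bytes ≈ 19 GB

on top of the weights, which is why it stops fitting on the GPU; q8_0 KV quantization roughly halves that.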

If all you care about is inference speed and you have the required GPU memory to fit the entire model in 4 bit, then vLLM is the way to go. vLLM is amazing, especially if you have multiple GPUs, so it can use tensor parallelism and many concurrent requests. vLLM absolutely destroys every other inference engine when it comes to inference speed at scale, supporting up to 256 concurrent requests at thousands of tokens per second of generation speed. vLLM even supports GGUFs for many models.
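
For the GGUF route specifically, vLLM wants the tokenizer/config from the original unquantized repo passed alongside the file, so a sketch looks roughly like this (local filename is a placeholder; flags per recent vLLM releases):

vllm serve ./Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x.i1-Q4_K_M.gguf \
    --tokenizer DavidAU/Qwen3-Blitzar-Coder-F1-6B-Brainstorm20x --max-model-len 16384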

Now tried vLLM; however, errors... even after getting around their apparent requirement for a config.json or tokenizer to accompany the GGUF by pointing at the previous repo, I'm still getting "RuntimeError: Cannot find any model weights", and apparently someone recently broke Qwen3 support for now: https://github.com/vllm-project/vllm/issues/21511
It also looks like they don't support the M1 processor very well... and even if the above were overcome, it seems unlikely I'd have enough VRAM to load the entire model plus KV cache, etc. (I had to turn on flash attention and quantized KV in ollama for speed).
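
(For reference, those two ollama toggles are server-side environment variables, roughly:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

with q8_0 being one of the supported KV cache quantization types.)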

Instead of loading the GGUF in vLLM, I would just load the original model but in 4 bits using bitsandbytes. Here is the command I used to run GLM-4.5V in 4 bits on 2x A100 40 GB:

CUDA_VISIBLE_DEVICES=0,1 venv/bin/vllm serve /pool16_2/GLM-4.5V --quantization bitsandbytes --max-model-len 20000 --served-model-name gpt-3.5-turbo --gpu-memory-utilization 0.94 --port 8000 --trust-remote-code --tensor-parallel-size 2 --distributed-executor-backend mp --enforce-eager

But this obviously assumes that you have enough GPU memory to load the model in 4 bit plus the context length. I have never heard of vLLM being broken on M1, but I likely wouldn't know, as I don't have an M1. Regarding Qwen3 support, I never had issues with it using vLLM as far as I can remember, and I must surely have tried many Qwen3-based models.
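
Once a vLLM instance like the one above is up, it exposes the standard OpenAI-compatible API, so a quick smoke test is roughly (port and served model name as set in the serve command):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'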

@dakerholdings

The "thinking process" is not the straight forward for Qwens. You can of course use a system prompt with "" "" tags in it .

However, the default is that this process is activated directly in the model (including think tag generation); fine tune(s), merge(s) and other adjustments can prevent/block this from working correctly.

The 53B has the original source + an adapter; although technically a "fine tune", it does not affect the core model directly in the way fine-tuning can corrupt the model.
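
For a stock Qwen3 instruct model, the documented soft switch is appending /think or /no_think to the user or system message; whether this particular fine tune still honours it is exactly what is in question here, but it is cheap to test against any OpenAI-compatible endpoint serving the model, roughly (host/port are placeholders):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain binary search. /no_think"}]}'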

vLLM support for Mac Metal is currently "Closed as not planned": https://github.com/vllm-project/vllm/issues/2081

Nothing stops you from booting Linux on your Mac and running vLLM on Linux, unless the GPU you have only works under macOS or only via Metal. But as someone not owning any Apple hardware, I simply don't know. I now understand why you don't want to use vLLM, though, as it probably isn't worth the hassle to set up in your case.

FWIW, I asked 'Qwen' what its name meant, and it indicated it comes from the Chinese characters for 'a thousand questions', which I guess may explain their arcane, somewhat proprietary XML tool usage and thinking issues ;-/ I think I may switch to a DeepSeek 3.1 (or GLM 4.5) variant if/when smaller models become available.

Recent 'Apple Silicon' such as the M1 is custom, comprising a set of CPU, TPU, and GPU cores (all sharing relatively high-bandwidth memory), with the mix depending on the SKU. Unfortunately, it seems that all of the current LLM servers and formats effectively require one to choose a single processing variant, instead of maximizing simultaneous use of all of the available resources. From what I've read, outside of internally developed projects Apple hasn't made it very easy for external developers to use the (reputedly 16-bit matrix ops?) TPU/Neural Engine (ANE) yet. I guess NVIDIA historically hasn't made direct access to its silicon easy for third-party developers either. I'm not aware of any drivers having been reverse-engineered to the extent necessary to support LLMs via other OSes on Apple's hardware (e.g. PyTorch MPS/Metal).

FYI: I just used mlx_lm to serve --model cs2764/Qwen3-4B-mlx-mixed_4_6, which appears to support suppressing 'thinking' without problems via --chat-template-config '{"enable_thinking":false}', but it apparently doesn't support the TPU/ANE yet, and has generally been running at 19-23 tps (I guess maybe I should expect to have to upgrade my hardware to get good local performance). The mlx_lm server's performance isn't as consistent as ollama's (it seems to dynamically grow memory and then hurl). In IDEs, 'RooCode' seems to run about the best with it (and with your model, also) so far, but appears to encounter some difficulties with tool usage.
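
The full invocation was roughly the following (the port is whatever the IDE is pointed at; --chat-template-config just forwards extra keyword arguments to the chat template):

mlx_lm.server --model cs2764/Qwen3-4B-mlx-mixed_4_6 --port 8080 \
    --chat-template-config '{"enable_thinking": false}'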

Thanks for your assistance...

In most cases, LLMs are limited by memory throughput, at least for inference, so not supporting all three compute types is likely not as much of a limitation as one might superficially think.
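
As a rough upper bound for single-stream decoding, every generated token has to stream the whole quantized model through memory, so tokens/s is at most about memory bandwidth divided by model size. With assumed figures of ~68 GB/s for a base M1 and ~4 GB for a 6B Q4_K_M:

68 GB/s / 4 GB ≈ 17 tokens/s

which is in the same ballpark as the 15-23 tps reported above; extra compute units would not move that ceiling much.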
