THIS WORKS SO WELL on a 10-year-old PC
Wasn't able to use both GPUs, but just using the GTX 1660 Ti with CPU expert offloading gets a usable 14 tps
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server --jinja -m ../Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf \
-ngl 99 --port 6900 --host 100.126.169.3 -fa -amb 512 -mla 3 -fmoe -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn_.*=CUDA0" \
-ot exps=CPU -c 13000
I wanted to offload some layers to the 2nd GPU, but for some reason the old i7-6700K CPU has a lot of AVX & FMA instructions that Intel later dropped, and it seems those work great for the 4-bit weights.
Device 0: NVIDIA GeForce GTX 1660 Ti, compute capability 7.5, VMM: yes
INFO [ main] build info | tid="112116302307328" timestamp=1755975177 build=3857 commit="e008c0e1"
INFO [ main] system info | tid="112116302307328" timestamp=1755975177 n_threads=4 n_threads_batch=-1 total_threads=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
I'll try to get some offloading to the 2nd GPU, but it might be counterproductive.
Thank you, great job on the quant
Haha, amazing test! Yes, with only 3B active params quantized to ~4 bpw it's a pretty usable speed even with older DDR memory!
One tip for your command: no need to specify -amb 512 -mla 3
as those are only relevant to MLA models like the full-sized DeepSeek and Kimi-K2 models. They have no effect on Qwen or GLM models, which use GQA or other attention mechanisms, so you can remove them to simplify.
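A minimal sketch of the simplified launch, just your original command with those two flags dropped (untested on my end):
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server --jinja -m ../Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf \
-ngl 99 --port 6900 --host 100.126.169.3 -fa -fmoe -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn_.*=CUDA0" \
-ot exps=CPU -c 13000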
What is the arch of the 1050 Ti? Is that newer than the P40? I think it would have f16 support and might be just barely new enough to work okay. You can try it easily enough by offloading the next few layers to it, e.g. -ot "blk\.(9|10|11)\.ffn_.*=CUDA1"
and adding more blocks until you fill up the VRAM.
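If that works, the whole thing might look roughly like this, assuming the 1050 Ti shows up as CUDA1 and has room for those layers (a sketch, not tested here):
CUDA_VISIBLE_DEVICES=0,1 ./build/bin/llama-server --jinja -m ../Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf \
-ngl 99 --port 6900 --host 100.126.169.3 -fa -fmoe \
-ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn_.*=CUDA0" -ot "blk\.(9|10|11)\.ffn_.*=CUDA1" \
-ot exps=CPU -c 13000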
Also you could try -ub 4096 -b 4096
for larger batch sizes, which gives much faster PP (prompt processing) at the cost of a little more VRAM.
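e.g. just appended to the end of the same command (VRAM permitting; a sketch):
... -ot exps=CPU -c 13000 -ub 4096 -b 4096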
Finally, you could try -rtr
with default batch sizes, which may give a little more TG (token generation) speed, as it repacks the tensors into interleaved rows, which can improve CPU/memory/cache performance.
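Again just tacked onto the end of the same command, keeping the default -ub/-b (a sketch):
... -ot exps=CPU -c 13000 -rtr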
Have fun!
OK, so it turns out the 8+ year old Intels do support F16 and have FMA instructions on chip, i.e. the 10-year-old i7-6700K runs it a lot better than the 4-year-old i7 in the laptop because the newer-ish Intels didn't have those extra instructions.
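A quick way to confirm what the CPU actually exposes on Linux, nothing specific to this build, just a standard /proc/cpuinfo check:
grep -oEw 'avx2|fma|f16c|avx512f' /proc/cpuinfo | sort -u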
SO OFFLOADING THE KV CACHE TO CPU WITH --nkvo WORKS WELL!!!
CUDA_VISIBLE_DEVICES=0,1 ./build/bin/llama-server --jinja -m ../Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf \
-ngl 99 --port 6900 --host 100.126.169.3 -c 32000 -fa --n-cpu-moe 23 -rtr -ts 7,2 -nkvo
This ^ command worked well. I was able to paste in a 12k-token prompt. Speed was the same initially, but after 20k tokens it dropped to 5 tps. Still surprised this old system can actually run a vibe-coding tool now.
And yeah, the 1050 up through the 1080 Ti support float16 (it seems :/) but the GTX 970 / GTX 980 don't.
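If the driver is recent enough, nvidia-smi can report the compute capability directly (Pascal cards like the 1050 through 1080 Ti are 6.1, Maxwell 970/980 are 5.2), though this query field only exists on newer drivers:
nvidia-smi --query-gpu=name,compute_cap --format=csv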
By the way, I remember researching a while ago that older first-gen AMD EPYC and older Intel Xeons had UNDOCUMENTED fp4 support via the AVX instructions, but I haven't been able to find much about it or anyone using it.
Unsure if the 4-bit datatypes already get accelerated by it anyway. But it'd probably work well on the short-lived generation of Intels that had 128 MB of L3 cache.
I think the issue (or lack thereof :) with using multiple GPUs plus a CPU is that it saturates the PCIe bus too much. So unless there's an SLI 4x bridge between the cards, I think it'd be better to just use one GPU for compute and only use the other GPUs to offload memory to, without having them compute.
I have a hunch it'd be better to only have one GPU pass around the hidden state, but I'm not able to test this with my current setup unless I recompile llama.cpp.