Qwopus3.6-27B-Coder-FP8 INT4 AutoRound

W4A16 INT4 AutoRound quantization of Jackrong/Qwopus3.6-27B-Coder-FP8.

  • Quantization: AutoRound INT4 weights with 16-bit activations (W4A16), group size 128, symmetric, packing format auto_round:auto_gptq.
  • Source checkpoint: Jackrong/Qwopus3.6-27B-Coder-FP8 at the time of quantization.
  • Non-text multimodal modules are kept in their original precision.
  • Native Qwen3.5/Qwen3.6 MTP is preserved. mtp.fc is stored as BF16 mtp.fc.weight, not packed mtp.fc.qweight, so vLLM can load the MTP drafter.
  • Produced on one RunPod H200 SXM with AutoRound nightly.
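
As a rough illustration of what "INT4, group size 128, symmetric" means, here is a minimal sketch of symmetric group-wise quantization using plain round-to-nearest. It is not AutoRound's algorithm, which tunes rounding and clipping rather than rounding to nearest, and it does not reproduce the packed GPTQ storage layout. The function names, the max_abs / 7 scale rule, and the [-8, 7] clamp are assumptions made for this example.

```python
GROUP_SIZE = 128  # group size stated in the quantization bullet


def quantize_group(weights):
    """Symmetric round-to-nearest INT4 for one group. Returns (scale, codes)."""
    max_abs = max(abs(w) for w in weights)
    # One scale per group and no zero point: that is the "symmetric" part.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_group(scale, codes):
    """Rebuild approximate weights from one group's scale and integer codes."""
    return [scale * c for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize one weight row in consecutive groups of `group_size` values."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]
```

Each group carries its own scale, so an outlier only degrades the 128 weights that share its group rather than the whole row.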

vLLM

vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

For long-context serving, raise --max-model-len according to your KV-cache budget.
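
Once the server is up, a single request is enough to confirm that it answers. The sketch below assumes vLLM's OpenAI-compatible /v1/completions endpoint on localhost at the default port 8000; the helper names, the prompt, and the timeout are illustrative and not part of this repo.

```python
import json
import urllib.request

MODEL = "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=32):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling complete("def add(a, b):") against a running server should return a short continuation; adjust base_url if the server was started with a different host or port.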

vLLM CUDA 13 Smoke and Benchmarks

Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.

The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After vLLM install, nvidia-cutlass-dsl-libs-cu13 was force-reinstalled once to fix a CUTLASS RECORD mismatch; after that vLLM used the FlashInfer GDN prefill kernel.

vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded the AutoRound/AutoGPTQ path with MarlinLinearKernel for AutoGPTQLinearMethod, and completed generation. MTP speculative decoding resolved Qwen3_5MTP, loaded without missing-parameter warnings, shared embedding/lm_head with the draft model, and completed generation.
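
The MTP load depends on mtp.fc being stored unpacked, which can be checked from tensor names alone, without loading any weights. This is a minimal sketch that assumes the checkpoint ships a standard model.safetensors.index.json with a weight_map; the function names and return labels are hypothetical, and the check looks only at names, not at dtypes.

```python
import json


def classify_mtp_fc(tensor_names):
    """Report how mtp.fc is stored, judging only by tensor names."""
    names = list(tensor_names)
    if any(name.endswith("mtp.fc.qweight") for name in names):
        return "packed"    # quantized layout, the case this repo avoids
    if any(name.endswith("mtp.fc.weight") for name in names):
        return "unpacked"  # plain weight tensor, as described for this repo
    return "missing"


def load_tensor_names(index_path):
    """Read tensor names from a safetensors index file (assumed to exist)."""
    with open(index_path, encoding="utf-8") as handle:
        return list(json.load(handle)["weight_map"])
```

For this repo, classify_mtp_fc(load_tensor_names("model.safetensors.index.json")) would be expected to return "unpacked" if the index file is present under that name.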

Benchmarks used vllm bench throughput, fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. TPS values are vLLM timed-section values; wall time includes model load, compile, CUDA graph capture, and warmup.

| case | input -> output | prompts | gpu util | mode | total tok/s | prompt tok/s (est.) | output tok/s (est.) | peak VRAM (GiB) | max power (W) |
|---|---|---|---|---|---|---|---|---|---|
| balanced_graph_u65 | 1024 -> 128 | 64 | 0.65 | graph | 6369.6 | 5661.9 | 707.7 | 117.6 | 850.4 |
| prefill_graph_u65 | 4096 -> 16 | 32 | 0.65 | graph | 7416.7 | 7387.8 | 28.9 | 117.6 | 857.4 |
| decode_graph_u65 | 128 -> 256 | 64 | 0.65 | graph | 4221.6 | 1407.2 | 2814.4 | 116.6 | 819.7 |
| balanced_eager_u65 | 1024 -> 128 | 32 | 0.65 | eager | 2453.9 | 2181.3 | 272.7 | 118.6 | 823.9 |
| balanced_graph_u85 | 1024 -> 128 | 64 | 0.85 | graph | 6614.3 | 5879.4 | 734.9 | 153.9 | 851.3 |
| balanced_mtp_u65 | 1024 -> 128 | 32 | 0.65 | graph + MTP | 4796.2 | 4263.3 | 532.9 | 118.1 | 846.5 |
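
The estimated prompt and output rates in the benchmark rows appear consistent with splitting total tok/s by each request's input and output token share. The sketch below shows that arithmetic; it is inferred from the published numbers rather than confirmed as the method used, and the function name is illustrative.

```python
def split_throughput(total_tok_s, input_len, output_len):
    """Split a total tokens/s figure by the input/output token ratio."""
    total_len = input_len + output_len
    prompt_tok_s = total_tok_s * input_len / total_len
    output_tok_s = total_tok_s * output_len / total_len
    return prompt_tok_s, output_tok_s


# balanced_graph_u65: 1024 input tokens, 128 output tokens, 6369.6 total tok/s
prompt_rate, output_rate = split_throughput(6369.6, 1024, 128)
print(f"prompt ~ {prompt_rate:.1f} tok/s, output ~ {output_rate:.1f} tok/s")
```

Applied to the published totals, this matches every row to within 0.1 tok/s, with the small differences attributable to the totals already being rounded. The output column is therefore a share of overall throughput, not a separately measured decode rate.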

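First graph runs had cold costs of around 77-80 seconds for torch.compile plus CUDA graph capture and profiling. Repeated graph runs with the same layout reused the compile cache and started much faster. Eager mode was substantially slower than graph mode on this workload: 2453.9 versus 6369.6 total tok/s on the balanced case, although the eager run used 32 prompts rather than 64.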

24GB RTX 3090 vLLM Smoke

A small fit smoke test was run on 2026-06-14 on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.

The smoke test used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, and one 128 -> 16 random request. The vLLM Qwen3.5/Qwen3.6 recipe recommends MTP-1 speculative decoding with prefix caching disabled for latency-sensitive low-concurrency serving.

| mode | load format | result | peak VRAM | KV cache | 32k concurrency | smoke throughput |
|---|---|---|---|---|---|---|
| no MTP | fastsafetensors | pass | 22174 MiB | 64170 tokens | 1.96x | 50.33 total tok/s, 5.59 output tok/s |
| MTP-1 | safetensors | pass | 24110 MiB | 60681 tokens | 1.85x | 28.94 total tok/s, 3.22 output tok/s |
| MTP-1 | fastsafetensors | fail | 23778 MiB | n/a | n/a | CUDA OOM while allocating a 3.00 GiB staging buffer |
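
The 32k concurrency figures appear to be the KV-cache capacity in tokens divided by max_model_len. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def max_concurrency(kv_cache_tokens, max_model_len):
    """How many full-length sequences fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


MAX_MODEL_LEN = 32768  # max_model_len used in the smoke run

for label, kv_tokens in [("no MTP", 64170), ("MTP-1", 60681)]:
    ratio = max_concurrency(kv_tokens, MAX_MODEL_LEN)
    print(f"{label}: {ratio:.2f}x full-length sequences")
```

Both passing runs stay below 2x, so with these settings a 24GB card holds one full 32768-token sequence with some headroom, which is consistent with running max_num_seqs=1.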

Recommended 24GB command shape:

vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --load-format safetensors

For MTP-1 on 24GB, keep --load-format safetensors and add:

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Provenance

This repo was generated from the public Apache-2.0 source checkpoint. It keeps the upstream tokenizer, processor, chat template, vision config, and Qwen3.5 MTP config intact.
