DeepSeek V4 Flash dsv4_int INT4/INT8

This checkpoint is for the AppMana Ampere vLLM fork:

https://github.com/AppMana/forks-vllm-ampere

The container image used for serving is published from that fork.

This checkpoint sets:

"__experimental_enable_imma_from_https://github.com/appMana/forks-vllm-ampere": true

That field is read by the AppMana fork to select the IMMA INT8 paths without requiring deployment-specific environment variables.
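
As a rough illustration of how a loader could test for this field, here is a minimal sketch. It assumes the field sits at the top level of a JSON config shipped with the checkpoint; the file name, the nesting, and the helper `imma_enabled` are hypothetical and not taken from the fork.

```python
import json

# Key copied verbatim from the checkpoint description above.
IMMA_FLAG = "__experimental_enable_imma_from_https://github.com/appMana/forks-vllm-ampere"


def imma_enabled(config: dict) -> bool:
    """Return True only when the flag is present and set to JSON true."""
    return config.get(IMMA_FLAG) is True


# Hypothetical config fragment; the real file name and nesting are not stated here.
example = json.loads(json.dumps({IMMA_FLAG: True}))
print(imma_enabled(example))  # True
print(imma_enabled({}))       # False
```

Checking for the literal value `True` rather than truthiness keeps a stray string or number from silently enabling an experimental path.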

Quantization

Routed expert weights are converted ahead of time from the native DeepSeek-V4 MXFP4 expert tensors to INT4 W4A16 for Ampere.

For each expert weight tensor:

MXFP4 bytes are unpacked into e2m1 values.
The native e8m0 scale is applied to recover FP32 group values.
Values are grouped by 32 along the last dimension.
For scale_mode="mse", the converter tries per-group candidate scales abs_max / div, with div from 5.0 to 9.5.
Each candidate scale is rounded to BF16 before scoring.
For each candidate, INT4 codes are recomputed with round(group / scale).clamp(-8, 7).
The selected scale is the candidate that minimizes sum((q * scale - group) ** 2) for that group.
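
The steps above can be sketched for a single group in plain Python. This is an illustration under stated assumptions, not the converter itself: the nibble order (low nibble first), the 0.5 step between the stated divisors 5.0 and 9.5, the handling of all-zero groups, and all function names are assumptions. The e2m1 value table and the e8m0 bias of 127 follow the MXFP4 format definition.

```python
import struct

# The eight non-negative e2m1 magnitudes; bit 3 of a nibble is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_mxfp4_group(packed: bytes, scale_e8m0: int) -> list[float]:
    """Unpack e2m1 nibbles and apply the e8m0 scale for one group.

    Assumption: the low nibble of each byte holds the earlier element.
    """
    scale = 2.0 ** (scale_e8m0 - 127)  # e8m0: power-of-two exponent, bias 127
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            magnitude = E2M1_MAGNITUDES[nibble & 0x7]
            values.append((-magnitude if nibble & 0x8 else magnitude) * scale)
    return values


def round_to_bf16(x: float) -> float:
    """Round to the nearest BF16 value (round-to-nearest-even on FP32 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def quantize_group_mse(group: list[float]) -> tuple[float, list[int]]:
    """Pick the BF16 candidate scale with the lowest round-trip squared error."""
    abs_max = max(abs(v) for v in group)
    if abs_max == 0.0:
        # Assumption: an all-zero group gets a unit scale and zero codes.
        return 1.0, [0] * len(group)
    best_err, best_scale, best_codes = float("inf"), 1.0, [0] * len(group)
    for step in range(10):
        div = 5.0 + 0.5 * step  # assumed grid: 5.0, 5.5, ..., 9.5
        scale = round_to_bf16(abs_max / div)  # round before scoring
        if scale == 0.0:
            continue
        # Codes are recomputed for every candidate scale.
        codes = [max(-8, min(7, round(v / scale))) for v in group]
        err = sum((q * scale - v) ** 2 for q, v in zip(codes, group))
        if err < best_err:
            best_err, best_scale, best_codes = err, scale, codes
    return best_scale, best_codes


# 16 packed bytes -> one group of 32 values, with an e8m0 scale of 2**0.
group = decode_mxfp4_group(bytes(range(16)), 127)
scale, codes = quantize_group_mse(group)
print(scale, codes)
```

Because the scale is rounded to BF16 before the codes are computed, the error being minimized is the one the stored scale and codes will actually produce at dequantization time.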

"MSE scale" means the scale is part of the round-trip quantize/dequantize error being minimized. It is not a fixed-scale integer-code MSE, and it is not a closed-form least-squares refit after choosing codes.

Stored expert format:

signed INT4 values encoded as uint4b8 nibbles (code + 8)
BF16 per-group scales
group size 32
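
The `code + 8` encoding can be illustrated with a small pack/unpack pair. This is a sketch only: placing the earlier code in the low nibble is an assumption, the function names are hypothetical, and the layout a kernel actually consumes may be repacked differently.

```python
def pack_uint4b8(codes: list[int]) -> bytes:
    """Bias signed INT4 codes by +8 and pack two nibbles per byte.

    Assumption: the earlier code goes in the low nibble.
    """
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("INT4 code out of range")
        out.append((lo + 8) | ((hi + 8) << 4))
    return bytes(out)


def unpack_uint4b8(packed: bytes) -> list[int]:
    """Invert pack_uint4b8: split nibbles and remove the +8 bias."""
    codes = []
    for byte in packed:
        codes.append((byte & 0x0F) - 8)
        codes.append((byte >> 4) - 8)
    return codes


packed = pack_uint4b8([-8, 7, 0, -1])
print(packed.hex())            # f078
print(unpack_uint4b8(packed))  # [-8, 7, 0, -1]
```

The bias maps the signed range [-8, 7] onto the unsigned range [0, 15], so each code fits in one nibble without a sign bit.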

Dense, shared-expert, and attention FP8 weights are converted to the AppMana INT8 W8A16 path. Where applicable, the stored tensor is channelwise biased UINT8 plus scale metadata.
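
To make "channelwise biased UINT8" concrete, here is one plausible reading for a single output channel. The real scheme is defined by the fork's loader; the symmetric `abs_max / 127` scale, the +128 bias, and both function names are assumptions made for illustration.

```python
def quantize_channel_biased_uint8(channel: list[float]) -> tuple[float, list[int]]:
    """Quantize one channel to signed INT8 codes, stored as code + 128.

    Assumption: symmetric per-channel scale abs_max / 127 and a +128 bias.
    """
    abs_max = max(abs(v) for v in channel)
    scale = abs_max / 127.0 if abs_max > 0.0 else 1.0
    stored = []
    for v in channel:
        code = max(-128, min(127, round(v / scale)))
        stored.append(code + 128)  # biased into the UINT8 range [0, 255]
    return scale, stored


def dequantize_channel(scale: float, stored: list[int]) -> list[float]:
    """Remove the bias and apply the per-channel scale."""
    return [(u - 128) * scale for u in stored]


scale, stored = quantize_channel_biased_uint8([-1.0, 0.0, 0.25, 1.0])
print(stored)
print(dequantize_channel(scale, stored))
```

With W8A16, only the weights are stored in 8 bits; activations stay in 16-bit floating point, so the dequantized weight is what enters the matrix multiply.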

Inference implementation

Stage	Technology	Source
Routed expert GEMM	Marlin MoE W4A16 INT4	vLLM upstream path with AppMana dsv4_int checkpoint loader in https://github.com/AppMana/forks-vllm-ampere
Dense and shared-expert GEMM	AllSpark W8A16 INT8	AppMana integration in https://github.com/AppMana/forks-vllm-ampere
Sparse MLA prefill/decode attention	AppMana CUDA Flash MLA / Sparse MLA kernel	https://github.com/AppMana/forks-flash-mla-ampere-dsv4
KV transfer	LMCache connector sidecar	https://github.com/AppMana/forks-lmcache

The quantization loader and conversion metadata are implemented in:

https://github.com/AppMana/forks-vllm-ampere/blob/appmana/vllm-ampere/vllm/model_executor/layers/quantization/dsv4_int.py

Random-prefix serving benchmarks

These benchmarks use random prompts with --random-prefix-len 0, so LMCache and vLLM prefix-cache hits do not contribute to the results. The serving topology is PP=10, TP=1 on RTX 3090 nodes. Each request generates 512 tokens with --ignore-eos and --temperature 0; other settings are --max-num-batched-tokens 2048, FP8 KV cache, and max_num_seqs=6.

Command shape:

vllm bench serve \
  --backend openai \
  --base-url http://127.0.0.1:8080 \
  --endpoint /v1/completions \
  --model MODEL_ID \
  --dataset-name random \
  --random-input-len INPUT_TOKENS \
  --random-output-len 512 \
  --random-prefix-len 0 \
  --num-prompts C \
  --max-concurrency C \
  --request-rate inf \
  --ignore-eos \
  --temperature 0 \
  --percentile-metrics ttft,tpot,itl,e2el
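
To run the same command shape over several rows, a small helper can fill in INPUT_TOKENS and C. This sketch uses only the flags shown above; the helper name `bench_command` is hypothetical, and MODEL_ID remains a placeholder.

```python
import shlex


def bench_command(model_id: str, input_tokens: int, concurrency: int) -> list[str]:
    """Build the argv for one benchmark row, using only the flags shown above."""
    return [
        "vllm", "bench", "serve",
        "--backend", "openai",
        "--base-url", "http://127.0.0.1:8080",
        "--endpoint", "/v1/completions",
        "--model", model_id,
        "--dataset-name", "random",
        "--random-input-len", str(input_tokens),
        "--random-output-len", "512",
        "--random-prefix-len", "0",
        "--num-prompts", str(concurrency),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
        "--ignore-eos",
        "--temperature", "0",
        "--percentile-metrics", "ttft,tpot,itl,e2el",
    ]


# Input lengths and concurrency values taken from the FP4/FP8 table rows.
for input_tokens in (15872, 32256, 65024):
    for concurrency in (1, 2, 4):
        print(shlex.join(bench_command("MODEL_ID", input_tokens, concurrency)))
```

Setting --num-prompts equal to --max-concurrency means every request in a row starts together, so each row measures exactly C concurrent streams.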

Vanilla DeepSeek-V4-Flash FP4/FP8

Model: deepseek-ai/DeepSeek-V4-Flash.

Observed implementation path:

Stage	Observed path
Checkpoint quantization	`quant_method=fp8`, `fmt=e4m3`, `scale_fmt=ue8m0`, `weight_block_size=[128,128]`
Dense FP8 linear	`MarlinFP8ScaledMMLinearKernel` for `Fp8LinearMethod`; DeepGEMM UE8M0 config enabled
Routed experts	`expert_dtype='fp4'`, `Using 'MARLIN' Mxfp4 MoE backend`
Attention / MLA	DeepSeek V4 NVIDIA FlashMLA path; 200k OOM stack was in `vllm/models/deepseek_v4/nvidia/flashmla.py:127`
KV cache	`kv_cache_dtype=fp8`, `Using DeepSeek's fp8_ds_mla KV cache format`, FP8 indexer cache for Lightning Indexer
Compilation	vLLM compile mode and decode CUDA graphs were enabled, but workers logged that this DeepSeek-V4 model does not support `torch.compile`; monitored JIT mode was active
KV transfer	LMCache MP connector using `lmcache.c_ops`; benchmark prompts had `0.0%` prefix-cache and external-prefix-cache hit rates

Context	Concurrency (C)	Input tokens	Result	TTFT mean	Prefill tok/s/stream	TPOT mean	Decode-only tok/s/stream
16k	1	15,872	pass	15.78 s	1005.7	61.61 ms	16.23
16k	2	15,872	pass	21.11 s	751.8	72.64 ms	13.77
16k	4	15,872	pass	31.97 s	496.5	93.50 ms	10.70
32k	1	32,256	pass	27.03 s	1193.1	62.18 ms	16.08
32k	2	32,256	pass	38.33 s	841.6	83.77 ms	11.94
32k	4	32,256	pass	60.99 s	528.9	128.00 ms	7.81
64k	1	65,024	pass	51.14 s	1271.4	62.93 ms	15.89
64k	2	65,024	pass	73.59 s	883.6	107.88 ms	9.27
64k	4	65,024	pass	121.81 s	533.8	198.18 ms	5.05
200k	1	199,488	OOM	n/a	n/a	n/a	n/a

The 200k C=1 request failed after startup with CUDA OOM on PP rank 9 while executing FlashMLA. The failed worker reported a 114 MiB allocation request with 95.75 MiB free; scheduler stats showed KV cache usage at 19.2%, so the failure reflects insufficient rank-local VRAM headroom rather than exhaustion of the configured KV cache.

AppMana DeepSeek-V4-Flash INT4/INT8

Model: appmana/deepseek-v4-int4-int8.

The INT4/INT8 runs below used the same PP=10, TP=1 serving topology and the same random-prefix benchmark shape as above. Decode-only throughput is computed as 1000 / TPOT_ms, so it is per active stream and separate from vLLM's aggregate output-token throughput.
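
The per-stream columns can be reproduced from the latency columns. The decode formula is the one stated above; the prefill formula (input tokens divided by mean TTFT) is an assumption that appears consistent with the tables to within TTFT rounding. Function names are illustrative.

```python
def decode_tok_per_s(tpot_ms: float) -> float:
    """Decode-only throughput per active stream, from mean TPOT in milliseconds."""
    return 1000.0 / tpot_ms


def prefill_tok_per_s(input_tokens: int, ttft_s: float) -> float:
    """Assumed prefill throughput per stream: input tokens over mean TTFT."""
    return input_tokens / ttft_s


# 16k, C=1 rows from the two tables.
print(round(decode_tok_per_s(61.61), 2))  # FP4/FP8
print(round(decode_tok_per_s(34.48), 2))  # INT4/INT8
print(round(prefill_tok_per_s(15488, 15.55), 1))
```

Because TPOT excludes the first token, this figure describes steady-state decode for one stream and does not fold in prefill time the way an end-to-end output-token rate does.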

Observed implementation path:

Stage	Observed path
Checkpoint quantization	Routed experts converted to INT4 W4A16; dense, shared-expert, and attention FP8 weights converted to AppMana INT8 W8A16 metadata
Dense and shared-expert GEMM	AllSpark W8A16 INT8 for most dense paths; `dequant_channel_bf16_wo_a` fallback appears for `.attn.wo_a`
Routed experts	Marlin MoE W4A16 INT4
Attention / MLA	AppMana DeepSeek V4 sparse MLA path in the vLLM fork
KV cache	FP8 KV cache with DeepSeek V4 indexer block size 256
KV transfer	LMCache MP connector using `lmcache.c_ops`; random-prefix benchmark rows showed no prefix-cache hits

Context	Concurrency (C)	Input tokens	Result	TTFT mean	Prefill tok/s/stream	TPOT mean	Decode-only tok/s/stream	vLLM output tok/s
1k	1	1,000	pass	3.10 s	322.1	34.09 ms	29.33	24.94
16k	1	15,488	pass	15.55 s	995.7	34.48 ms	29.00	15.43
16k	2	15,488	stalled	n/a	n/a	n/a	n/a	n/a
32k	1	31,488	failed	n/a	n/a	n/a	n/a	n/a

In the 16k C=2 run, the server accepted both streams but made no decode progress: metrics showed vllm:num_requests_running=2, vllm:generation_tokens_total=1, and no completed requests after several minutes. The client was still waiting for streamed chunks when it was stopped.
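
A stall of this kind can be flagged by comparing two scrapes of the metrics endpoint. The sketch below parses Prometheus text exposition format and uses the two metric names quoted above; the label set in the sample, the assumption that samples carry no trailing timestamp, and the helper names are illustrative.

```python
def read_metric(metrics_text: str, name: str) -> float:
    """Sum all samples of one metric in Prometheus text exposition format.

    Assumption: samples carry no trailing timestamp.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        head, _, value = line.rpartition(" ")
        if head.split("{", 1)[0] == name:
            total += float(value)
    return total


def looks_stalled(before: str, after: str) -> bool:
    """True when requests are running but no new tokens appeared between scrapes."""
    running = read_metric(after, "vllm:num_requests_running")
    generated_before = read_metric(before, "vllm:generation_tokens_total")
    generated_after = read_metric(after, "vllm:generation_tokens_total")
    return running > 0 and generated_after == generated_before


# Hypothetical scrape matching the values reported for the 16k C=2 row.
snapshot = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="MODEL_ID"} 2.0
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="MODEL_ID"} 1.0
"""
print(looks_stalled(snapshot, snapshot))  # True
```

The interval between scrapes should be much longer than the expected TPOT, otherwise a slow but healthy decode step could be mistaken for a stall.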

The 32k C=1 run failed after the request started. PP rank 7 reported "CUDA error: an illegal memory access was encountered" at vllm/v1/worker/gpu/buffer_utils.py:37 while copying idx_mapping_np to the GPU from prepare_inputs; PP rank 8 then failed while receiving from the previous pipeline stage. vLLM recorded zero successful requests for that row.

FP4/FP8 vs INT4/INT8 decode summary

These rows separate decode-only throughput from TTFT and prefill. The INT4/INT8 16k input length was 15,488 tokens in this run; the FP4/FP8 16k input length was 15,872 tokens in the earlier run.

Context	Concurrency (C)	FP4/FP8 TTFT	INT4/INT8 TTFT	FP4/FP8 prefill tok/s/stream	INT4/INT8 prefill tok/s/stream	FP4/FP8 TPOT	INT4/INT8 TPOT	FP4/FP8 decode-only tok/s/stream	INT4/INT8 decode-only tok/s/stream	INT4/INT8 result
16k	1	15.78 s	15.55 s	1005.7	995.7	61.61 ms	34.48 ms	16.23	29.00	pass
16k	2	21.11 s	n/a	751.8	n/a	72.64 ms	n/a	13.77	n/a	stalled
32k	1	27.03 s	n/a	1193.1	n/a	62.18 ms	n/a	16.08	n/a	failed
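
For the one row where both runs passed (16k, C=1), the relative change can be computed directly from the table values. This is plain arithmetic on the reported means; since the two runs used slightly different input lengths, the ratios are indicative rather than exact. The helper name is illustrative.

```python
def ratio(baseline: float, candidate: float) -> float:
    """How many times smaller the candidate value is than the baseline."""
    return baseline / candidate


tpot_speedup = ratio(61.61, 34.48)  # FP4/FP8 TPOT (ms) over INT4/INT8 TPOT (ms)
ttft_ratio = ratio(15.78, 15.55)    # FP4/FP8 TTFT (s) over INT4/INT8 TTFT (s)
print(f"decode speedup: {tpot_speedup:.2f}x, TTFT ratio: {ttft_ratio:.2f}x")
```

On these numbers the difference is concentrated in decode, while TTFT is close to unchanged.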
