DeepSeek V4 Flash dsv4_int INT4/INT8
This checkpoint is for the AppMana Ampere vLLM fork:
https://github.com/AppMana/forks-vllm-ampere
The container image used for serving is published from that fork.
This checkpoint sets:
"__experimental_enable_imma_from_https://github.com/appMana/forks-vllm-ampere": true
That field is read by the AppMana fork to select the IMMA INT8 paths without requiring deployment-specific environment variables.
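As a rough illustration of how a loader could consume that field, the sketch below reads the flag from a JSON config. Only the key string comes from this checkpoint. The file name config.json, the helper name imma_enabled, and the choice to treat a missing key as disabled are assumptions for illustration, and the fork's actual lookup may differ.

```python
import json
from pathlib import Path

# Key copied verbatim from the checkpoint; everything else here is hypothetical.
IMMA_FLAG = "__experimental_enable_imma_from_https://github.com/appMana/forks-vllm-ampere"


def imma_enabled(checkpoint_dir: str, config_name: str = "config.json") -> bool:
    """Return True only if the checkpoint config sets the IMMA flag to JSON true."""
    config_path = Path(checkpoint_dir) / config_name
    with config_path.open("r", encoding="utf-8") as fh:
        config = json.load(fh)
    # Require a literal boolean so strings such as "false" never enable the path.
    return config.get(IMMA_FLAG, False) is True
```

Keeping the switch in the checkpoint rather than in the environment means the same container image can serve both this checkpoint and others without per-deployment settings.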
Quantization
Routed expert weights are converted ahead of time from the native DeepSeek-V4 MXFP4 expert tensors to INT4 W4A16 for Ampere.
For each expert weight tensor:
- MXFP4 bytes are unpacked into e2m1 values.
- The native e8m0 scale is applied to recover FP32 group values.
- Values are grouped by 32 along the last dimension.
- For scale_mode="mse", the converter tries per-group candidate scales abs_max / div, with div from 5.0 to 9.5.
- Each candidate scale is rounded to BF16 before scoring.
- For each candidate, INT4 codes are recomputed with round(group / scale).clamp(-8, 7).
- The selected scale is the candidate that minimizes sum((q * scale - group) ** 2) for that group.
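The steps above can be sketched in plain Python for a single group. This is a minimal illustration, not the converter itself: the function names are hypothetical, the low-nibble-first byte order is an assumption, the 0.5 step between div candidates is an assumption (only the 5.0 to 9.5 range is stated), and the handling of an all-zero group is a placeholder.

```python
import struct

# e2m1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_mxfp4(packed: bytes, e8m0_scale: int) -> list:
    """Unpack MXFP4 bytes (assumed low nibble first) and apply the e8m0 scale."""
    scale = 2.0 ** (e8m0_scale - 127)  # e8m0 is a biased power-of-two exponent
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            magnitude = E2M1_MAGNITUDES[nibble & 0x7]
            values.append((-magnitude if nibble & 0x8 else magnitude) * scale)
    return values


def round_to_bf16(x: float) -> float:
    """Round a finite value to BF16 using round-to-nearest-even on FP32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def quantize_group_mse(group, divs=None):
    """Pick the BF16 scale whose quantize/dequantize round trip has the lowest
    squared error for one group, and return (scale, int4_codes)."""
    if divs is None:
        divs = [5.0 + 0.5 * i for i in range(10)]  # 5.0 .. 9.5, step is assumed
    abs_max = max(abs(v) for v in group)
    best = None
    for div in divs:
        scale = round_to_bf16(abs_max / div)  # rounded before scoring
        if scale == 0.0:
            continue
        # Python's round() is round-half-to-even, like torch.round.
        codes = [max(-8, min(7, round(v / scale))) for v in group]
        err = sum((q * scale - v) ** 2 for q, v in zip(codes, group))
        if best is None or err < best[0]:
            best = (err, scale, codes)
    if best is None:  # all-zero (or vanishing) group: placeholder scale
        return 1.0, [0] * len(group)
    return best[1], best[2]
```

Because the codes are recomputed for every candidate, the score reflects the full round trip for that scale rather than the error of one fixed set of codes.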
"MSE scale" means the scale is part of the round-trip quantize/dequantize error being minimized. It is not a fixed-scale integer-code MSE, and it is not a closed-form least-squares refit after choosing codes.
Stored expert format:
- signed INT4 values encoded as uint4b8 nibbles (code + 8)
- BF16 per-group scales
- group size 32
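A small sketch of the code + 8 encoding follows. It only illustrates the biased-nibble idea: the helper names are hypothetical, and the low-nibble-first packing order is an assumption. The layout actually consumed by the Marlin kernels is permuted and is not reproduced here.

```python
def pack_uint4b8(codes):
    """Pack signed INT4 codes (-8..7) into bytes as biased nibbles (code + 8)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        for code in (lo, hi):
            if not -8 <= code <= 7:
                raise ValueError(f"code {code} is outside the INT4 range")
        # Assumed order: first code in the low nibble, second in the high nibble.
        out.append(((hi + 8) << 4) | (lo + 8))
    return bytes(out)


def unpack_uint4b8(packed):
    """Recover signed INT4 codes by removing the bias of 8 from each nibble."""
    codes = []
    for byte in packed:
        codes.append((byte & 0xF) - 8)
        codes.append((byte >> 4) - 8)
    return codes


def dequantize_group(codes, scale):
    """Apply one per-group scale (stored as BF16 in the checkpoint)."""
    return [q * scale for q in codes]
```

For example, pack_uint4b8([-8, 7]) yields the single byte 0xF0, and a zero code is stored as the nibble 8.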
Dense, shared-expert, and attention FP8 weights are converted to the AppMana INT8 W8A16 path. Where applicable, the stored tensor is channelwise biased UINT8 plus scale metadata.
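One plausible reading of "channelwise biased UINT8" is sketched below for a single output channel. The bias of 128, the symmetric abs_max / 127 scale, and the helper names are all assumptions made for illustration; the checkpoint's actual metadata layout is defined by the fork and may differ.

```python
def quantize_row_biased_uint8(row):
    """Quantize one output channel to INT8, then store it as UINT8 with +128 bias.

    Assumes the FP8 weights were already dequantized to floats.
    """
    abs_max = max(abs(v) for v in row)
    scale = abs_max / 127.0 if abs_max > 0 else 1.0  # one scale per channel
    stored = [max(-128, min(127, round(v / scale))) + 128 for v in row]
    return stored, scale


def dequantize_row(stored, scale):
    """Undo the assumed bias and apply the channel scale (16-bit at inference)."""
    return [(u - 128) * scale for u in stored]
```

In a W8A16 path the weights stay 8-bit in memory and are expanded to 16-bit only inside the GEMM, so the activations themselves are never quantized.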
Inference implementation
| Stage | Technology | Source |
|---|---|---|
| Routed expert GEMM | Marlin MoE W4A16 INT4 | vLLM upstream path with AppMana dsv4_int checkpoint loader in https://github.com/AppMana/forks-vllm-ampere |
| Dense and shared-expert GEMM | AllSpark W8A16 INT8 | AppMana integration in https://github.com/AppMana/forks-vllm-ampere |
| Sparse MLA prefill/decode attention | AppMana CUDA Flash MLA / Sparse MLA kernel | https://github.com/AppMana/forks-flash-mla-ampere-dsv4 |
| KV transfer | LMCache connector sidecar | https://github.com/AppMana/forks-lmcache |
The quantization loader and conversion metadata are implemented in:
Random-prefix serving benchmarks
These benchmarks use random prompts with --random-prefix-len 0, so LMCache
and vLLM prefix-cache hits are not part of the result. The serving topology is
PP=10, TP=1 on RTX 3090 nodes, 512 generated tokens, --ignore-eos,
--temperature 0, --max-num-batched-tokens 2048, FP8 KV cache, and
max_num_seqs=6.
Command shape:
vllm bench serve \
--backend openai \
--base-url http://127.0.0.1:8080 \
--endpoint /v1/completions \
--model MODEL_ID \
--dataset-name random \
--random-input-len INPUT_TOKENS \
--random-output-len 512 \
--random-prefix-len 0 \
--num-prompts C \
--max-concurrency C \
--request-rate inf \
--ignore-eos \
--temperature 0 \
--percentile-metrics ttft,tpot,itl,e2el
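To run the same command shape across several context lengths and concurrency levels, a small driver like the following can generate the argument lists. It uses only the flags shown above. The function name, the sweep values, and the MODEL_ID placeholder are illustrative assumptions, and the script prints the commands rather than running them.

```python
import shlex


def bench_command(model_id, input_tokens, concurrency,
                  base_url="http://127.0.0.1:8080"):
    """Build one `vllm bench serve` invocation matching the command shape."""
    return [
        "vllm", "bench", "serve",
        "--backend", "openai",
        "--base-url", base_url,
        "--endpoint", "/v1/completions",
        "--model", model_id,
        "--dataset-name", "random",
        "--random-input-len", str(input_tokens),
        "--random-output-len", "512",
        "--random-prefix-len", "0",   # no shared prefix, so no cache hits
        "--num-prompts", str(concurrency),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
        "--ignore-eos",
        "--temperature", "0",
        "--percentile-metrics", "ttft,tpot,itl,e2el",
    ]


if __name__ == "__main__":
    # Example sweep; replace MODEL_ID and the lengths with the values under test.
    for input_tokens in (15872, 32256, 65024):
        for concurrency in (1, 2, 4):
            print(shlex.join(bench_command("MODEL_ID", input_tokens, concurrency)))
```

Setting both --num-prompts and --max-concurrency to C means each row measures exactly C simultaneous streams, each submitted once.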
Vanilla DeepSeek-V4-Flash FP4/FP8
Model: deepseek-ai/DeepSeek-V4-Flash.
Observed implementation path:
| Stage | Observed path |
|---|---|
| Checkpoint quantization | quant_method=fp8, fmt=e4m3, scale_fmt=ue8m0, weight_block_size=[128,128] |
| Dense FP8 linear | MarlinFP8ScaledMMLinearKernel for Fp8LinearMethod; DeepGEMM UE8M0 config enabled |
| Routed experts | expert_dtype='fp4', Using 'MARLIN' Mxfp4 MoE backend |
| Attention / MLA | DeepSeek V4 NVIDIA FlashMLA path; 200k OOM stack was in vllm/models/deepseek_v4/nvidia/flashmla.py:127 |
| KV cache | kv_cache_dtype=fp8, Using DeepSeek's fp8_ds_mla KV cache format, FP8 indexer cache for Lightning Indexer |
| Compilation | vLLM compile mode and decode CUDA graphs were enabled, but workers logged that this DeepSeek-V4 model does not support torch.compile; monitored JIT mode was active |
| KV transfer | LMCache MP connector using lmcache.c_ops; benchmark prompts had 0.0% prefix-cache and external-prefix-cache hit rates |
| Context | C | Input tokens | Result | TTFT mean | Prefill tok/s/stream | TPOT mean | Decode tok/s/stream |
|---|---|---|---|---|---|---|---|
| 16k | 1 | 15,872 | pass | 15.78 s | 1005.7 | 61.61 ms | 16.23 |
| 16k | 2 | 15,872 | pass | 21.11 s | 751.8 | 72.64 ms | 13.77 |
| 16k | 4 | 15,872 | pass | 31.97 s | 496.5 | 93.50 ms | 10.70 |
| 32k | 1 | 32,256 | pass | 27.03 s | 1193.1 | 62.18 ms | 16.08 |
| 32k | 2 | 32,256 | pass | 38.33 s | 841.6 | 83.77 ms | 11.94 |
| 32k | 4 | 32,256 | pass | 60.99 s | 528.9 | 128.00 ms | 7.81 |
| 64k | 1 | 65,024 | pass | 51.14 s | 1271.4 | 62.93 ms | 15.89 |
| 64k | 2 | 65,024 | pass | 73.59 s | 883.6 | 107.88 ms | 9.27 |
| 64k | 4 | 65,024 | pass | 121.81 s | 533.8 | 198.18 ms | 5.05 |
| 200k | 1 | 199,488 | OOM | n/a | n/a | n/a | n/a |
The 200k C=1 request failed after startup with a CUDA OOM on PP rank 9 while executing FlashMLA. The failed worker reported a 114 MiB allocation request with 95.75 MiB free. Scheduler stats showed KV cache usage at 19.2%, so the failure was caused by insufficient rank-local VRAM headroom rather than by exhausting the configured KV cache.
AppMana DeepSeek-V4-Flash INT4/INT8
Model: appmana/deepseek-v4-int4-int8.
The INT4/INT8 runs below used the same PP=10, TP=1 serving topology and the
same random-prefix benchmark shape as above. Decode-only throughput is computed
as 1000 / TPOT_ms, so it is per active stream and separate from vLLM's
aggregate output-token throughput.
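The derived columns can be reproduced with a few lines of arithmetic. Only the 1000 / TPOT_ms formula is stated in this card. The prefill formula (input tokens divided by TTFT) and the single-stream estimate of output throughput are assumptions that happen to match the reported rows to within rounding, and the function names are illustrative.

```python
def decode_tok_per_s(tpot_ms: float) -> float:
    """Decode-only tokens per second for one active stream."""
    return 1000.0 / tpot_ms


def prefill_tok_per_s(input_tokens: int, ttft_s: float) -> float:
    """Assumed definition: prompt tokens processed per second until first token."""
    return input_tokens / ttft_s


def estimated_output_tok_per_s(output_tokens: int, ttft_s: float,
                               tpot_ms: float) -> float:
    """Assumed single-stream estimate: output tokens over end-to-end latency,
    where the first token arrives at TTFT and the rest at one TPOT each."""
    e2e_s = ttft_s + (output_tokens - 1) * tpot_ms / 1000.0
    return output_tokens / e2e_s


if __name__ == "__main__":
    # INT4/INT8 16k, C=1 row: TTFT 15.55 s, TPOT 34.48 ms, 512 output tokens.
    print(round(decode_tok_per_s(34.48), 2))
    print(round(estimated_output_tok_per_s(512, 15.55, 34.48), 2))
```

The gap between the two numbers grows with context length because the end-to-end figure also pays for prefill, while the decode-only figure does not.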
Observed implementation path:
| Stage | Observed path |
|---|---|
| Checkpoint quantization | Routed experts converted to INT4 W4A16; dense/shared expert/attention FP8 weights converted to AppMana INT8 W8A16 metadata |
| Dense and shared-expert GEMM | AllSpark W8A16 INT8 for most dense paths; dequant_channel_bf16_wo_a fallback appears for .attn.wo_a |
| Routed experts | Marlin MoE W4A16 INT4 |
| Attention / MLA | AppMana DeepSeek V4 sparse MLA path in the vLLM fork |
| KV cache | FP8 KV cache with DeepSeek V4 indexer block size 256 |
| KV transfer | LMCache MP connector using lmcache.c_ops; random-prefix benchmark rows showed no prefix-cache hits |
| Context | C | Input tokens | Result | TTFT mean | Prefill tok/s/stream | TPOT mean | Decode-only tok/s/stream | vLLM output tok/s |
|---|---|---|---|---|---|---|---|---|
| 1k | 1 | 1,000 | pass | 3.10 s | 322.1 | 34.09 ms | 29.33 | 24.94 |
| 16k | 1 | 15,488 | pass | 15.55 s | 995.7 | 34.48 ms | 29.00 | 15.43 |
| 16k | 2 | 15,488 | stalled | n/a | n/a | n/a | n/a | n/a |
| 32k | 1 | 31,488 | failed | n/a | n/a | n/a | n/a | n/a |
The 16k C=2 row accepted both streams but did not make decode progress: metrics
showed vllm:num_requests_running=2, vllm:generation_tokens_total=1, and no
completed requests after several minutes. The client was still waiting for
streamed chunks when it was stopped.
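A stall of this kind can be flagged automatically by comparing two scrapes of the server's Prometheus metrics. The sketch below is one way to do that with the two metric names quoted above. It assumes the metrics are served at /metrics on the same base URL as the API, and the function names are hypothetical.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict:
    """Sum Prometheus text-format samples by metric name, ignoring labels."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        if "{" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            name, *fields = line.split()
        if fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def scrape(base_url: str = "http://127.0.0.1:8080") -> dict:
    """Fetch and parse one metrics snapshot (endpoint path is an assumption)."""
    with urlopen(f"{base_url}/metrics", timeout=10) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


def looks_stalled(before: dict, after: dict) -> bool:
    """True if requests are running but no new tokens were generated."""
    running = after.get("vllm:num_requests_running", 0.0)
    tokens_before = before.get("vllm:generation_tokens_total", 0.0)
    tokens_after = after.get("vllm:generation_tokens_total", 0.0)
    return running > 0 and tokens_after <= tokens_before
```

Taking the two snapshots a minute or so apart and calling looks_stalled(before, after) would have reported the 16k C=2 condition described above without waiting on the client.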
The 32k C=1 row failed after the request started. PP rank 7 reported
CUDA error: an illegal memory access was encountered in
vllm/v1/worker/gpu/buffer_utils.py:37 while copying idx_mapping_np to the
GPU from prepare_inputs; PP rank 8 then failed receiving from the previous
pipeline stage. vLLM recorded zero successful requests for that row.
FP4/FP8 vs INT4/INT8 Decode Summary
These rows separate decode-only throughput from TTFT and prefill. The INT4/INT8 16k input length was 15,488 tokens in this run; the FP4/FP8 16k input length was 15,872 tokens in the earlier run.
| Context | C | FP4/FP8 TTFT | INT4/INT8 TTFT | FP4/FP8 prefill tok/s/stream | INT4/INT8 prefill tok/s/stream | FP4/FP8 TPOT | INT4/INT8 TPOT | FP4/FP8 decode-only tok/s/stream | INT4/INT8 decode-only tok/s/stream | INT4/INT8 result |
|---|---|---|---|---|---|---|---|---|---|---|
| 16k | 1 | 15.78 s | 15.55 s | 1005.7 | 995.7 | 61.61 ms | 34.48 ms | 16.23 | 29.00 | pass |
| 16k | 2 | 21.11 s | n/a | 751.8 | n/a | 72.64 ms | n/a | 13.77 | n/a | stalled |
| 32k | 1 | 27.03 s | n/a | 1193.1 | n/a | 62.18 ms | n/a | 16.08 | n/a | failed |