DeepSeek V4 Flash dsv4_int INT4/INT8

This checkpoint is for the AppMana Ampere vLLM fork:

https://github.com/AppMana/forks-vllm-ampere

The container image used for serving is published from that fork.

This checkpoint sets:

"__experimental_enable_imma_from_https://github.com/appMana/forks-vllm-ampere": true

That field is read by the AppMana fork to select the IMMA INT8 paths without requiring deployment-specific environment variables.

Quantization

Routed expert weights are converted ahead of time from the native DeepSeek-V4 MXFP4 expert tensors to INT4 W4A16 for Ampere.

For each expert weight tensor:

  1. MXFP4 bytes are unpacked into e2m1 values.
  2. The native e8m0 scale is applied to recover FP32 group values.
  3. Values are grouped by 32 along the last dimension.
  4. For scale_mode="mse", the converter tries per-group candidate scales abs_max / div, with div from 5.0 to 9.5.
  5. Each candidate scale is rounded to BF16 before scoring.
  6. For each candidate, INT4 codes are recomputed with round(group / scale).clamp(-8, 7).
  7. The selected scale is the candidate that minimizes sum((q * scale - group) ** 2) for that group.
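The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the converter's actual code: the function names are hypothetical, the nibble order inside each packed byte and the 0.5 divisor step are assumptions (the text only gives the range 5.0 to 9.5), and the all-zero-group guard is an illustrative choice.

```python
import numpy as np

# E2M1 magnitudes from the OCP Microscaling (MX) FP4 format; bit 3 is the sign.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def unpack_mxfp4(packed, scales_e8m0):
    """Steps 1-2: decode packed MXFP4 bytes to FP32.

    packed:      uint8, shape (..., K // 2), two 4-bit codes per byte.
                 ASSUMPTION: the low nibble holds the even-indexed element.
    scales_e8m0: uint8, shape (..., K // 32), one shared exponent per 32 values.
    """
    codes = np.stack([packed & 0x0F, packed >> 4], axis=-1)
    codes = codes.reshape(*packed.shape[:-1], -1)
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    values = sign * E2M1_MAGNITUDES[codes & 0x7]
    # E8M0 is a biased power-of-two exponent: scale = 2 ** (e - 127).
    scale = np.exp2(scales_e8m0.astype(np.float32) - 127.0)
    groups = values.reshape(*values.shape[:-1], -1, 32) * scale[..., None]
    return groups.reshape(values.shape)


def to_bf16(x):
    """Step 5: round FP32 to the nearest BF16 value (ties to even), kept as FP32."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    rounded = bits + np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)


def mse_scale_search(w, group_size=32, divisors=None):
    """Steps 3-7: return (INT4 codes of shape (G, 32), BF16 scales of shape (G, 1))."""
    if divisors is None:
        # ASSUMPTION: 0.5 step between 5.0 and 9.5 inclusive.
        divisors = np.arange(5.0, 10.0, 0.5)
    groups = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    abs_max = np.abs(groups).max(axis=1, keepdims=True)
    best_err = np.full((groups.shape[0], 1), np.inf, dtype=np.float32)
    best_scale = np.ones_like(abs_max)
    best_q = np.zeros(groups.shape, dtype=np.int8)
    for div in divisors:
        scale = to_bf16(abs_max / np.float32(div))
        # Guard all-zero groups against division by zero (illustrative choice).
        scale = np.where(scale == 0, np.float32(1.0), scale)
        q = np.clip(np.rint(groups / scale), -8, 7)
        err = ((q * scale - groups) ** 2).sum(axis=1, keepdims=True)
        better = err < best_err
        best_err = np.where(better, err, best_err)
        best_scale = np.where(better, scale, best_scale)
        best_q = np.where(better, q.astype(np.int8), best_q)
    return best_q, best_scale
```

Because the scale is rounded to BF16 before the codes are recomputed and scored, the error being minimized is the error of the values that would actually be stored.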

"MSE scale" means the scale is part of the round-trip quantize/dequantize error being minimized. It is not a fixed-scale integer-code MSE, and it is not a closed-form least-squares refit after choosing codes.

Stored expert format:

  • signed INT4 values encoded as uint4b8 nibbles (code + 8)
  • BF16 per-group scales
  • group size 32
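As a small illustration of the stored format, the sketch below maps signed INT4 codes to uint4b8 nibbles and dequantizes them with per-group scales. The function names are hypothetical, and the sketch deliberately leaves out how nibbles are packed into bytes or reordered for the Marlin kernel, since that layout is not described here.

```python
import numpy as np


def encode_uint4b8(q):
    """Map signed INT4 codes in [-8, 7] to unsigned nibbles in [0, 15] (code + 8)."""
    return (np.asarray(q).astype(np.int16) + 8).astype(np.uint8)


def decode_uint4b8(nibbles, scales, group_size=32):
    """Recover FP32 weights as (nibble - 8) * per-group scale."""
    q = nibbles.astype(np.float32) - 8.0
    groups = q.reshape(-1, group_size) * np.asarray(scales, dtype=np.float32).reshape(-1, 1)
    return groups.reshape(nibbles.shape)
```

With this encoding, the most negative code (-8) is stored as nibble 0 and the most positive code (7) as nibble 15.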

Dense, shared-expert, and attention FP8 weights are converted for the AppMana INT8 W8A16 path. Where applicable, the stored tensor is channelwise biased UINT8 plus scale metadata.
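One way to picture "channelwise biased UINT8" is the sketch below. It is an assumption-laden illustration, not the fork's converter: the bias of 128, the symmetric abs_max / 127 scale, and the function names are all hypothetical. The actual metadata layout is defined by the loader in dsv4_int.py.

```python
import numpy as np


def quantize_w8_channelwise(w):
    """w: (out_channels, in_features) FP32 -> (uint8 codes, per-channel FP32 scales)."""
    w = np.asarray(w, dtype=np.float32)
    abs_max = np.abs(w).max(axis=1, keepdims=True)
    # ASSUMPTION: symmetric scale per output channel; all-zero rows get scale 1.
    scale = np.where(abs_max == 0, 1.0, abs_max / 127.0).astype(np.float32)
    q = np.clip(np.rint(w / scale), -128, 127).astype(np.int16)
    # ASSUMPTION: a bias of 128 shifts signed INT8 codes into the UINT8 range.
    return (q + 128).astype(np.uint8), scale


def dequantize_w8_channelwise(u, scale):
    """Undo the bias, then apply the per-channel scale."""
    return (u.astype(np.float32) - 128.0) * scale
```

Under these assumptions, the round-trip error per element is bounded by half of that channel's scale.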

Inference implementation

| Stage | Technology | Source |
| --- | --- | --- |
| Routed expert GEMM | Marlin MoE W4A16 INT4 | vLLM upstream path with AppMana dsv4_int checkpoint loader in https://github.com/AppMana/forks-vllm-ampere |
| Dense and shared-expert GEMM | AllSpark W8A16 INT8 | AppMana integration in https://github.com/AppMana/forks-vllm-ampere |
| Sparse MLA prefill/decode attention | AppMana CUDA Flash MLA / Sparse MLA kernel | https://github.com/AppMana/forks-flash-mla-ampere-dsv4 |
| KV transfer | LMCache connector sidecar | https://github.com/AppMana/forks-lmcache |

The quantization loader and conversion metadata are implemented in:

https://github.com/AppMana/forks-vllm-ampere/blob/appmana/vllm-ampere/vllm/model_executor/layers/quantization/dsv4_int.py

Random-prefix serving benchmarks

These benchmarks use random prompts with --random-prefix-len 0, so LMCache and vLLM prefix-cache hits are not part of the result. The serving topology is PP=10, TP=1 on RTX 3090 nodes, 512 generated tokens, --ignore-eos, --temperature 0, --max-num-batched-tokens 2048, FP8 KV cache, and max_num_seqs=6.

Command shape:

vllm bench serve \
  --backend openai \
  --base-url http://127.0.0.1:8080 \
  --endpoint /v1/completions \
  --model MODEL_ID \
  --dataset-name random \
  --random-input-len INPUT_TOKENS \
  --random-output-len 512 \
  --random-prefix-len 0 \
  --num-prompts C \
  --max-concurrency C \
  --request-rate inf \
  --ignore-eos \
  --temperature 0 \
  --percentile-metrics ttft,tpot,itl,e2el
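For sweeping several input lengths and concurrency levels, the command shape above can be generated programmatically. The helper below is a hypothetical convenience wrapper that uses only the flags shown in the command; the swept values are the input lengths and concurrency levels reported for the FP4/FP8 runs in this document.

```python
import shlex


def bench_command(model_id, input_tokens, concurrency):
    """Build the argv for one benchmark row (hypothetical helper)."""
    return [
        "vllm", "bench", "serve",
        "--backend", "openai",
        "--base-url", "http://127.0.0.1:8080",
        "--endpoint", "/v1/completions",
        "--model", model_id,
        "--dataset-name", "random",
        "--random-input-len", str(input_tokens),
        "--random-output-len", "512",
        "--random-prefix-len", "0",
        "--num-prompts", str(concurrency),
        "--max-concurrency", str(concurrency),
        "--request-rate", "inf",
        "--ignore-eos",
        "--temperature", "0",
        "--percentile-metrics", "ttft,tpot,itl,e2el",
    ]


# One shell-quoted command line per (input length, concurrency) pair.
commands = [
    shlex.join(bench_command("deepseek-ai/DeepSeek-V4-Flash", tokens, c))
    for tokens in (15872, 32256, 65024)
    for c in (1, 2, 4)
]
```

Each entry in `commands` can be pasted into a shell, or the argv list can be passed to `subprocess.run` directly.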

Vanilla DeepSeek-V4-Flash FP4/FP8

Model: deepseek-ai/DeepSeek-V4-Flash.

Observed implementation path:

| Stage | Observed path |
| --- | --- |
| Checkpoint quantization | quant_method=fp8, fmt=e4m3, scale_fmt=ue8m0, weight_block_size=[128,128] |
| Dense FP8 linear | MarlinFP8ScaledMMLinearKernel for Fp8LinearMethod; DeepGEMM UE8M0 config enabled |
| Routed experts | expert_dtype='fp4', Using 'MARLIN' Mxfp4 MoE backend |
| Attention / MLA | DeepSeek V4 NVIDIA FlashMLA path; 200k OOM stack was in vllm/models/deepseek_v4/nvidia/flashmla.py:127 |
| KV cache | kv_cache_dtype=fp8, Using DeepSeek's fp8_ds_mla KV cache format, FP8 indexer cache for Lightning Indexer |
| Compilation | vLLM compile mode and decode CUDA graphs were enabled, but workers logged that this DeepSeek-V4 model does not support torch.compile; monitored JIT mode was active |
| KV transfer | LMCache MP connector using lmcache.c_ops; benchmark prompts had 0.0% prefix-cache and external-prefix-cache hit rates |

| Context | C | Input tokens | Result | TTFT mean | Prefill tok/s/stream | TPOT mean | Decode tok/s/stream |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16k | 1 | 15,872 | pass | 15.78 s | 1005.7 | 61.61 ms | 16.23 |
| 16k | 2 | 15,872 | pass | 21.11 s | 751.8 | 72.64 ms | 13.77 |
| 16k | 4 | 15,872 | pass | 31.97 s | 496.5 | 93.50 ms | 10.70 |
| 32k | 1 | 32,256 | pass | 27.03 s | 1193.1 | 62.18 ms | 16.08 |
| 32k | 2 | 32,256 | pass | 38.33 s | 841.6 | 83.77 ms | 11.94 |
| 32k | 4 | 32,256 | pass | 60.99 s | 528.9 | 128.00 ms | 7.81 |
| 64k | 1 | 65,024 | pass | 51.14 s | 1271.4 | 62.93 ms | 15.89 |
| 64k | 2 | 65,024 | pass | 73.59 s | 883.6 | 107.88 ms | 9.27 |
| 64k | 4 | 65,024 | pass | 121.81 s | 533.8 | 198.18 ms | 5.05 |
| 200k | 1 | 199,488 | OOM | n/a | n/a | n/a | n/a |

The 200k C=1 request failed after startup with CUDA OOM on PP rank 9 while executing FlashMLA. The failed worker reported a 114 MiB allocation request with 95.75 MiB free; scheduler stats showed KV cache usage at 19.2%, so the failure was rank-local VRAM headroom rather than exhausting the configured KV cache.

AppMana DeepSeek-V4-Flash INT4/INT8

Model: appmana/deepseek-v4-int4-int8.

The INT4/INT8 runs below used the same PP=10, TP=1 serving topology and the same random-prefix benchmark shape as above. Decode-only throughput is computed as 1000 / TPOT_ms, so it is per active stream and separate from vLLM's aggregate output-token throughput.
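The per-stream figures can be reproduced from the reported latencies with a few lines of Python. The decode formula is the one stated above. The prefill formula (input tokens divided by mean TTFT) is an assumption that matches the reported values to within rounding of the TTFT. The function names are illustrative.

```python
def decode_tok_per_s(tpot_ms):
    """Decode-only throughput per active stream, as 1000 / TPOT_ms."""
    return 1000.0 / tpot_ms


def prefill_tok_per_s(input_tokens, ttft_s):
    """ASSUMPTION: prefill throughput per stream is input tokens / mean TTFT."""
    return input_tokens / ttft_s


# 16k, C=1 rows from this document's tables.
fp4_decode = round(decode_tok_per_s(61.61), 2)   # 16.23
int4_decode = round(decode_tok_per_s(34.48), 2)  # 29.0
```

Because this figure depends only on TPOT, it stays the same however long prefill took, which is why it differs from vLLM's aggregate output-token throughput for the same run.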

Observed implementation path:

| Stage | Observed path |
| --- | --- |
| Checkpoint quantization | Routed experts converted to INT4 W4A16; dense/shared expert/attention FP8 weights converted to AppMana INT8 W8A16 metadata |
| Dense and shared-expert GEMM | AllSpark W8A16 INT8 for most dense paths; dequant_channel_bf16_wo_a fallback appears for .attn.wo_a |
| Routed experts | Marlin MoE W4A16 INT4 |
| Attention / MLA | AppMana DeepSeek V4 sparse MLA path in the vLLM fork |
| KV cache | FP8 KV cache with DeepSeek V4 indexer block size 256 |
| KV transfer | LMCache MP connector using lmcache.c_ops; random-prefix benchmark rows showed no prefix-cache hits |

| Context | C | Input tokens | Result | TTFT mean | Prefill tok/s/stream | TPOT mean | Decode-only tok/s/stream | vLLM output tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1k | 1 | 1,000 | pass | 3.10 s | 322.1 | 34.09 ms | 29.33 | 24.94 |
| 16k | 1 | 15,488 | pass | 15.55 s | 995.7 | 34.48 ms | 29.00 | 15.43 |
| 16k | 2 | 15,488 | stalled | n/a | n/a | n/a | n/a | n/a |
| 32k | 1 | 31,488 | failed | n/a | n/a | n/a | n/a | n/a |

The 16k C=2 row accepted both streams but did not make decode progress: metrics showed vllm:num_requests_running=2, vllm:generation_tokens_total=1, and no completed requests after several minutes. The client was still waiting for streamed chunks when it was stopped.

The 32k C=1 row failed after the request started. PP rank 7 reported CUDA error: an illegal memory access was encountered in vllm/v1/worker/gpu/buffer_utils.py:37 while copying idx_mapping_np to the GPU from prepare_inputs; PP rank 8 then failed receiving from the previous pipeline stage. vLLM recorded zero successful requests for that row.

FP4/FP8 vs INT4/INT8 decode summary

These rows separate decode-only throughput from TTFT and prefill. The INT4/INT8 16k input length was 15,488 tokens in this run; the FP4/FP8 16k input length was 15,872 tokens in the earlier run.

| Context | C | FP4/FP8 TTFT | INT4/INT8 TTFT | FP4/FP8 prefill tok/s/stream | INT4/INT8 prefill tok/s/stream | FP4/FP8 TPOT | INT4/INT8 TPOT | FP4/FP8 decode-only tok/s/stream | INT4/INT8 decode-only tok/s/stream | INT4/INT8 result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16k | 1 | 15.78 s | 15.55 s | 1005.7 | 995.7 | 61.61 ms | 34.48 ms | 16.23 | 29.00 | pass |
| 16k | 2 | 21.11 s | n/a | 751.8 | n/a | 72.64 ms | n/a | 13.77 | n/a | stalled |
| 32k | 1 | 27.03 s | n/a | 1193.1 | n/a | 62.18 ms | n/a | 16.08 | n/a | failed |