metadata

license: apache-2.0
base_model: lmzheng/grok-1

Grok-1-W4A8KV8

Introduction

This model was created by applying Quark with calibration samples from the Pile dataset.

Quantization Strategy

Quantized Layers: All linear layers excluding "lm_head", "*.gate"
Weight: FP8 symmetric per-tensor; additionally, INT4 symmetric per-channel for MoE linear layers
Activation: FP8 symmetric per-tensor
KV Cache: FP8 symmetric per-tensor
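
As an illustration of the INT4 symmetric per-channel scheme listed above, here is a minimal sketch in plain Python. It assumes one scale per output channel (one row of the weight matrix), a scale of max|w| / 7, and round-to-nearest with clamping to [-8, 7]. The function names are hypothetical, and Quark's actual rounding and clipping may differ.

```python
def quantize_int4_per_channel(weight):
    """Symmetric INT4 quantization with one scale per row.

    weight: list of rows, one row per output channel (assumption).
    Returns (q_rows, scales), where q_rows holds integers in [-8, 7].
    """
    q_rows, scales = [], []
    for row in weight:
        amax = max(abs(x) for x in row)
        # Symmetric: zero-point is 0, so only a scale is stored.
        scale = amax / 7.0 if amax > 0 else 1.0
        # Note: Python's round() rounds exact halves to the nearest even integer.
        q_rows.append([max(-8, min(7, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_int4_per_channel(q_rows, scales):
    """Map quantized integers back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[7.0, -3.0, 1.0, 0.0], [14.0, -14.0, 2.0, 5.0]]
q, s = quantize_int4_per_channel(weights)
restored = dequantize_int4_per_channel(q, s)
```

Because each row gets its own scale, a channel with large weights does not force a coarse step size on channels with small weights, which is the usual motivation for per-channel over per-tensor scaling at low bit widths.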

INT4 Packing

Every eight int4 values are packed into a single int32 integer following the sequence defined by order_map = [0, 2, 4, 6, 1, 3, 5, 7].
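
The packing can be sketched as follows. This is an illustration under stated assumptions, not the exported checkpoint's verified layout: it assumes that nibble slot i (bits 4*i to 4*i+3, counted from the least significant bit) stores the value at index order_map[i], and that values are signed int4 in two's complement. The function names are hypothetical.

```python
ORDER_MAP = [0, 2, 4, 6, 1, 3, 5, 7]  # sequence given in this model card


def pack_int4(values):
    """Pack eight signed int4 values (-8..7) into one 32-bit word.

    Assumption: nibble slot i holds values[ORDER_MAP[i]].
    The result is returned as an unsigned Python int; storing it in an
    int32 tensor would reinterpret the same 32 bits as two's complement.
    """
    if len(values) != 8:
        raise ValueError("expected exactly eight values")
    word = 0
    for slot, src in enumerate(ORDER_MAP):
        v = values[src]
        if not -8 <= v <= 7:
            raise ValueError("value out of int4 range")
        word |= (v & 0xF) << (4 * slot)
    return word


def unpack_int4(word):
    """Inverse of pack_int4: recover the eight signed int4 values."""
    values = [0] * 8
    for slot, src in enumerate(ORDER_MAP):
        nibble = (word >> (4 * slot)) & 0xF
        values[src] = nibble - 16 if nibble >= 8 else nibble
    return values


packed = pack_int4([0, 1, 2, 3, 4, 5, 6, 7])
print(hex(packed))  # 0x75316420
```

Under this assumption the even-indexed values occupy the low 16 bits and the odd-indexed values the high 16 bits, so a reader of the checkpoint must apply the same order_map when unpacking.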

Quick Start

Download and install Quark
Run the quantization script in the example folder using the following command line:

export MODEL_DIR="[local model checkpoint folder]"  # or lmzheng/grok-1
python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir grok-1-W4A8KV8 \
        --quant_scheme TBD \
        --kv_cache_dtype fp8 \
        --num_calib_data 128 \
        --model_export hf_format \
        --multi_gpu \
        --custom_mode fp8

Deployment

Quark has its own export format and allows FP8 quantized models to be efficiently deployed using the SGLang backend.

Evaluation

Quark currently uses perplexity (PPL) as the evaluation metric for accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py. The evaluation is run in pseudo-quantization mode, so the results may differ slightly from the accuracy of actual quantized inference. These results are provided for reference only.
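
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows only that textbook definition; it is not the implementation in quantize_quark.py, and the function name and input format (a list of natural-log token probabilities) are assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p(token_i | context_i)))
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

Lower is better: a quantized model whose perplexity stays close to that of the original has lost little predictive accuracy on the evaluation text.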

Evaluation scores

Benchmark	grok-1	grok-1-W4A8KV8 (this model)
Perplexity-wikitext2	TBD	TBD
