license: apache-2.0
base_model: lmzheng/grok-1

Grok-1-W4A8KV8

Introduction

This model was created by applying Quark with calibration samples from the Pile dataset.

Quantization Strategy

  • Quantized Layers: All linear layers excluding "lm_head", "*.gate"
  • Weight: FP8 symmetric per-tensor; additionally, INT4 symmetric per-channel for MoE linear layers
  • Activation: FP8 symmetric per-tensor
  • KV Cache: FP8 symmetric per-tensor
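As a rough illustration of the "INT4 symmetric per-channel" scheme listed above, the following pure-Python sketch quantizes a small weight matrix with one scale per row and no zero point. It is not Quark's implementation: treating each row as one output channel, mapping the largest magnitude to 7, and the helper names `quantize_int4_per_channel` and `dequantize` are all assumptions made for the example. A per-tensor scheme would differ only in using a single scale for the whole tensor.

```python
def quantize_int4_per_channel(weight):
    """Symmetric INT4 quantization with one scale per row.

    Assumption: each row of `weight` is one output channel, and the
    largest magnitude in a row maps to 7 (the INT4 range is [-8, 7]).
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(x) for x in row)
        # Symmetric: no zero point, only a scale. Guard against all-zero rows.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q_rows.append([max(-8, min(7, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map INT4 codes back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.7, -0.3, 0.0], [14.0, 6.0, -14.0]]
q, scales = quantize_int4_per_channel(weight)
restored = dequantize(q, scales)
```

Each row gets its own scale, so a row with small weights is not forced onto the coarse grid that a row with large weights would need.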

INT4 Packing

Every eight int4 values are packed into a single int32 integer following the sequence defined by order_map = [0, 2, 4, 6, 1, 3, 5, 7].
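The packing can be sketched as follows. The exact bit layout is not stated here, so this example assumes that the i-th 4-bit slot, counted from the least significant bits, holds the value at index order_map[i], and that values are signed int4 stored in two's complement. The function names `pack_int4` and `unpack_int4` are hypothetical and are not part of Quark.

```python
ORDER_MAP = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_int4(values):
    """Pack eight signed int4 values into one signed int32.

    Assumption: slot i (4 bits, starting at the least significant end)
    stores values[ORDER_MAP[i]].
    """
    if len(values) != 8:
        raise ValueError("expected exactly eight int4 values")
    packed = 0
    for slot, src in enumerate(ORDER_MAP):
        v = values[src]
        if not -8 <= v <= 7:
            raise ValueError(f"value {v} does not fit in int4")
        packed |= (v & 0xF) << (4 * slot)
    # Reinterpret the 32-bit pattern as a signed int32.
    return packed - (1 << 32) if packed >= (1 << 31) else packed


def unpack_int4(packed):
    """Inverse of pack_int4: recover the eight signed int4 values."""
    packed &= 0xFFFFFFFF
    values = [0] * 8
    for slot, src in enumerate(ORDER_MAP):
        nibble = (packed >> (4 * slot)) & 0xF
        # Sign-extend the 4-bit two's-complement nibble.
        values[src] = nibble - 16 if nibble >= 8 else nibble
    return values


packed = pack_int4([0, 1, 2, 3, 4, 5, 6, 7])
print(hex(packed))  # 0x75316420
```

Under this assumption the even-indexed values occupy the low 16 bits and the odd-indexed values the high 16 bits of each int32.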

Quick Start

  1. Download and install Quark
  2. Run the quantization script in the example folder using the following command line:
export MODEL_DIR="[local model checkpoint folder]"  # or lmzheng/grok-1
python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir grok-1-W4A8KV8 \
        --quant_scheme TBD \
        --kv_cache_dtype fp8 \
        --num_calib_data 128 \
        --model_export hf_format \
        --multi_gpu \
        --custom_mode fp8

Deployment

Quark has its own export format and allows FP8 quantized models to be efficiently deployed using the SGLang backend.

Evaluation

Quark currently uses perplexity (PPL) as the evaluation metric for accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py. The evaluation results are obtained in pseudo-quantization mode, so they may differ slightly from the accuracy of actual quantized inference. These results are provided for reference only.
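For readers unfamiliar with the metric, perplexity is commonly defined as the exponential of the mean negative log-likelihood per token. The sketch below shows only that standard definition; it is not taken from quantize_quark.py, whose windowing and tokenization details may differ, and the function name `perplexity` is chosen for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p(token_i | context))).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

Lower is better, and comparing the value before and after quantization on the same text gives a rough measure of accuracy loss.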

Evaluation scores

| Benchmark | grok-1 | grok-1-W4A8KV8 (this model) |
| --- | --- | --- |
| Perplexity-wikitext2 | TBD | TBD |

License

Modifications Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.