Update README.md
README.md
CHANGED
```diff
@@ -228,8 +228,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq
 
 | Benchmark        |                |                                |
 |------------------|----------------|--------------------------------|
-| |
-| Peak Memory (GB) |
+|                  | Qwen3-8B       | Qwen3-8B-int4wo-hqq            |
+| Peak Memory (GB) | 6.41           | 6.27 (TODO% reduction)         |
 
 
 ## Code Example
@@ -240,8 +240,8 @@ We can use the following code to get a sense of peak memory usage during inference
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 
-# use "
-model_id = "pytorch/
+# use "Qwen/Qwen3-8B" or "pytorch/Qwen3-8B-int4wo-hqq"
+model_id = "pytorch/Qwen3-8B-int4wo-hqq"
 quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
```
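The hunk ends at the tokenizer line, so the peak-memory measurement that follows in the README is not visible here. Below is a minimal sketch of how the "Peak Memory (GB)" numbers in the table could be reproduced, assuming a single CUDA device and using the `torch.cuda` allocator statistics; the prompt and generation settings are illustrative assumptions, not taken from the README.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Either the baseline "Qwen/Qwen3-8B" or the quantized checkpoint, loaded as in the diff above.
model_id = "pytorch/Qwen3-8B-int4wo-hqq"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Reset allocator statistics so the reading reflects only this inference run.
torch.cuda.reset_peak_memory_stats()

# Hypothetical prompt and generation length; the README's actual measurement
# code is not shown in this hunk.
prompt = "Give a brief overview of int4 weight-only quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)
with torch.no_grad():
    quantized_model.generate(**inputs, max_new_tokens=128)

# Peak memory reserved by the caching allocator during inference, in GB.
peak_gb = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory (GB): {peak_gb:.2f}")
```

`torch.cuda.max_memory_reserved()` reports what the caching allocator holds, which is typically a bit higher than `torch.cuda.max_memory_allocated()`; which one the table uses is not stated in this hunk.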