Update README.md
README.md CHANGED
@@ -38,36 +38,6 @@ python3 quantize_quark.py --model_dir "meta-llama/Meta-Llama-3.1-405B-Instruct"
 
 # Deployment
 
-## Use with vLLM
-
-This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
-
-```python
-from vllm import LLM, SamplingParams
-from transformers import AutoTokenizer
-
-model_id = "amd/Llama-3.1-405B-Instruct-MXFP4-Preview"
-number_gpus = 8
-
-sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-messages = [
-    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
-    {"role": "user", "content": "Who are you?"},
-]
-
-prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
-
-llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=4096)
-
-outputs = llm.generate(prompts, sampling_params)
-
-generated_text = outputs[0].outputs[0].text
-print(generated_text)
-```
-
 ## Evaluation
 
 The model was evaluated on MMLU and GSM8K_COT.