Running gpt-oss-20b on an RTX 4070 Ti (12GB) using Transformers
Hi everyone,
I'd like to share the method I used to run the gpt-oss-20b model on a single RTX 4070Ti (12GB VRAM) using the transformers library.
First, as the guide says, the MXFP4-quantized model cannot be used on 40-series cards.
Therefore, you need to recover the original weights by loading the model with the dequantize option.
In this case, if you use device_map='auto', a KeyError occurs because the model gets distributed across multiple devices.
So load it on the CPU only, then save it locally.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils.quantization_config import Mxfp4Config

model_id = "openai/gpt-oss-20b"
save_path = './gpt-oss-model-local'

try:
    # Dequantize the MXFP4 checkpoint back to bf16, keeping everything on the CPU.
    quantization_config = Mxfp4Config(dequantize=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        device_map="cpu"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Save the dequantized model locally for the second step.
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
except Exception as e:
    print(e)
Next, load the dequantized model you saved locally, quantizing it to 4-bit with bitsandbytes (bnb).
If you use device_map='auto', a VRAM OOM will occur, so you need to map the layers manually.
On the 4070 Ti, at most 15 layers fit on the GPU; setting it any higher caused an OOM.
If you are using a better 40-series GPU, you can raise this value (and I would be grateful if you could let me know the results).
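As a rough back-of-envelope check (my own estimate, not measured; gpt-oss-20b has roughly 21B total parameters across 24 layers), you can see why only part of the model fits:

# Rough, hand-wavy VRAM estimate; real usage depends on how much of each layer
# bitsandbytes actually quantizes, plus KV cache, activations and CUDA overhead.
total_params = 21e9      # approximate total parameter count of gpt-oss-20b
num_layers = 24

for label, bytes_per_param in [("4-bit", 0.5), ("bf16", 2.0)]:
    weights_gb = total_params * bytes_per_param / 1e9
    print(f"{label}: ~{weights_gb:.0f} GB total, ~{weights_gb / num_layers:.2f} GB per layer")

# With 12 GB of VRAM, the number of layers that fit lands somewhere between these
# two bounds, which is why num_gpu_layers has to be tuned per card.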
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time

model_path = './gpt-oss-model-local'

# NF4 4-bit quantization, with fp32 CPU offload for the layers that do not fit in VRAM.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True
)

# Put the first num_gpu_layers layers on the GPU and offload the rest to the CPU.
num_gpu_layers = 15
num_total_layers = 24
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(num_gpu_layers)},
    **{f"model.layers.{i}": "cpu" for i in range(num_gpu_layers, num_total_layers)},
    "model.norm": "cpu",
    "lm_head": "cpu"
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    # attn_implementation="flash_attention_3",
    # attn_implementation="sdpa",
    device_map=device_map
)
print(f"current attention impl: {model.config._attn_implementation}")

tokenizer = AutoTokenizer.from_pretrained(model_path)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is in simple terms."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

max_new_tokens = 1
start_time = time.perf_counter()
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    temperature=0.7  # note: temperature has no effect unless do_sample=True
)
print(tokenizer.decode(outputs[0]))
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print("inf end")
print(tokenizer.decode(outputs[0]))
print("\n" + "="*30)
print(f"elapsed time: {elapsed_time:.2f}sec")
print("="*30)
Additionally, flash attention does not seem to work, so the model falls back to the default eager implementation. If anyone has it working, I would appreciate it if you could share.
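For reference, here is a small sketch (my own addition, not a confirmed fix) of how you could probe the attention backends in order and fall back to eager when one fails to load:

# Try attention backends in order of preference; this reloads the model each time,
# so it is only meant as a quick probe, not production code.
for impl in ("flash_attention_2", "sdpa", "eager"):
    try:
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            quantization_config=quantization_config,
            attn_implementation=impl,
            device_map=device_map,
        )
        print(f"loaded with attn_implementation={impl}")
        break
    except (ImportError, ValueError) as e:
        print(f"{impl} not usable: {e}")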
When set to a reasoning:medium level, it took about 4 to 5 minutes to infer a single token.
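If the long reasoning phase is the main cost, the gpt-oss chat template appears to accept a reasoning_effort argument (defaulting to medium); I have not verified this on the dequantized local copy, so treat it as an assumption and check the model card:

# Assumption: the chat template exposes reasoning_effort ("low" / "medium" / "high").
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to(model.device)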
As a result, it seems difficult to use gpt-oss-20b at 4-bit quantization on a GPU like the RTX 4070 Ti.
If there's anything I've missed, or if anyone has had success running this on an RTX 40-series GPU, please let me know.
Thanks
Hey Biiigstone,
I have a single RTX 4070 12GB (Ubuntu 24.04, 4-core i7-6700K, 64 gig ram, 1tb pcie ssd), and I think I got this running? Or maybe I got an error message. I'm really not sure.
I used your first set of brilliant code above, and it worked great (after installing torch). I had already downloaded the 20b model, so I pointed it at that directory; it loaded the checkpoints and completed pretty quickly.
Then I ran your second set of code against the de-quantized model and started with this value you included:
num_gpu_layers = 15
It kept giving me out-of-memory errors, so I decreased the value by 1 at a time until I got down to
num_gpu_layers = 6
At that point it gave me this output (yes I named the second python program above "get20b-running.py"):
...
./get20b-running.py
Loading checkpoint shards: 100%|█████████████████████████████████████████| 9/9 [00:33<00:00, 3.73s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
current attention impl: eager
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-22
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Explain what MXFP4 quantization is in simple terms.<|end|><|start|>assistant<|channel|>
inf end
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-22
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Explain what MXFP4 quantization is in simple terms.<|end|><|start|>assistant<|channel|>
==============================
elapsed time: 5.46sec
...
I'm not sure that worked right? I don't know. The whole bit about "valid channels" kind of threw me off. I know that's related to a jinja template somewhere (not sure where) and to how you're supposed to communicate with the model (not sure what it is exactly), but a couple dozen stabs in the dark with code like this didn't seem to help:
messages = [
    {"role": "user", "content": "Please explain why the sky is blue."},
    {"role": "assistant", "channel": "final"},
]
Mostly I ended up with error messages saying that the new values I added to messages were not defined in the dictionary:
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content'
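Looking at the error again, I think the template just expects every message dict to have both "role" and "content", and the channel markers get added by the template itself when add_generation_prompt=True, so a plain message list like the one in the original post (this is my guess, not something I've confirmed) would be:

# Each entry needs "role" and "content"; don't pass a "channel" key yourself.
messages = [
    {"role": "user", "content": "Please explain why the sky is blue."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)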
As an aside, I ran transformers serve after this, and then the chat interface using transformers chat localhost:8000 --model-name-or-path /gitrepos/gpt-oss-20b in a different terminal window. I put in "why is the sky blue" and it took about 30 minutes total to grind the answer out and print it to my screen (not the 5.46 seconds I got above, unfortunately).
Thanks,
Zisty
Okay, here's an update. I took some time to try to figure out what was going on, since I wasn't getting an answer generated for my prompt. It's probably obvious to anyone who's been doing this a while, but running my own LLM is pretty new to me. It took me a while to work out that the setting above, "max_new_tokens = 1", really means "generate only one new token", which in practice means no answer to my prompt. Great for measuring the time it takes to run the initial question against GPU-memory-resident models, but I wanted to get the answer too. Also, I found that I had to lower to "num_gpu_layers = 5", not 6, to eliminate all out-of-memory errors.
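One small thing I figured out along the way (could be off, I'm new to this): to print just the generated answer instead of the whole prompt being echoed back, you can slice off the input tokens before decoding:

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))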
So here are some timings with various max_new_tokens sizes, when prompting the gpt-oss-20b model with "why is the sky blue":
5.46 seconds - max_new_tokens=1 (but no output!)
598.44 seconds - max_new_tokens=128
1289.31 seconds - max_new_tokens=256
2381.94 seconds - max_new_tokens=512
8046.80 seconds - max_new_tokens=4096 - this generated a complete answer along with the thinking steps
The time increase seems roughly linear up to 512 tokens (about 4.7 to 5 seconds per token); the 4096-token run averaged closer to 2 seconds per token. The output cut off at the desired token limit but was still in "thinking" mode in all but the last run, so the real final "answer" was only produced at 4096 tokens.
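Dividing the times by the token counts (my own quick arithmetic on the numbers above):

# Seconds per generated token for each run.
for tokens, seconds in [(128, 598.44), (256, 1289.31), (512, 2381.94), (4096, 8046.80)]:
    print(f"{tokens:5d} tokens: {seconds / tokens:.2f} s/token")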
The RTX 4070 ran at around 10 GB of memory in use and 96% busy, but only around 50 watts of its 200-watt max, for most of the time the inference was running. The CPU was 100% busy on just one core; the other cores were idle. I'm sure my time would have been faster with a better CPU in this test bench system, or a bigger, badder video card. Using "glances" to monitor the system, as well as "nvidia-smi -l 2", the biggest WAIT time listed was for I/O copying model data from the PCIe SSD to the GPU. This system isn't particularly fast in that area, so that makes sense.
I think I'm going to look around for a smaller model, something that might have a chance of fitting on the memory in my GPU.
Again, BIG HUGE THANKS to the original poster. This code really helped me get started with local inference while avoiding out of memory errors.
Kind regards,
Zisty