How to set reasoning effort in the shown example?

#47
by TianheWu - opened

How can we apply different reasoning effort levels in this example?

Thanks.

from transformers import pipeline
import torch

model_id = "openai/gpt-oss-120b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Check the discussion linked below:

https://huggingface.co/openai/gpt-oss-20b/discussions/28

I've seen that. Maybe that is specific to a vLLM version? How do I use transformers with different reasoning levels?

How did you do it? I'm trying to adjust the reasoning level within my llama.cpp command:

llama-server \
--model /mnt/Storage-1-BK/LLMs/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads 32 \
--ctx-size 32768 \
--n-gpu-layers 999 \
--device CUDA0,CUDA1,CUDA2,CUDA3 \
--split-mode layer \
--temp 1.0 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0 \
--flash-attn \
--jinja \
--port 8000 \
--host 0.0.0.0

Sorry, I found that there is a bug when I use:

SYSTEM_PROMPT = """You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: high

# Valid channels: analysis, final. Channel must be included for every message."""

messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": question}]

The output is:

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: high

# Valid channels: analysis, final. Channel must be included for every message.<|end|><|start|>user<|message|>Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20_{}^{}!$ be the resulting product?<|end|><|start|>assistant<|channel|>analysis<|message|>We need to parse the problem: "Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20!$ be the resulting product?" So we consider rational numbers r in (0,1). Write r = a/b in lowest terms (i.e., gcd(a,b)=1). Compute product a*b. We want a*b = 20!. Count how many such rational numbers between 0 and 1 satisfy that. So we need to count pairs (a,b) with 1 <= a < b, gcd(a,b)=1, and a*b = 20!. Also a,b positive integers. So we need to find all factorizations of 20! into two coprime factors a and b with a<b. Since a*b = N = 20!. We need to count number of unordered pairs (a,b) with a<b, gcd(a,b)=1, a*b=N. Equivalent to number of ways to split prime factors of N into two groups such that the two groups are coprime (i.e., no common prime factor). But since N's prime factorization includes each prime with some exponent. For a and b to be coprime, each prime's entire exponent must go to either a or b, not split. Because if a prime p appears in both a and b, they'd share p, gcd>1. So for each prime p dividing N, we must assign all p^e to either a or b. So the number of ways to assign each prime to one of the two numbers is 2^k where k is number of distinct primes dividing N. But we also need a<b. So we count half of them? But careful: if a=b, that would require a=b= sqrt(N). But N is not a perfect square? Let's check 20! prime exponents. 20! = product of primes <=20. Let's compute exponents: For each prime p <=20, exponent floor(20/p)+floor(20/p^2)+... . Let's compute:

You have to pass reasoning_effort as a separate chat template property, not as part of the messages: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_template.jinja#L267-L270

As the rendered output above shows, the template always emits its own system header (which is where the Reasoning: line lives) and moves a custom system prompt into a developer message, so writing "Reasoning: high" inside your system prompt never reaches that header.
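For a quick sanity check, here is a minimal sketch that only renders the prompt text so you can see the effort level land in the system header (the model id and message are placeholders, not from the posts above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# Render to text instead of token ids so the system header is easy to inspect
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    reasoning_effort="high",  # "low", "medium", or "high"
)

# The <|start|>system<|message|> block should now contain "Reasoning: high"
print(prompt)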

Could you share an example of how you might do that please?

from openai import OpenAI

# Assumes an endpoint that implements the OpenAI Responses API; adjust base_url/api_key for a local server
client = OpenAI()

model = "openai/gpt-oss-120b"  # placeholder: use the model name your server exposes

response = client.responses.create(
    model=model,
    input="Tell me a story",
    instructions="You are a helpful assistant.",
    reasoning={
        "effort": "low",  # "low", "medium", or "high"
        "summary": "auto",  # "auto", "concise", or "detailed"
    },
)
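That snippet uses the Responses API shape; the chat completions style request body below passes the same setting as a flat reasoning_effort field instead: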

{ "model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Let N denote the numbers of ordered triples of positive integers ( a , b , c ) such that a , b , c ≤ 3 6 and a 3 + b 3 + c 3 is a multiple of 3 7 . Find the remainder when N is divided by 1000 ."}], "max_tokens": 32000, "stream": true, "temperature": 0.6, "reasoning_effort": "low" }

I'm now running this model via vllm serve instead of llama.cpp, but specifying something like "Reasoning: High" at the start of my chat (similar to how Qwen does it with no_think) appears to have no effect at all on the reasoning effort being applied.

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve openai/gpt-oss-120b \
--served-model-name VLLM-GPT-OSS-120b \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 131072 \
--max-num-seqs 1 \
--download-dir /mnt/Storage-1-BK/LLMs/VLLM-MODELS/
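Prompting "Reasoning: High" in the chat can't work, because that line lives in the system header the chat template writes itself (see the rendered output earlier in the thread); the effort level has to reach the template as a kwarg. Below is a minimal sketch, under the assumption that your vLLM version forwards chat_template_kwargs from the request body to the template (check your version's docs if it doesn't):

from openai import OpenAI

# Points at the vllm serve instance above; the api_key is just a placeholder for a local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="VLLM-GPT-OSS-120b",  # the --served-model-name used above
    messages=[{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}],
    max_tokens=2048,
    # Assumption: forwarded to the jinja chat template, where reasoning_effort is read
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)

print(completion.choices[0].message.content)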

For setting this via llama.cpp (in case anyone else needs it), I was able to use the command below:

llama-server \
--model /mnt/Storage-1-BK/LLMs/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads 32 \
--ctx-size 131072 \
--n-gpu-layers 999 \
--device CUDA0,CUDA1,CUDA2,CUDA3 \
--split-mode layer \
--temp 1.0 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0 \
--flash-attn \
--jinja \
--chat-template-kwargs '{"reasoning_effort": "low"}' \
--port 8000 \
--host 0.0.0.0
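Note that passing --chat-template-kwargs at launch like this fixes the effort level for every request that server instance handles; to change it you restart with a different value (or pass template kwargs per request, if your llama.cpp build supports that).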

Here is how I set the reasoning_effort to low while doing inference with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
 
model_name = "openai/gpt-oss-20b"
 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
 
messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]
 
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low"  # <---- adjusts the reasoning
).to(model.device)
 
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,   # needed for temperature to take effect
    temperature=0.7
)
 
print(tokenizer.decode(outputs[0]))
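If you start from the pipeline example in the original question, the same idea applies: reasoning_effort is an argument to the chat template, not to generate(), so the simplest route is to tokenize with apply_chat_template yourself as above and then call model.generate (or check whether your transformers version lets the pipeline forward extra chat template kwargs).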
