How to set reasoning effort in the shown example?

#47
by TianheWu - opened

How can we apply different reasoning effort levels in this example?

Thanks.

from transformers import pipeline
import torch

model_id = "openai/gpt-oss-120b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Check the discussion linked below:

https://huggingface.co/openai/gpt-oss-20b/discussions/28

I've seen that. Maybe that is specific to a vLLM version? How do I use transformers with different reasoning levels?

How did you do it? I'm trying to adjust the reasoning level within my llama.cpp command:

llama-server \
--model /mnt/Storage-1-BK/LLMs/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads 32 \
--ctx-size 32768 \
--n-gpu-layers 999 \
--device CUDA0,CUDA1,CUDA2,CUDA3 \
--split-mode layer \
--temp 1.0 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0 \
--flash-attn \
--jinja \
--port 8000 \
--host 0.0.0.0

Sorry, I found that there is a bug when I use:

SYSTEM_PROMPT = """You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: high

# Valid channels: analysis, final. Channel must be included for every message."""

messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": question}]

The output is:

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: high

# Valid channels: analysis, final. Channel must be included for every message.<|end|><|start|>user<|message|>Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20_{}^{}!$ be the resulting product?<|end|><|start|>assistant<|channel|>analysis<|message|>We need to parse the problem: "Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20!$ be the resulting product?" So we consider rational numbers r in (0,1). Write r = a/b in lowest terms (i.e., gcd(a,b)=1). Compute product a*b. We want a*b = 20!. Count how many such rational numbers between 0 and 1 satisfy that. So we need to count pairs (a,b) with 1 <= a < b, gcd(a,b)=1, and a*b = 20!. Also a,b positive integers. So we need to find all factorizations of 20! into two coprime factors a and b with a<b. Since a*b = N = 20!. We need to count number of unordered pairs (a,b) with a<b, gcd(a,b)=1, a*b=N. Equivalent to number of ways to split prime factors of N into two groups such that the two groups are coprime (i.e., no common prime factor). But since N's prime factorization includes each prime with some exponent. For a and b to be coprime, each prime's entire exponent must go to either a or b, not split. Because if a prime p appears in both a and b, they'd share p, gcd>1. So for each prime p dividing N, we must assign all p^e to either a or b. So the number of ways to assign each prime to one of the two numbers is 2^k where k is number of distinct primes dividing N. But we also need a<b. So we count half of them? But careful: if a=b, that would require a=b= sqrt(N). But N is not a perfect square? Let's check 20! prime exponents. 20! = product of primes <=20. Let's compute exponents: For each prime p <=20, exponent floor(20/p)+floor(20/p^2)+... . Let's compute:

You have to pass reasoning_effort as a separate chat template property, not as part of the messages: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_template.jinja#L267-L270

As the rendered output above shows, the template always emits its own system header (which is where the Reasoning: line lives) and moves a custom system prompt into a developer message, so writing "Reasoning: high" inside your system prompt never reaches that header.
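For a quick sanity check, here is a minimal sketch that only renders the prompt text so you can see the effort level land in the system header (the model id and message are placeholders, not from the posts above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# Render to text instead of token ids so the system header is easy to inspect
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    reasoning_effort="high",  # "low", "medium", or "high"
)

# The <|start|>system<|message|> block should now contain "Reasoning: high"
print(prompt)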

Could you share an example of how you might do that please?

from openai import OpenAI

# Assumes an endpoint that implements the OpenAI Responses API; adjust base_url/api_key for a local server
client = OpenAI()

model = "openai/gpt-oss-120b"  # placeholder: use the model name your server exposes

response = client.responses.create(
    model=model,
    input="Tell me a story",
    instructions="You are a helpful assistant.",
    reasoning={
        "effort": "low",  # "low", "medium", or "high"
        "summary": "auto",  # "auto", "concise", or "detailed"
    },
)
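That snippet uses the Responses API shape; the chat completions style request body below passes the same setting as a flat reasoning_effort field instead: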

{ "model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Let N denote the numbers of ordered triples of positive integers ( a , b , c ) such that a , b , c ≤ 3 6 and a 3 + b 3 + c 3 is a multiple of 3 7 . Find the remainder when N is divided by 1000 ."}], "max_tokens": 32000, "stream": true, "temperature": 0.6, "reasoning_effort": "low" }

I'm now running this model via vllm serve instead of llama.cpp, but specifying something like "Reasoning: High" at the start of my chat (similar to how Qwen does it with no_think) appears to have no effect at all on the reasoning effort being applied.

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve openai/gpt-oss-120b \
--served-model-name VLLM-GPT-OSS-120b \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 131072 \
--max-num-seqs 1 \
--download-dir /mnt/Storage-1-BK/LLMs/VLLM-MODELS/
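Prompting "Reasoning: High" in the chat can't work, because that line lives in the system header the chat template writes itself (see the rendered output earlier in the thread); the effort level has to reach the template as a kwarg. Below is a minimal sketch, under the assumption that your vLLM version forwards chat_template_kwargs from the request body to the template (check your version's docs if it doesn't):

from openai import OpenAI

# Points at the vllm serve instance above; the api_key is just a placeholder for a local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="VLLM-GPT-OSS-120b",  # the --served-model-name used above
    messages=[{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}],
    max_tokens=2048,
    # Assumption: forwarded to the jinja chat template, where reasoning_effort is read
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)

print(completion.choices[0].message.content)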

For setting this via llama.cpp (in case anyone else needs it), I was able to use the command below:

llama-server \
--model /mnt/Storage-1-BK/LLMs/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads 32 \
--ctx-size 131072 \
--n-gpu-layers 999 \
--device CUDA0,CUDA1,CUDA2,CUDA3 \
--split-mode layer \
--temp 1.0 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0 \
--flash-attn \
--jinja \
--chat-template-kwargs '{"reasoning_effort": "low"}' \
--port 8000 \
--host 0.0.0.0
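Note that passing --chat-template-kwargs at launch like this fixes the effort level for every request that server instance handles; to change it you restart with a different value (or pass template kwargs per request, if your llama.cpp build supports that).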

Here is how I set the reasoning_effort to low while doing inference with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
 
model_name = "openai/gpt-oss-20b"
 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
 
messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]
 
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low"  # <---- adjusts the reasoning
).to(model.device)
 
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,   # needed for temperature to take effect
    temperature=0.7
)
 
print(tokenizer.decode(outputs[0]))
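If you start from the pipeline example in the original question, the same idea applies: reasoning_effort is an argument to the chat template, not to generate(), so the simplest route is to tokenize with apply_chat_template yourself as above and then call model.generate (or check whether your transformers version lets the pipeline forward extra chat template kwargs).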
