How do I set the reasoning effort in the example shown below?
Thanks.
How can we apply varying reasoning efforts in this example?
from transformers import pipeline
import torch

model_id = "openai/gpt-oss-120b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
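One option (a minimal sketch, not from the original post, assuming a transformers version whose apply_chat_template forwards extra keyword arguments to the chat template): render the template yourself so you can pass reasoning_effort, then hand the formatted prompt string to the same pipeline.

# Sketch: reuses `pipe` and `messages` from the snippet above.
# reasoning_effort is forwarded to the gpt-oss chat template as a template variable.
prompt = pipe.tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    reasoning_effort="high",  # "low", "medium", or "high"
)
outputs = pipe(prompt, max_new_tokens=256, return_full_text=False)
print(outputs[0]["generated_text"])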
Check the contents below.
I've seen that. Maybe it's a vLLM version thing? How do you use transformers with different reasoning levels?
How did you do it? I'm trying to adjust the reasoning level within my llama.cpp command:
llama-server \
    --model /mnt/Storage-1-BK/LLMs/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
    --threads 32 \
    --ctx-size 32768 \
    --n-gpu-layers 999 \
    --device CUDA0,CUDA1,CUDA2,CUDA3 \
    --split-mode layer \
    --temp 1.0 \
    --min-p 0.0 \
    --top-p 1.0 \
    --top-k 0 \
    --flash-attn \
    --jinja \
    --port 8000 \
    --host 0.0.0.0
Sorry, I found that there are bugs when I use:
SYSTEM_PROMPT = """You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06
Reasoning: high
# Valid channels: analysis, final. Channel must be included for every message."""
messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": question}]
The output is:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06
Reasoning: high
# Valid channels: analysis, final. Channel must be included for every message.<|end|><|start|>user<|message|>Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20_{}^{}!$ be the resulting product?<|end|><|start|>assistant<|channel|>analysis<|message|>We need to parse the problem: "Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20!$ be the resulting product?" So we consider rational numbers r in (0,1). Write r = a/b in lowest terms (i.e., gcd(a,b)=1). Compute product a*b. We want a*b = 20!. Count how many such rational numbers between 0 and 1 satisfy that. So we need to count pairs (a,b) with 1 <= a < b, gcd(a,b)=1, and a*b = 20!. Also a,b positive integers. So we need to find all factorizations of 20! into two coprime factors a and b with a<b. Since a*b = N = 20!. We need to count number of unordered pairs (a,b) with a<b, gcd(a,b)=1, a*b=N. Equivalent to number of ways to split prime factors of N into two groups such that the two groups are coprime (i.e., no common prime factor). But since N's prime factorization includes each prime with some exponent. For a and b to be coprime, each prime's entire exponent must go to either a or b, not split. Because if a prime p appears in both a and b, they'd share p, gcd>1. So for each prime p dividing N, we must assign all p^e to either a or b. So the number of ways to assign each prime to one of the two numbers is 2^k where k is number of distinct primes dividing N. But we also need a<b. So we count half of them? But careful: if a=b, that would require a=b= sqrt(N). But N is not a perfect square? Let's check 20! prime exponents. 20! = product of primes <=20. Let's compute exponents: For each prime p <=20, exponent floor(20/p)+floor(20/p^2)+... . Let's compute:
You have to pass reasoning_effort in as a separate property, not as part of the messages: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_template.jinja#L267-L270
Could you share an example of how you might do that please?
response = client.responses.create(
    model=model,
    input="Tell me a story",
    instructions="You are a helpful assistant.",
    reasoning={
        "effort": "low",     # "low", "medium", or "high"
        "summary": "auto",   # "auto", "concise", or "detailed"
    },
)
{ "model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Let N denote the numbers of ordered triples of positive integers ( a , b , c ) such that a , b , c ≤ 3 6 and a 3 + b 3 + c 3 is a multiple of 3 7 . Find the remainder when N is divided by 1000 ."}], "max_tokens": 32000, "stream": true, "temperature": 0.6, "reasoning_effort": "low" }
I'm now running this model via vllm serve instead of llama.cpp, but specifying something like "Reasoning: High" at the start of my chat (similar to how Qwen does it with no_think) appears to have no effect at all on the reasoning effort being applied
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve openai/gpt-oss-120b \
    --served-model-name VLLM-GPT-OSS-120b \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 131072 \
    --max-num-seqs 1 \
    --download-dir /mnt/Storage-1-BK/LLMs/VLLM-MODELS/
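Since the gpt-oss template reads reasoning_effort as a template variable, writing "Reasoning: High" into the chat text won't change it. One thing to try (an untested sketch, assuming a vLLM build recent enough that its OpenAI-compatible server forwards chat_template_kwargs to the chat template): pass it per request via extra_body.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="VLLM-GPT-OSS-120b",  # the --served-model-name used above
    messages=[{"role": "user", "content": "Explain quantum mechanics concisely."}],
    # Assumption: this vLLM build accepts chat_template_kwargs in the request body.
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)
print(response.choices[0].message.content)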
For setting this via llama.cpp (in case anyone else needs it), I was able to use the command below:
llama-server \
    --model /mnt/Storage-1-BK/LLMs/unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 999 \
    --device CUDA0,CUDA1,CUDA2,CUDA3 \
    --split-mode layer \
    --temp 1.0 \
    --min-p 0.0 \
    --top-p 1.0 \
    --top-k 0 \
    --flash-attn \
    --jinja \
    --chat-template-kwargs '{"reasoning_effort": "low"}' \
    --port 8000 \
    --host 0.0.0.0
Here is how I set the reasoning_effort to low while doing inference with Hugging Face transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low"  # <---- adjusts the reasoning
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7
)
print(tokenizer.decode(outputs[0]))
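And to answer the earlier question about varying reasoning efforts: a small sketch continuing the snippet above (reusing tokenizer, model, and messages) that simply re-renders the template once per effort level and generates with each.

# Continues the snippet above: `tokenizer`, `model`, and `messages` are reused.
for effort in ["low", "medium", "high"]:
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=effort,
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7)
    # Decode only the newly generated tokens to make the outputs easier to compare.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    print(f"--- reasoning_effort={effort} ---")
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))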