Update README.md
README.md
CHANGED
@@ -40,7 +40,7 @@ Research on hierarchical RL for reasoning; math tutoring prototypes with human o
 | Model | Greedy (%) | Maj@8 (%) | Notes |
 |-------|------------|-----------|-------|
 | Qwen2.5-Math-1.5B-Instruct | 84.8 | 89.5 | Reported settings |
-| **
+| **Qwen2.5-Math-1.5B-TreeRPO** | **86.4** | **89.6** | Same decoding (temp 0 / temp 0.7, top-p 0.8) |
 
 - **Greedy** = temperature 0, deterministic decoding.
 - **Maj@8** = majority vote over 8 sampled completions (temp 0.7, top-p 0.8) on the final boxed answer; ties or a missing boxed answer count as incorrect (see the sketch after the diff). Single-run numbers, no multi-seed variance.
@@ -55,7 +55,18 @@ model_name = "your-namespace/TreeRPO-Qwen2.5-Math-1.5B"
 tok = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
 
-
-
-
-
+messages = [
+    {"role": "system", "content": "You are a helpful math reasoning assistant. Provide step-by-step reasoning and put the final answer in \\boxed{}."},
+    {"role": "user", "content": "If 3x + 5 = 17, what is x?"}
+]
+
+prompt_text = tok.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+
+inputs = tok(prompt_text, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy (temperature 0) decoding
+print(tok.decode(outputs[0], skip_special_tokens=True))
+
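The README does not ship an evaluation script, so here is a minimal sketch of the **Maj@8** rule described above, continuing from the `tok`, `model`, and `inputs` objects defined in the usage snippet: sample 8 completions at temp 0.7 / top-p 0.8, extract the last `\boxed{...}` answer from each, and take a majority vote, counting ties and missing boxed answers as incorrect. The helpers `extract_boxed` and `maj_at_8_correct` and the exact-string comparison are illustrative assumptions, not the authors' evaluation code.

```python
import re
from collections import Counter

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None.
    Handles one level of nested braces, which covers most final answers."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1].strip() if matches else None

def maj_at_8_correct(completions, reference):
    """Majority vote over boxed answers; ties and missing answers count as incorrect."""
    answers = [a for a in (extract_boxed(c) for c in completions) if a is not None]
    if not answers:
        return False
    counts = Counter(answers)
    top, top_n = counts.most_common(1)[0]
    if sum(1 for n in counts.values() if n == top_n) > 1:  # tie for first place
        return False
    return top == reference  # naive exact-string match (an assumption)

# Sample 8 completions with the reported Maj@8 settings.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    num_return_sequences=8,
)
prompt_len = inputs["input_ids"].shape[1]
completions = [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
print(maj_at_8_correct(completions, reference="4"))  # 3x + 5 = 17 gives x = 4
```

Note that exact string matching is the simplest possible grader; equivalent answers written differently (e.g. `1/2` vs `0.5`) would need a normalization step that the README does not specify.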