Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination Paper • 2507.10532 • Published Jul 14 • 85
Pre-Trained Policy Discriminators are General Reward Models Paper • 2507.05197 • Published Jul 7 • 39
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric Paper • 2502.17184 • Published Feb 24 • 1