Update README.md
        README.md
    CHANGED
    
@@ -39,9 +39,10 @@ For more information about AceMath, check our [website](https://research.nvidia.
 ## Benchmark Results (AceMath-Instruct + AceMath-72B-RM)
 
 <p align="center">
-  <img src="
+  <img src="./acemath-pic.png" alt="AceMath Benchmark Results" width="800">
 </p>
 
+
 We compare AceMath to leading proprietary and open-access math models in above Table. Our AceMath-7B-Instruct, largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (Average pass@1: 67.2 vs. 62.9) on a variety of math reasoning benchmarks, while coming close to the performance of 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4) and Claude 3.5 Sonnet (65.6) by a margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks. This excludes OpenAI’s o1 model, which relies on scaled inference computation.
 
 
