Commit 46b6c6f (verified) by shenzhi-wang
1 Parent(s): 8f69a30

Update README.md

  1. README.md +40 -16
README.md CHANGED
@@ -87,22 +87,46 @@ print(response)
  🔒: Proprietary
 
  ### 3.1 Arena-Hard-Auto
- | | Score | 95% CIs |
- | --------------------------------- | -------- | ----------- |
- | **Xwen-72B-Chat** 🔑 | **86.1** | (-1.5, 1.7) |
- | Qwen2.5-72B-Chat 🔑 | 63.3 | (-2.5, 2.3) |
- | Athene-v2-Chat 🔑 | 72.1 | (-2.5, 2.5) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0 | (-2.8, 3.1) |
- | Llama-3.1-405B-Instruct-FP8 🔑 | 67.1 | (-2.2, 2.8) |
- | Claude-3-5-Sonnet-20241022 🔒 | **86.4** | (-1.3, 1.3) |
- | O1-Preview-2024-09-12 🔒 | 81.7 | (-2.2, 2.1) |
- | O1-Mini-2024-09-12 🔒 | 79.3 | (-2.8, 2.3) |
- | GPT-4-Turbo-2024-04-09 🔒 | 74.3 | (-2.4, 2.4) |
- | GPT-4-0125-Preview 🔒 | 73.6 | (-2.0, 2.0) |
- | GPT-4o-2024-08-06 🔒 | 71.1 | (-2.5, 2.0) |
- | Yi-Lightning 🔒 | 66.9 | (-3.3, 2.7) |
- | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
- | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
+
+ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
+
+ #### 3.1.1 No Style Control
+
+ | | Score | 95% CIs |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** 🔑 | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
+ | Qwen2.5-72B-Chat 🔑 | 78.0 | (-1.8, 1.8) |
+ | Athene-v2-Chat 🔑 | 85.0 | (-1.4, 1.7) |
+ | Llama-3.1-Nemotron-70B-Instruct 🔑 | 84.9 | (-1.7, 1.8) |
+ | Llama-3.1-405B-Instruct-FP8 🔑 | 69.3 | (-2.4, 2.2) |
+ | Claude-3-5-Sonnet-20241022 🔒 | 85.2 | (-1.4, 1.6) |
+ | O1-Preview-2024-09-12 🔒 | **92.0** (Top-1 among 🔒) | (-1.2, 1.0) |
+ | O1-Mini-2024-09-12 🔒 | 90.4 | (-1.1, 1.3) |
+ | GPT-4-Turbo-2024-04-09 🔒 | 82.6 | (-1.8, 1.5) |
+ | GPT-4-0125-Preview 🔒 | 78.0 | (-2.1, 2.4) |
+ | GPT-4o-2024-08-06 🔒 | 77.9 | (-2.0, 2.1) |
+ | Yi-Lightning 🔒 | 81.5 | (-1.6, 1.6) |
+ | Yi-Large 🔒 | 63.7 | (-2.6, 2.4) |
+ | GLM-4-0520 🔒 | 63.8 | (-2.9, 2.8) |
+
+ #### 3.1.2 Style Control
+
+ | | Score | 95% CIs |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** 🔑 | **72.4** (Top-1 among 🔑) | (-4.3, 4.1) |
+ | Qwen2.5-72B-Chat 🔑 | 63.3 | (-2.5, 2.3) |
+ | Athene-v2-Chat 🔑 | 72.1 | (-2.5, 2.5) |
+ | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0 | (-2.8, 3.1) |
+ | Llama-3.1-405B-Instruct-FP8 🔑 | 67.1 | (-2.2, 2.8) |
+ | Claude-3-5-Sonnet-20241022 🔒 | **86.4** (Top-1 among 🔒) | (-1.3, 1.3) |
+ | O1-Preview-2024-09-12 🔒 | 81.7 | (-2.2, 2.1) |
+ | O1-Mini-2024-09-12 🔒 | 79.3 | (-2.8, 2.3) |
+ | GPT-4-Turbo-2024-04-09 🔒 | 74.3 | (-2.4, 2.4) |
+ | GPT-4-0125-Preview 🔒 | 73.6 | (-2.0, 2.0) |
+ | GPT-4o-2024-08-06 🔒 | 71.1 | (-2.5, 2.0) |
+ | Yi-Lightning 🔒 | 66.9 | (-3.3, 2.7) |
+ | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
+ | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
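
As a reading aid for the `95% CIs` column in the tables above, the sketch below converts a score and its interval pair into absolute bounds and checks whether two intervals overlap. It assumes the pair is a (lower, upper) offset relative to the score, which is how the values appear to be reported; the `Result`, `bounds`, and `overlaps` names are hypothetical helpers and are not part of the README or of Arena-Hard-Auto. The three rows are copied from the "No Style Control" table.

```python
from typing import NamedTuple, Tuple


class Result(NamedTuple):
    """One table row. ci_low and ci_high are assumed to be offsets from score."""
    score: float
    ci_low: float   # negative offset, e.g. -1.5
    ci_high: float  # positive offset, e.g. 1.7


def bounds(r: Result) -> Tuple[float, float]:
    """Absolute interval bounds, rounded to the table's one-decimal precision."""
    return (round(r.score + r.ci_low, 1), round(r.score + r.ci_high, 1))


def overlaps(a: Result, b: Result) -> bool:
    """True if the two absolute intervals share at least one point."""
    a_lo, a_hi = bounds(a)
    b_lo, b_hi = bounds(b)
    return a_lo <= b_hi and b_lo <= a_hi


# Rows taken from the "No Style Control" table.
xwen = Result(86.1, -1.5, 1.7)
athene = Result(85.0, -1.4, 1.7)
o1_preview = Result(92.0, -1.2, 1.0)

print(bounds(xwen))                # (84.6, 87.8)
print(overlaps(xwen, athene))      # True
print(overlaps(xwen, o1_preview))  # False
```

Overlapping intervals suggest that the gap between two scores may fall within the benchmark's reported uncertainty, so nearby ranks are best read together with the interval column.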