Update README.md
README.md
CHANGED
@@ -3,7 +3,7 @@ license: mit
|
|
3 |
language:
|
4 |
- ru
|
5 |
base_model:
|
6 |
-
- t-tech/T-
|
7 |
pipeline_tag: text-generation
|
8 |
library_name: transformers
|
9 |
tags:
|
@@ -12,13 +12,13 @@ metrics:
|
|
12 |
- mae
|
13 |
- pearsonr
|
14 |
---
|
15 |
-
# pollux-judge-
|
16 |
|
17 |
<!-- Provide a quick summary of what the model is/does. -->
|
18 |
|
19 |

|
20 |
|
21 |
-
pollux-judge-
|
22 |
The model assesses answer quality given an input instruction, a specific criterion, and its rubrics, providing automated LLM performance evaluation for Russian-language tasks.
|
23 |
|
24 |
## Model Details
|
@@ -27,10 +27,10 @@ The model assesses answer quality given input instruction, specific criteria and
|
|
27 |
|
28 |
<!-- Provide a longer summary of what this model is. -->
|
29 |
|
30 |
-
pollux-judge-
|
31 |
At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.
|
32 |
|
33 |
-
Built upon the [t-tech/T-
|
34 |
The model is designed to predict both numerical scores and detailed textual rationales with separate heads based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.
|
35 |
|
36 |
While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).
|
@@ -39,7 +39,7 @@ While the model is technically capable of processing any type of instruction and
|
|
39 |
- **Model type:** decoder
|
40 |
- **Language(s) (NLP):** Russian
|
41 |
- **License:** MIT
|
42 |
-
- **Finetuned from model:** [t-tech/T-
|
43 |
|
44 |
### Model Sources
|
45 |
|
@@ -56,7 +56,7 @@ While the model is technically capable of processing any type of instruction and
|
|
56 |
|
57 |
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
58 |
|
59 |
-
pollux-judge-
|
60 |
The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.
|
61 |
|
62 |
|
@@ -74,7 +74,7 @@ For optimal performance and reliable results, users should structure each evalua
|
|
74 |
|
75 |
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
76 |
|
77 |
-
All content, responses, and outputs generated by pollux-judge-
|
78 |
Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").
|
79 |
|
80 |
The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
|
@@ -137,7 +137,7 @@ prompt = PROMPT_TEMPLATE.format(instruction=instruction,
|
|
137 |
criteria_name=criteria_name,
|
138 |
criteria_rubrics=criteria_rubrics)
|
139 |
|
140 |
-
MODEL_PATH = "ai-forever/pollux-judge-
|
141 |
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
|
142 |
model = AutoModelForCausalLM.from_pretrained(
|
143 |
MODEL_PATH,
|
@@ -227,7 +227,7 @@ Note this provides both in- and out-of-domain evaluation as some of the tasks an
|
|
227 |
|
228 |
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
|
229 |
|
230 |
-
We employed **Spearman’s rank correlation** with expert judgements and **Mean Absolute Error (MAE)** metrics alongside the Verdict Confidence to assess the performance of pollux-judge-
|
231 |
|
232 |
MAE offers a high degree of interpretability, as it is measured on the same scale as the annotation – specifically, in points.
|
233 |
Spearman’s rank correlation, on the other hand, quantifies the degree of monotonic association between the two rankings of model outputs and
|
@@ -245,42 +245,42 @@ For detailed evaluation results see Appendix D in the [preprint](https://arxiv.o
|
|
245 |
|
246 |
Spearman’s rank correlation:
|
247 |
|
248 |
-
| Model | pollux-judge-
|
249 |
| --- | --- | --- | --- | --- |
|
250 |
-
| Claude 3.5 Sonnet (2024-10-22) | 0.
|
251 |
-
| GPT-4o (2024-08-06) |
|
252 |
-
| GigaChat-Max (1.0.26.20) | 0.
|
253 |
-
| Llama-3.1-405B | 0.
|
254 |
-
| T-pro-it-1.0 |
|
255 |
-
| YaGPT-4-Pro (2024-10-23) | 0.
|
256 |
-
|o1 (2024-12-17) | 0.
|
257 |
-
| Avg. | 0.
|
258 |
|
259 |
MAE (MAE values are given in parentheses):
|
260 |
|
261 |
-
| Model | pollux-judge-
|
262 |
| --- | --- | --- | --- | --- |
|
263 |
-
| Claude 3.5 Sonnet (2024-10-22) | 0.
|
264 |
-
| GPT-4o (2024-08-06) | 0.
|
265 |
-
| GigaChat-Max (1.0.26.20) | 0.
|
266 |
-
| Llama-3.1-405B | 0.
|
267 |
-
| T-pro-it-1.0 | 0.
|
268 |
-
| YaGPT-4-Pro (2024-10-23) | 0.
|
269 |
-
|o1 (2024-12-17) | 0.
|
270 |
-
| Avg. | 0.
|
271 |
|
272 |
Verdict Confidence (calculated on the whole test sample):
|
273 |
|
274 |
-
| Model | pollux-judge-
|
275 |
| --- | --- | --- | --- | --- |
|
276 |
-
| Claude 3.5 Sonnet (2024-10-22) | 0.
|
277 |
-
| GPT-4o (2024-08-06) | 0.
|
278 |
-
| GigaChat-Max (1.0.26.20) | 0.
|
279 |
-
| Llama-3.1-405B | 0.
|
280 |
-
| T-pro-it-1.0 | 0.
|
281 |
-
| YaGPT-4-Pro (2024-10-23) | 0.
|
282 |
-
|o1 (2024-12-17) | 0.
|
283 |
-
| Avg. | 0.
|
284 |
|
285 |
|
286 |
## Technical Specifications
|
|
|
3 |
language:
|
4 |
- ru
|
5 |
base_model:
|
6 |
+
- t-tech/T-pro-it-1.0
|
7 |
pipeline_tag: text-generation
|
8 |
library_name: transformers
|
9 |
tags:
|
|
|
12 |
- mae
|
13 |
- pearsonr
|
14 |
---
|
15 |
+
# pollux-judge-32b-r
|
16 |
|
17 |
<!-- Provide a quick summary of what the model is/does. -->
|
18 |
|
19 |

|
20 |
|
21 |
+
pollux-judge-32b-r is a 32-billion-parameter generative language model designed to evaluate the quality of other language models' responses in Russian.
|
22 |
The model assesses answer quality given an input instruction, a specific criterion, and its rubrics, providing automated LLM performance evaluation for Russian-language tasks.
|
23 |
|
24 |
## Model Details
|
|
|
27 |
|
28 |
<!-- Provide a longer summary of what this model is. -->
|
29 |
|
30 |
+
pollux-judge-32b-r is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
|
31 |
At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.
|
32 |
|
33 |
+
Built upon the [t-tech/T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0) architecture, pollux-judge-32b-r is a decoder-based, 32-billion-parameter model trained with a combination of Mean Squared Error (for the regression head) and Cross-Entropy (for the language modeling head) objectives.
|
34 |
The model is designed to predict both numerical scores and detailed textual rationales with separate heads based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.
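
As a rough illustration of the two-objective training described above, the sketch below combines a squared-error term for the score head with a token-level cross-entropy term for the rationale head. The function names, the plain weighted sum, and the `mse_weight` parameter are assumptions made for this example; the model card only states that the two objectives are combined, not how.

```python
import math


def mse_loss(predicted_score, target_score):
    """Squared error between the regression head's score and the expert score."""
    return (predicted_score - target_score) ** 2


def cross_entropy_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)


def combined_loss(predicted_score, target_score, token_logits, target_ids,
                  mse_weight=1.0):
    """Hypothetical combination: weighted MSE plus cross-entropy."""
    return (mse_weight * mse_loss(predicted_score, target_score)
            + cross_entropy_loss(token_logits, target_ids))


# Toy example: one predicted score and two rationale tokens over a 2-token vocabulary.
loss = combined_loss(
    predicted_score=1.5,
    target_score=2.0,
    token_logits=[[0.0, 0.0], [2.0, 0.0]],
    target_ids=[0, 0],
)
print(f"combined loss: {loss:.4f}")
```

In a real training setup these terms would be computed on batched tensors by a deep-learning framework; the pure-Python version is only meant to show how the two heads contribute to a single scalar loss.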
|
35 |
|
36 |
While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).
|
|
|
39 |
- **Model type:** decoder
|
40 |
- **Language(s) (NLP):** Russian
|
41 |
- **License:** MIT
|
42 |
+
- **Finetuned from model:** [t-tech/T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0)
|
43 |
|
44 |
### Model Sources
|
45 |
|
|
|
56 |
|
57 |
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
58 |
|
59 |
+
pollux-judge-32b-r is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
|
60 |
The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.
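
The requirement that every evaluation run carries all four components can be enforced before a prompt is built. The sketch below is one way to do that. `JudgeRequest`, the `answer` field name, and the template text are hypothetical stand-ins; the actual `PROMPT_TEMPLATE` shipped with the model card is not reproduced here and should be used instead.

```python
from dataclasses import dataclass

# Hypothetical template for illustration only; substitute the real PROMPT_TEMPLATE.
PROMPT_TEMPLATE = (
    "### Instruction\n{instruction}\n\n"
    "### Answer\n{answer}\n\n"
    "### Criterion\n{criteria_name}\n\n"
    "### Rubrics\n{criteria_rubrics}"
)


@dataclass
class JudgeRequest:
    """One evaluation run: a single response judged against a single criterion."""
    instruction: str
    answer: str
    criteria_name: str
    criteria_rubrics: str

    def to_prompt(self) -> str:
        fields = vars(self)
        # Refuse to build a prompt if any required component is empty.
        missing = [name for name, value in fields.items() if not value.strip()]
        if missing:
            raise ValueError(f"Missing required components: {', '.join(missing)}")
        return PROMPT_TEMPLATE.format(**fields)


request = JudgeRequest(
    instruction="Summarize the text in one sentence.",
    answer="The text describes a judge model for Russian-language tasks.",
    criteria_name="Conciseness",
    criteria_rubrics="0: verbose; 1: acceptable; 2: concise",
)
print(request.to_prompt())
```

Evaluating several criteria for the same response then means building one `JudgeRequest` per criterion rather than packing them into a single prompt.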
|
61 |
|
62 |
|
|
|
74 |
|
75 |
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
76 |
|
77 |
+
All content, responses, and outputs generated by pollux-judge-32b-r (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
|
78 |
Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").
|
79 |
|
80 |
The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
|
|
|
137 |
criteria_name=criteria_name,
|
138 |
criteria_rubrics=criteria_rubrics)
|
139 |
|
140 |
+
MODEL_PATH = "ai-forever/pollux-judge-32b-r"
|
141 |
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
|
142 |
model = AutoModelForCausalLM.from_pretrained(
|
143 |
MODEL_PATH,
|
|
|
227 |
|
228 |
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
|
229 |
|
230 |
+
We employed **Spearman’s rank correlation** with expert judgements and **Mean Absolute Error (MAE)** metrics alongside the Verdict Confidence to assess the performance of pollux-judge-32b-r and compare it with that of the reference models.
|
231 |
|
232 |
MAE offers a high degree of interpretability, as it is measured on the same scale as the annotation – specifically, in points.
|
233 |
Spearman’s rank correlation, on the other hand, quantifies the degree of monotonic association between the two rankings of model outputs and
|
|
|
245 |
|
246 |
Spearman’s rank correlation:
|
247 |
|
+| Model | pollux-judge-32b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
 | --- | --- | --- | --- | --- |
+| Claude 3.5 Sonnet (2024-10-22) | 0.642 | 0.739 | -0.006 | 0.759 |
+| GPT-4o (2024-08-06) | 0.564 | 0.627 | -0.033 | 0.643 |
+| GigaChat-Max (1.0.26.20) | 0.573 | 0.640 | 0.027 | 0.649 |
+| Llama-3.1-405B | 0.570 | 0.591 | 0.022 | 0.639 |
+| T-pro-it-1.0 | 0.526 | 0.573 | -0.044 | 0.616 |
+| YaGPT-4-Pro (2024-10-23) | 0.583 | 0.635 | 0.099 | 0.671 |
+| o1 (2024-12-17) | 0.654 | 0.748 | -0.022 | 0.771 |
+| Avg. | 0.589 | 0.647 | 0.019 | 0.674 |
|
258 |
|
259 |
MAE (MAE values are given in parentheses):
|
260 |
|
+| Model | pollux-judge-32b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
 | --- | --- | --- | --- | --- |
+| Claude 3.5 Sonnet (2024-10-22) | 0.487 | 0.245 | 2.697 | 0.236 |
+| GPT-4o (2024-08-06) | 0.466 | 0.349 | 2.676 | 0.339 |
+| GigaChat-Max (1.0.26.20) | 0.460 | 0.350 | 2.468 | 0.342 |
+| Llama-3.1-405B | 0.508 | 0.448 | 1.912 | 0.405 |
+| T-pro-it-1.0 | 0.492 | 0.475 | 2.978 | 0.425 |
+| YaGPT-4-Pro (2024-10-23) | 0.497 | 0.387 | 1.793 | 0.369 |
+| o1 (2024-12-17) | 0.448 | 0.244 | 2.873 | 0.229 |
+| Avg. | 0.479 | 0.356 | 2.487 | 0.335 |
|
271 |
|
272 |
Verdict Confidence (calculated on the whole test sample):
|
273 |
|
+| Model | pollux-judge-32b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
 | --- | --- | --- | --- | --- |
+| Claude 3.5 Sonnet (2024-10-22) | 0.806 | 0.879 | 0.645 | 0.877 |
+| GPT-4o (2024-08-06) | 0.825 | 0.877 | 0.702 | 0.877 |
+| GigaChat-Max (1.0.26.20) | 0.828 | 0.878 | 0.715 | 0.879 |
+| Llama-3.1-405B | 0.778 | 0.836 | 0.684 | 0.837 |
+| T-pro-it-1.0 | 0.793 | 0.838 | 0.644 | 0.842 |
+| YaGPT-4-Pro (2024-10-23) | 0.815 | 0.866 | 0.738 | 0.867 |
+| o1 (2024-12-17) | 0.822 | 0.885 | 0.643 | 0.882 |
+| Avg. | 0.811 | 0.866 | 0.684 | 0.867 |
|
284 |
|
285 |
|
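
To make the two agreement metrics reported above concrete, here is a minimal, dependency-free sketch of how Spearman's rank correlation and MAE could be computed between a judge's scores and expert scores. The function names and the toy scores are illustrative assumptions; this is not the evaluation code that produced the tables.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mae(predicted, gold):
    """Mean absolute error, in the same units as the scores (points)."""
    return mean(abs(p - g) for p, g in zip(predicted, gold))


# Toy data (hypothetical): scores on a 0-2 point scale.
expert_scores = [0, 1, 2, 2, 1, 0]
judge_scores = [0, 1, 1, 2, 2, 0]

print(f"Spearman: {spearman(judge_scores, expert_scores):.3f}")
print(f"MAE: {mae(judge_scores, expert_scores):.3f}")
```

Higher Spearman values and lower MAE values both indicate closer agreement with the expert annotation, which is how the columns in the tables above should be read.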
286 |
## Technical Specifications
|