ai-forever committed
Commit ecb9792 · verified · 1 Parent(s): 26fe676

Update README.md

Files changed (1)
  1. README.md +37 -37
README.md CHANGED
@@ -3,7 +3,7 @@ license: mit
 language:
 - ru
 base_model:
-- t-tech/T-lite-it-1.0
+- t-tech/T-pro-it-1.0
 pipeline_tag: text-generation
 library_name: transformers
 tags:
@@ -12,13 +12,13 @@ metrics:
 - mae
 - pearsonr
 ---
-# pollux-judge-7b-r
+# pollux-judge-32b-r

 <!-- Provide a quick summary of what the model is/does. -->

 ![banner](images/logo_pollux_horiz_short_WHITEBG.png)

-pollux-judge-7b-r is a 7-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
+pollux-judge-32b-r is a 32-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
 The model assesses answer quality given input instruction, specific criteria and rubrics, providing automated LLM performance evaluation for Russian-language tasks.

 ## Model Details
@@ -27,10 +27,10 @@ The model assesses answer quality given input instruction, specific criteria and

 <!-- Provide a longer summary of what this model is. -->

-pollux-judge-7b-r is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
+pollux-judge-32b-r is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
 At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

-Built upon the [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) architecture, pollux-judge-7b-r is a decoder-based 7 billion parameter model trained with a combination of Mean Square Error (for regression head) and Cross-Entropy (for language modeling head) objectives.
+Built upon the [t-tech/T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0) architecture, pollux-judge-32b-r is a decoder-based 32 billion parameter model trained with a combination of Mean Square Error (for regression head) and Cross-Entropy (for language modeling head) objectives.
 The model is designed to predict both numerical scores and detailed textual rationales with separate heads based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

 While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).
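For orientation, the combined objective described in the hunk above can be written schematically as a weighted sum of the two per-head losses; the weighting factor $\lambda$ is an assumption, as the card does not state how the two terms are balanced:

$$
\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\big(\text{rationale tokens}\big) + \lambda \, \mathcal{L}_{\mathrm{MSE}}\big(\hat{s},\, s\big)
$$

where $\hat{s}$ is the score produced by the regression head and $s$ is the reference (expert) score.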
@@ -39,7 +39,7 @@ While the model is technically capable of processing any type of instruction and
 - **Model type:** decoder
 - **Language(s) (NLP):** Russian
 - **License:** MIT
-- **Finetuned from model:** [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0)
+- **Finetuned from model:** [t-tech/T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0)

 ### Model Sources

@@ -56,7 +56,7 @@ While the model is technically capable of processing any type of instruction and

 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

-pollux-judge-7b-r is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
+pollux-judge-32b-r is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
 The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.


@@ -74,7 +74,7 @@ For optimal performance and reliable results, users should structure each evalua

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->

-All content, responses, and outputs generated by pollux-judge-7b-r (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
+All content, responses, and outputs generated by pollux-judge-32b-r (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
 Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

 The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
@@ -137,7 +137,7 @@ prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                 criteria_name=criteria_name,
                                 criteria_rubrics=criteria_rubrics)

-MODEL_PATH = "ai-forever/pollux-judge-7b-r"
+MODEL_PATH = "ai-forever/pollux-judge-32b-r"
 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
 model = AutoModelForCausalLM.from_pretrained(
     MODEL_PATH,
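For readers who want to run the snippet touched by this hunk end to end, here is a minimal, self-contained sketch of one evaluation call. The instruction, answer, rubrics and `PROMPT_TEMPLATE` below are hypothetical stand-ins for the card's own template and fields (the `answer` field name and the chat-template wrapping are assumptions), and `bfloat16` with greedy decoding are illustrative choices rather than settings prescribed by the card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "ai-forever/pollux-judge-32b-r"

# Hypothetical Russian-language inputs for a single-criterion evaluation run.
instruction = "Напишите короткое официальное письмо с просьбой перенести встречу."
answer = "Уважаемые коллеги! Прошу перенести нашу встречу на следующий вторник."
criteria_name = "Соблюдение официально-делового стиля"
criteria_rubrics = "0: стиль не соблюдён\n1: стиль соблюдён частично\n2: стиль соблюдён полностью"

# Assumed stand-in for the card's PROMPT_TEMPLATE; the real template defines these fields itself.
PROMPT_TEMPLATE = (
    "Инструкция: {instruction}\nОтвет: {answer}\n"
    "Критерий: {criteria_name}\nШкала: {criteria_rubrics}"
)
prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                answer=answer,
                                criteria_name=criteria_name,
                                criteria_rubrics=criteria_rubrics)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # illustrative; choose a dtype/device setup that fits your hardware
    device_map="auto",
)

# Wrap the prompt as a single user turn and generate the judge's verdict.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens: the score and its textual rationale.
print(tokenizer.decode(output_ids[0, inputs.shape[-1]:], skip_special_tokens=True))
```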
@@ -227,7 +227,7 @@ Note this provides both in- and out-of-domain evaluation as some of the tasks an

 <!-- These are the evaluation metrics being used, ideally with a description of why. -->

-We employed **Spearman’s rank correlation** with expert judgements and **Mean Absolute Error (MAE)** metrics alongside the Verdict Confidence to assess the performance of pollux-judge-7b and compare it with those of the reference models.
+We employed **Spearman’s rank correlation** with expert judgements and **Mean Absolute Error (MAE)** metrics alongside the Verdict Confidence to assess the performance of pollux-judge-32b-r and compare it with those of the reference models.

 MAE offers a high degree of interpretability, as it is measured on the same scale as the annotation – specifically, in points.
 On the other hand, Spearman’s rank correlation allows to quantify the degree of monotonic association between the two rankings of models outputs and
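As a concrete reading of these two metrics, here is a small sketch using numpy and scipy on paired expert and judge scores; the score vectors are invented for illustration only.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical paired annotations for the same set of (instruction, answer, criterion) items.
expert_scores = np.array([2, 0, 1, 2, 1, 0, 2])  # expert judgements, in points
judge_scores = np.array([2, 1, 1, 2, 0, 0, 2])   # scores predicted by the judge model

rho, p_value = spearmanr(expert_scores, judge_scores)       # monotonic association of the two rankings
mae = float(np.mean(np.abs(expert_scores - judge_scores)))  # average error on the annotation scale

print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3g}), MAE = {mae:.3f} points")
```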
@@ -245,42 +245,42 @@ For detailed evaluation results see Appendix D in the [preprint](https://arxiv.o

 Spearman’s rank correlation:

-| Model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20)|
+| Model | pollux-judge-32b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20)|
 | --- | --- | --- | --- | --- |
-| Claude 3.5 Sonnet (2024-10-22) | 0.653 | 0.739 | -0.006 | 0.759 |
-| GPT-4o (2024-08-06) | 0.572 | 0.627 | -0.033 | 0.643 |
-| GigaChat-Max (1.0.26.20) | 0.582 | 0.640 | 0.027 | 0.649 |
-| Llama-3.1-405B | 0.587 | 0.591 | 0.022 | 0.639 |
-| T-pro-it-1.0 | 0.543 | 0.573 | -0.044 | 0.616 |
-| YaGPT-4-Pro (2024-10-23) | 0.599 | 0.635 | 0.099 | 0.671 |
-|o1 (2024-12-17) | 0.674 | 0.748 | -0.022 | 0.771 |
-| Avg. | 0.602 | 0.647 | 0.019 | 0.674 |
+| Claude 3.5 Sonnet (2024-10-22) | 0.642 | 0.739 | -0.006 | 0.759 |
+| GPT-4o (2024-08-06) | 0.564 | 0.627 | -0.033 | 0.643 |
+| GigaChat-Max (1.0.26.20) | 0.573 | 0.640 | 0.027 | 0.649 |
+| Llama-3.1-405B | 0.570 | 0.591 | 0.022 | 0.639 |
+| T-pro-it-1.0 | 0.526 | 0.573 | -0.044 | 0.616 |
+| YaGPT-4-Pro (2024-10-23) | 0.583 | 0.635 | 0.099 | 0.671 |
+|o1 (2024-12-17) | 0.654 | 0.748 | -0.022 | 0.771 |
+| Avg. | 0.589 | 0.647 | 0.019 | 0.674 |

 MAE (MAE values are given in parenthesis):

-| Model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20)|
+| Model | pollux-judge-32b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20)|
 | --- | --- | --- | --- | --- |
-| Claude 3.5 Sonnet (2024-10-22) | 0.519 | 0.245 | 2.697 | 0.236 |
-| GPT-4o (2024-08-06) | 0.489 | 0.349 | 2.676 | 0.339 |
-| GigaChat-Max (1.0.26.20) | 0.478 | 0.350 | 2.468 | 0.342 |
-| Llama-3.1-405B | 0.513 | 0.448 | 1.912 | 0.405 |
-| T-pro-it-1.0 | 0.503 | 0.475 | 2.978 | 0.425 |
-| YaGPT-4-Pro (2024-10-23) | 0.495 | 0.387 | 1.793 | 0.369 |
-|o1 (2024-12-17) | 0.460 | 0.244 | 2.873 | 0.229 |
-| Avg. | 0.494 | 0.356 | 2.487 | 0.335 |
+| Claude 3.5 Sonnet (2024-10-22) | 0.487 | 0.245 | 2.697 | 0.236 |
+| GPT-4o (2024-08-06) | 0.466 | 0.349 | 2.676 | 0.339 |
+| GigaChat-Max (1.0.26.20) | 0.460 | 0.350 | 2.468 | 0.342 |
+| Llama-3.1-405B | 0.508 | 0.448 | 1.912 | 0.405 |
+| T-pro-it-1.0 | 0.492 | 0.475 | 2.978 | 0.425 |
+| YaGPT-4-Pro (2024-10-23) | 0.497 | 0.387 | 1.793 | 0.369 |
+|o1 (2024-12-17) | 0.448 | 0.244 | 2.873 | 0.229 |
+| Avg. | 0.479 | 0.356 | 2.487 | 0.335 |

 Verdict Confidence (calculated on the whole test sample):

-| Model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20)|
+| Model | pollux-judge-32b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20)|
 | --- | --- | --- | --- | --- |
-| Claude 3.5 Sonnet (2024-10-22) | 0.795 | 0.879 | 0.645 | 0.877 |
-| GPT-4o (2024-08-06) | 0.820 | 0.877 | 0.702 | 0.877 |
-| GigaChat-Max (1.0.26.20) | 0.824 | 0.878 | 0.715 | 0.879 |
-| Llama-3.1-405B | 0.777 | 0.836 | 0.684 | 0.837 |
-| T-pro-it-1.0 | 0.787 | 0.838 | 0.644 | 0.842 |
-| YaGPT-4-Pro (2024-10-23) | 0.814 | 0.866 | 0.738 | 0.867 |
-|o1 (2024-12-17) | 0.814 | 0.885 | 0.643 | 0.882 |
-| Avg. | 0.806 | 0.866 | 0.684 | 0.867 |
+| Claude 3.5 Sonnet (2024-10-22) | 0.806 | 0.879 | 0.645 | 0.877 |
+| GPT-4o (2024-08-06) | 0.825 | 0.877 | 0.702 | 0.877 |
+| GigaChat-Max (1.0.26.20) | 0.828 | 0.878 | 0.715 | 0.879 |
+| Llama-3.1-405B | 0.778 | 0.836 | 0.684 | 0.837 |
+| T-pro-it-1.0 | 0.793 | 0.838 | 0.644 | 0.842 |
+| YaGPT-4-Pro (2024-10-23) | 0.815 | 0.866 | 0.738 | 0.867 |
+|o1 (2024-12-17) | 0.822 | 0.885 | 0.643 | 0.882 |
+| Avg. | 0.811 | 0.866 | 0.684 | 0.867 |


 ## Technical Specifications
 