ai-forever committed on
Commit d0f6db6 · verified · 1 Parent(s): b384085

Update README.md

Files changed (1):
  1. README.md +37 -41

README.md CHANGED
@@ -9,13 +9,13 @@ library_name: transformers
  tags:
  - pytorch
  ---
- # pollux-judge-7b

  <!-- Provide a quick summary of what the model is/does. -->

  ![banner](images/logo_pollux_horiz_short_WHITEBG.png)

- pollux-judge-7b is a 7-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
  The model assesses answer quality given input instruction, specific criteria and rubrics, providing automated LLM performance evaluation for Russian-language tasks.

  ## Model Details
@@ -24,16 +24,11 @@ The model assesses answer quality given input instruction, specific criteria and

  <!-- Provide a longer summary of what this model is. -->

- pollux-judge-7b is a part of the POLLUX project, which is dedicated to evaluation of the generative capabilities of Large Language Models (LLMs).
- Part of this project is the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX) that introduces taxonomies of both generative tasks and evaluation criteria alongside quantitative and qualitative estimation of top-tier LLMs' responses.
- pollux-judge-7b is a decoder model based on [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) and trained sequence-to-sequence to predict a numerical score and textual rationale based on the given input instruction, LLM's answer, particular criterion, rubrics and reference answer if any.
- The model technically works for any type of instruction and criterion formatted appropriately, but has been trained on the instructions and criteria from the taxonomies of tasks and criteria from the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).
-
- pollux-judge-7b is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
  At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

- Built upon the [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) architecture, pollux-judge-7b is a decoder-based 7 billion parameter model trained in a sequence-to-sequence fashion.
- The model is designed to predict both numerical scores and detailed textual rationales based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

  While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).

@@ -58,7 +53,7 @@ While the model is technically capable of processing any type of instruction and

  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- pollux-judge-7b is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
  The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.


@@ -76,7 +71,7 @@ For optimal performance and reliable results, users should structure each evalua

  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- All content, responses, and outputs generated by pollux-judge-7b (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
  Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

  The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
@@ -139,7 +134,7 @@ prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                  criteria_name=criteria_name,
                                  criteria_rubrics=criteria_rubrics)

- MODEL_PATH = "ai-forever/pollux-judge-7b"
  tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
  model = AutoModelForCausalLM.from_pretrained(
      MODEL_PATH,
@@ -198,9 +193,10 @@ From this dataset, we performed stratified random sampling across tasks to obtai

  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- The model was trained in sequence-to-sequence fashion.
- Input includes source instruction, LLM's answer, name of criterion, its rubrics and reference answer if present.
- The output is expected to be numerical score from provided rubrics and textual explanation.


  #### Training Hyperparameters

@@ -246,42 +242,42 @@ For detailed evaluation results see Appendix D in the [preprint](https://arxiv.o

  Spearman’s rank correlation:

- | Evaluated model | POLLUX 7B (regression) | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
- | Claude 3.5 Sonnet (2024-10-22) | 0.660 | 0.739 | -0.006 | 0.759 |
- | GPT-4o (2024-08-06) | 0.596 | 0.627 | -0.033 | 0.643 |
- | GigaChat-Max (1.0.26.20) | 0.596 | 0.640 | 0.027 | 0.649 |
- | Llama-3.1-405B | 0.613 | 0.591 | 0.022 | 0.639 |
- | T-pro-it-1.0 | 0.571 | 0.573 | -0.044 | 0.616 |
- | YaGPT-4-Pro (2024-10-23) | 0.616 | 0.635 | 0.099 | 0.671 |
- | o1 (2024-12-17) | 0.675 | 0.748 | -0.022 | 0.771 |
- | Avg. | 0.619 | 0.647 | 0.019 | 0.674 |

  MAE (mean absolute error):

- | Evaluated model | POLLUX 7B (regression) | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
- | Claude 3.5 Sonnet (2024-10-22) | 0.501 | 0.245 | 2.697 | 0.236 |
- | GPT-4o (2024-08-06) | 0.484 | 0.349 | 2.676 | 0.339 |
- | GigaChat-Max (1.0.26.20) | 0.477 | 0.350 | 2.468 | 0.342 |
- | Llama-3.1-405B | 0.517 | 0.448 | 1.912 | 0.405 |
- | T-pro-it-1.0 | 0.497 | 0.475 | 2.978 | 0.425 |
- | YaGPT-4-Pro (2024-10-23) | 0.511 | 0.387 | 1.793 | 0.369 |
- | o1 (2024-12-17) | 0.438 | 0.244 | 2.873 | 0.229 |
- | Avg. | 0.489 | 0.356 | 2.487 | 0.335 |

  Verdict Confidence (calculated on the whole test sample):

- | Evaluated model | POLLUX 7B (regression) | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
- | Claude 3.5 Sonnet (2024-10-22) | 0.800 | 0.879 | 0.645 | 0.877 |
- | GPT-4o (2024-08-06) | 0.822 | 0.877 | 0.702 | 0.877 |
  | GigaChat-Max (1.0.26.20) | 0.824 | 0.878 | 0.715 | 0.879 |
  | Llama-3.1-405B | 0.777 | 0.836 | 0.684 | 0.837 |
- | T-pro-it-1.0 | 0.791 | 0.838 | 0.644 | 0.842 |
- | YaGPT-4-Pro (2024-10-23) | 0.813 | 0.866 | 0.738 | 0.867 |
- | o1 (2024-12-17) | 0.821 | 0.885 | 0.643 | 0.882 |
- | Avg. | 0.808 | 0.866 | 0.684 | 0.867 |


  ## Technical Specifications [optional]
 
@@ -9,13 +9,13 @@ library_name: transformers
  tags:
  - pytorch
  ---
+ # pollux-judge-7b-r

  <!-- Provide a quick summary of what the model is/does. -->

  ![banner](images/logo_pollux_horiz_short_WHITEBG.png)

+ pollux-judge-7b-r is a 7-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
  The model assesses answer quality given input instruction, specific criteria and rubrics, providing automated LLM performance evaluation for Russian-language tasks.

  ## Model Details

@@ -24,16 +24,11 @@ The model assesses answer quality given input instruction, specific criteria and

  <!-- Provide a longer summary of what this model is. -->

+ pollux-judge-7b-r is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
  At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

+ Built upon the [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) architecture, pollux-judge-7b-r is a decoder-based 7-billion-parameter model trained with a combination of Mean Squared Error (for the regression head) and Cross-Entropy (for the language modeling head) objectives.
+ The model is designed to predict both numerical scores and detailed textual rationales with separate heads, based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

  While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).

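The dual-head setup described in the hunk above can be pictured with a short sketch. This is a hedged illustration only, not the released modeling code: the class name, pooling strategy, and head placement are assumptions made for the sake of the example.

```python
# Illustrative sketch: a decoder backbone whose LM head generates the rationale,
# plus an extra regression head that maps a pooled hidden state to a scalar score.
# Names and pooling choices here are hypothetical, not pollux-judge-7b-r internals.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class DualHeadJudge(nn.Module):
    def __init__(self, backbone_name: str = "t-tech/T-lite-it-1.0"):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        self.score_head = nn.Linear(hidden_size, 1)  # regression head

    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1]               # (batch, seq, hidden)
        # Pool the representation of the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = last_hidden[torch.arange(last_hidden.size(0)), last_idx]
        score = self.score_head(pooled).squeeze(-1)       # numerical score per example
        return out.logits, score                          # logits -> CE loss, score -> MSE loss
```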
@@ -58,7 +53,7 @@ While the model is technically capable of processing any type of instruction and

  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

+ pollux-judge-7b-r is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
  The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.


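For illustration, a single evaluation request under these requirements could be collected as follows. The field values are invented, the `answer` key is an assumed name, and the actual prompt formatting is handled by the card's own `PROMPT_TEMPLATE` (visible in the usage snippet further down in this diff).

```python
# Hypothetical single-criterion evaluation request (all field contents invented).
evaluation_request = {
    # Source instruction given to the evaluated LLM (Russian-language task).
    "instruction": "Напишите короткое официальное письмо с просьбой перенести встречу.",
    # Response to be evaluated; key name is an assumption for this sketch.
    "answer": "Уважаемый Иван Иванович! Прошу перенести нашу встречу на четверг.",
    # One predefined criterion per evaluation run, with its scoring rubrics.
    "criteria_name": "Соблюдение делового стиля",
    "criteria_rubrics": (
        "0 — деловой стиль не соблюдается;\n"
        "1 — стиль соблюдается частично;\n"
        "2 — стиль полностью соответствует деловому письму."
    ),
}
```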
@@ -76,7 +71,7 @@ For optimal performance and reliable results, users should structure each evalua

  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

+ All content, responses, and outputs generated by pollux-judge-7b-r (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
  Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

  The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
 
@@ -139,7 +134,7 @@ prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                  criteria_name=criteria_name,
                                  criteria_rubrics=criteria_rubrics)

+ MODEL_PATH = "ai-forever/pollux-judge-7b-r"
  tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
  model = AutoModelForCausalLM.from_pretrained(
      MODEL_PATH,
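The snippet above is cut off by the hunk boundary. Below is a hedged sketch of how loading and generation might continue; the dtype, device placement, and generation settings are assumptions, not the card's exact code.

```python
# Hedged continuation sketch: finish loading, run the prompt, decode the verdict.
import torch

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,   # assumed precision
    device_map="auto",            # assumed device placement
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the score and textual rationale).
verdict = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(verdict)
```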
 
@@ -198,9 +193,10 @@ From this dataset, we performed stratified random sampling across tasks to obtai

  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

+
+ The input for the LLM-as-a-Judge model includes the source instruction, the LLM's answer, the name of the criterion, its rubrics, and a reference answer if present.
+ A separate regression head predicts the numerical score, while the language modeling head generates the textual comment.
+ The total loss is the sum of the MSE and CE objectives.

  #### Training Hyperparameters

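As a schematic illustration of that objective, the combined loss could look like the following; the function and tensor names are assumptions, not the project's training code.

```python
# Schematic combined loss: Cross-Entropy over the rationale tokens plus
# Mean Squared Error on the predicted score (label shifting omitted for brevity).
import torch
import torch.nn.functional as F

def judge_loss(lm_logits, rationale_labels, predicted_score, target_score):
    # CE over the vocabulary for the textual comment; -100 positions (e.g. the prompt) are ignored.
    ce = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)),
        rationale_labels.view(-1),
        ignore_index=-100,
    )
    # MSE between the regression head's output and the target rubric score.
    mse = F.mse_loss(predicted_score, target_score.float())
    # The card states the total loss is the sum of the two objectives.
    return ce + mse
```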
 
@@ -246,42 +242,42 @@ For detailed evaluation results see Appendix D in the [preprint](https://arxiv.o

  Spearman’s rank correlation:

+ | Evaluated model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
+ | Claude 3.5 Sonnet (2024-10-22) | 0.653 | 0.739 | -0.006 | 0.759 |
+ | GPT-4o (2024-08-06) | 0.572 | 0.627 | -0.033 | 0.643 |
+ | GigaChat-Max (1.0.26.20) | 0.582 | 0.640 | 0.027 | 0.649 |
+ | Llama-3.1-405B | 0.587 | 0.591 | 0.022 | 0.639 |
+ | T-pro-it-1.0 | 0.543 | 0.573 | -0.044 | 0.616 |
+ | YaGPT-4-Pro (2024-10-23) | 0.599 | 0.635 | 0.099 | 0.671 |
+ | o1 (2024-12-17) | 0.674 | 0.748 | -0.022 | 0.771 |
+ | Avg. | 0.602 | 0.647 | 0.019 | 0.674 |

  MAE (mean absolute error):

+ | Evaluated model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
+ | Claude 3.5 Sonnet (2024-10-22) | 0.519 | 0.245 | 2.697 | 0.236 |
+ | GPT-4o (2024-08-06) | 0.489 | 0.349 | 2.676 | 0.339 |
+ | GigaChat-Max (1.0.26.20) | 0.478 | 0.350 | 2.468 | 0.342 |
+ | Llama-3.1-405B | 0.513 | 0.448 | 1.912 | 0.405 |
+ | T-pro-it-1.0 | 0.503 | 0.475 | 2.978 | 0.425 |
+ | YaGPT-4-Pro (2024-10-23) | 0.495 | 0.387 | 1.793 | 0.369 |
+ | o1 (2024-12-17) | 0.460 | 0.244 | 2.873 | 0.229 |
+ | Avg. | 0.494 | 0.356 | 2.487 | 0.335 |

  Verdict Confidence (calculated on the whole test sample):

+ | Evaluated model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
+ | Claude 3.5 Sonnet (2024-10-22) | 0.795 | 0.879 | 0.645 | 0.877 |
+ | GPT-4o (2024-08-06) | 0.820 | 0.877 | 0.702 | 0.877 |
  | GigaChat-Max (1.0.26.20) | 0.824 | 0.878 | 0.715 | 0.879 |
  | Llama-3.1-405B | 0.777 | 0.836 | 0.684 | 0.837 |
+ | T-pro-it-1.0 | 0.787 | 0.838 | 0.644 | 0.842 |
+ | YaGPT-4-Pro (2024-10-23) | 0.814 | 0.866 | 0.738 | 0.867 |
+ | o1 (2024-12-17) | 0.814 | 0.885 | 0.643 | 0.882 |
+ | Avg. | 0.806 | 0.866 | 0.684 | 0.867 |

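For reference, the first two agreement metrics reported above can be computed as follows, assuming per-example judge scores and expert reference scores are available as arrays; Verdict Confidence is a POLLUX-specific measure and is not reproduced here.

```python
# Hedged illustration of the Spearman and MAE agreement metrics (example data invented).
import numpy as np
from scipy.stats import spearmanr

judge_scores = np.array([2.0, 1.0, 0.0, 2.0, 1.0])    # assumed judge outputs
expert_scores = np.array([2.0, 1.0, 1.0, 2.0, 0.0])   # assumed expert annotations

rho, _ = spearmanr(judge_scores, expert_scores)        # Spearman’s rank correlation
mae = np.abs(judge_scores - expert_scores).mean()      # mean absolute error

print(f"Spearman: {rho:.3f}, MAE: {mae:.3f}")
```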
  ## Technical Specifications [optional]