ai-forever committed on
Commit d0f6db6 · verified · 1 Parent(s): b384085

Update README.md

Files changed (1):
  1. README.md +37 -41

README.md CHANGED
@@ -9,13 +9,13 @@ library_name: transformers
  tags:
  - pytorch
  ---
- # pollux-judge-7b

  <!-- Provide a quick summary of what the model is/does. -->

  ![banner](images/logo_pollux_horiz_short_WHITEBG.png)

- pollux-judge-7b is a 7-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
  The model assesses answer quality given input instruction, specific criteria and rubrics, providing automated LLM performance evaluation for Russian-language tasks.

  ## Model Details
@@ -24,16 +24,11 @@ The model assesses answer quality given input instruction, specific criteria and

  <!-- Provide a longer summary of what this model is. -->

- pollux-judge-7b is a part of the POLLUX project, which is dedicated to evaluation of the generative capabilities of Large Language Models (LLMs).
- Part of this project is the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX) that introduces taxonomies of both generative tasks and evaluation criteria alongside quantitative and qualitative estimation of top-tier LLMs' responses.
- pollux-judge-7b is a decoder model based on [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) and trained sequence-to-sequence to predict a numerical score and textual rationale based on the given input instruction, LLM's answer, particular criterion, rubrics and reference answer if any.
- The model technically works for any type of instruction and criterion formatted appropriately, but has been trained on the instructions and criteria from the taxonomies of tasks and criteria from the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).
-
- pollux-judge-7b is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
  At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

- Built upon the [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) architecture, pollux-judge-7b is a decoder-based 7 billion parameter model trained in a sequence-to-sequence fashion.
- The model is designed to predict both numerical scores and detailed textual rationales based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

  While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).

@@ -58,7 +53,7 @@ While the model is technically capable of processing any type of instruction and

  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- pollux-judge-7b is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
  The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.


@@ -76,7 +71,7 @@ For optimal performance and reliable results, users should structure each evalua

  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- All content, responses, and outputs generated by pollux-judge-7b (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
  Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

  The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
@@ -139,7 +134,7 @@ prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                  criteria_name=criteria_name,
                                  criteria_rubrics=criteria_rubrics)

- MODEL_PATH = "ai-forever/pollux-judge-7b"
  tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
  model = AutoModelForCausalLM.from_pretrained(
      MODEL_PATH,
@@ -198,9 +193,10 @@ From this dataset, we performed stratified random sampling across tasks to obtai

  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- The model was trained in sequence-to-sequence fashion.
- Input includes source instruction, LLM's answer, name of criterion, its rubrics and reference answer if present.
- The output is expected to be numerical score from provided rubrics and textual explanation.


  #### Training Hyperparameters

@@ -246,42 +242,42 @@ For detailed evaluation results see Appendix D in the [preprint](https://arxiv.o

  Spearman’s rank correlation:

- | Evaluated model | POLLUX 7B (regression) | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
- | Claude 3.5 Sonnet (2024-10-22) | 0.660 | 0.739 | -0.006 | 0.759 |
- | GPT-4o (2024-08-06) | 0.596 | 0.627 | -0.033 | 0.643 |
- | GigaChat-Max (1.0.26.20) | 0.596 | 0.640 | 0.027 | 0.649 |
- | Llama-3.1-405B | 0.613 | 0.591 | 0.022 | 0.639 |
- | T-pro-it-1.0 | 0.571 | 0.573 | -0.044 | 0.616 |
- | YaGPT-4-Pro (2024-10-23) | 0.616 | 0.635 | 0.099 | 0.671 |
- | o1 (2024-12-17) | 0.675 | 0.748 | -0.022 | 0.771 |
- | Avg. | 0.619 | 0.647 | 0.019 | 0.674 |

  MAE (mean absolute error):

- | Evaluated model | POLLUX 7B (regression) | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
- | Claude 3.5 Sonnet (2024-10-22) | 0.501 | 0.245 | 2.697 | 0.236 |
- | GPT-4o (2024-08-06) | 0.484 | 0.349 | 2.676 | 0.339 |
- | GigaChat-Max (1.0.26.20) | 0.477 | 0.350 | 2.468 | 0.342 |
- | Llama-3.1-405B | 0.517 | 0.448 | 1.912 | 0.405 |
- | T-pro-it-1.0 | 0.497 | 0.475 | 2.978 | 0.425 |
- | YaGPT-4-Pro (2024-10-23) | 0.511 | 0.387 | 1.793 | 0.369 |
- | o1 (2024-12-17) | 0.438 | 0.244 | 2.873 | 0.229 |
- | Avg. | 0.489 | 0.356 | 2.487 | 0.335 |

  Verdict Confidence (calculated on the whole test sample):

- | Evaluated model | POLLUX 7B (regression) | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
- | Claude 3.5 Sonnet (2024-10-22) | 0.800 | 0.879 | 0.645 | 0.877 |
- | GPT-4o (2024-08-06) | 0.822 | 0.877 | 0.702 | 0.877 |
  | GigaChat-Max (1.0.26.20) | 0.824 | 0.878 | 0.715 | 0.879 |
  | Llama-3.1-405B | 0.777 | 0.836 | 0.684 | 0.837 |
- | T-pro-it-1.0 | 0.791 | 0.838 | 0.644 | 0.842 |
- | YaGPT-4-Pro (2024-10-23) | 0.813 | 0.866 | 0.738 | 0.867 |
- | o1 (2024-12-17) | 0.821 | 0.885 | 0.643 | 0.882 |
- | Avg. | 0.808 | 0.866 | 0.684 | 0.867 |


  ## Technical Specifications [optional]
 
@@ -9,13 +9,13 @@ library_name: transformers
  tags:
  - pytorch
  ---
+ # pollux-judge-7b-r

  <!-- Provide a quick summary of what the model is/does. -->

  ![banner](images/logo_pollux_horiz_short_WHITEBG.png)

+ pollux-judge-7b-r is a 7-billion parameter generative language model specifically designed to evaluate the quality of other language models' responses in Russian.
  The model assesses answer quality given input instruction, specific criteria and rubrics, providing automated LLM performance evaluation for Russian-language tasks.

  ## Model Details

@@ -24,16 +24,11 @@ The model assesses answer quality given input instruction, specific criteria and

  <!-- Provide a longer summary of what this model is. -->

+ pollux-judge-7b-r is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
  At the heart of this project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

+ Built upon the [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) architecture, pollux-judge-7b-r is a decoder-based 7-billion-parameter model trained with a combination of Mean Squared Error (for the regression head) and Cross-Entropy (for the language modeling head) objectives.
+ The model is designed to predict both numerical scores and detailed textual rationales with separate heads, based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

  While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).

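The dual-head setup described in the hunk above can be pictured with a short sketch. This is a hedged illustration only, not the released modeling code: the class name, pooling strategy, and head placement are assumptions made for the sake of the example.

```python
# Illustrative sketch: a decoder backbone whose LM head generates the rationale,
# plus an extra regression head that maps a pooled hidden state to a scalar score.
# Names and pooling choices here are hypothetical, not pollux-judge-7b-r internals.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class DualHeadJudge(nn.Module):
    def __init__(self, backbone_name: str = "t-tech/T-lite-it-1.0"):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        self.score_head = nn.Linear(hidden_size, 1)  # regression head

    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1]               # (batch, seq, hidden)
        # Pool the representation of the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = last_hidden[torch.arange(last_hidden.size(0)), last_idx]
        score = self.score_head(pooled).squeeze(-1)       # numerical score per example
        return out.logits, score                          # logits -> CE loss, score -> MSE loss
```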
@@ -58,7 +53,7 @@ While the model is technically capable of processing any type of instruction and

  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

+ pollux-judge-7b-r is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
  The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.


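For illustration, a single evaluation request under these requirements could be collected as follows. The field values are invented, the `answer` key is an assumed name, and the actual prompt formatting is handled by the card's own `PROMPT_TEMPLATE` (visible in the usage snippet further down in this diff).

```python
# Hypothetical single-criterion evaluation request (all field contents invented).
evaluation_request = {
    # Source instruction given to the evaluated LLM (Russian-language task).
    "instruction": "Напишите короткое официальное письмо с просьбой перенести встречу.",
    # Response to be evaluated; key name is an assumption for this sketch.
    "answer": "Уважаемый Иван Иванович! Прошу перенести нашу встречу на четверг.",
    # One predefined criterion per evaluation run, with its scoring rubrics.
    "criteria_name": "Соблюдение делового стиля",
    "criteria_rubrics": (
        "0 — деловой стиль не соблюдается;\n"
        "1 — стиль соблюдается частично;\n"
        "2 — стиль полностью соответствует деловому письму."
    ),
}
```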
@@ -76,7 +71,7 @@ For optimal performance and reliable results, users should structure each evalua

  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

+ All content, responses, and outputs generated by pollux-judge-7b-r (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
  Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

  The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
 
@@ -139,7 +134,7 @@ prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                  criteria_name=criteria_name,
                                  criteria_rubrics=criteria_rubrics)

+ MODEL_PATH = "ai-forever/pollux-judge-7b-r"
  tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
  model = AutoModelForCausalLM.from_pretrained(
      MODEL_PATH,
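The snippet above is cut off by the hunk boundary. Below is a hedged sketch of how loading and generation might continue; the dtype, device placement, and generation settings are assumptions, not the card's exact code.

```python
# Hedged continuation sketch: finish loading, run the prompt, decode the verdict.
import torch

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,   # assumed precision
    device_map="auto",            # assumed device placement
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the score and textual rationale).
verdict = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(verdict)
```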
 
@@ -198,9 +193,10 @@ From this dataset, we performed stratified random sampling across tasks to obtai

  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

+
+ The input for the LLM-as-a-Judge model includes the source instruction, the LLM's answer, the name of the criterion, its rubrics, and a reference answer if present.
+ A separate regression head predicts the numerical score, while the language modeling head generates the textual comment.
+ The total loss is the sum of the MSE and CE objectives.

  #### Training Hyperparameters

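As a schematic illustration of that objective, the combined loss could look like the following; the function and tensor names are assumptions, not the project's training code.

```python
# Schematic combined loss: Cross-Entropy over the rationale tokens plus
# Mean Squared Error on the predicted score (label shifting omitted for brevity).
import torch
import torch.nn.functional as F

def judge_loss(lm_logits, rationale_labels, predicted_score, target_score):
    # CE over the vocabulary for the textual comment; -100 positions (e.g. the prompt) are ignored.
    ce = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)),
        rationale_labels.view(-1),
        ignore_index=-100,
    )
    # MSE between the regression head's output and the target rubric score.
    mse = F.mse_loss(predicted_score, target_score.float())
    # The card states the total loss is the sum of the two objectives.
    return ce + mse
```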
 
@@ -246,42 +242,42 @@ For detailed evaluation results see Appendix D in the [preprint](https://arxiv.o

  Spearman’s rank correlation:

+ | Evaluated model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
+ | Claude 3.5 Sonnet (2024-10-22) | 0.653 | 0.739 | -0.006 | 0.759 |
+ | GPT-4o (2024-08-06) | 0.572 | 0.627 | -0.033 | 0.643 |
+ | GigaChat-Max (1.0.26.20) | 0.582 | 0.640 | 0.027 | 0.649 |
+ | Llama-3.1-405B | 0.587 | 0.591 | 0.022 | 0.639 |
+ | T-pro-it-1.0 | 0.543 | 0.573 | -0.044 | 0.616 |
+ | YaGPT-4-Pro (2024-10-23) | 0.599 | 0.635 | 0.099 | 0.671 |
+ | o1 (2024-12-17) | 0.674 | 0.748 | -0.022 | 0.771 |
+ | Avg. | 0.602 | 0.647 | 0.019 | 0.674 |

  MAE (mean absolute error):

+ | Evaluated model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
+ | Claude 3.5 Sonnet (2024-10-22) | 0.519 | 0.245 | 2.697 | 0.236 |
+ | GPT-4o (2024-08-06) | 0.489 | 0.349 | 2.676 | 0.339 |
+ | GigaChat-Max (1.0.26.20) | 0.478 | 0.350 | 2.468 | 0.342 |
+ | Llama-3.1-405B | 0.513 | 0.448 | 1.912 | 0.405 |
+ | T-pro-it-1.0 | 0.503 | 0.475 | 2.978 | 0.425 |
+ | YaGPT-4-Pro (2024-10-23) | 0.495 | 0.387 | 1.793 | 0.369 |
+ | o1 (2024-12-17) | 0.460 | 0.244 | 2.873 | 0.229 |
+ | Avg. | 0.494 | 0.356 | 2.487 | 0.335 |

  Verdict Confidence (calculated on the whole test sample):

+ | Evaluated model | pollux-judge-7b-r | DeepSeek-R1 | M-Prometheus-14B | GPT-4o (2024-11-20) |
  | --- | --- | --- | --- | --- |
+ | Claude 3.5 Sonnet (2024-10-22) | 0.795 | 0.879 | 0.645 | 0.877 |
+ | GPT-4o (2024-08-06) | 0.820 | 0.877 | 0.702 | 0.877 |
  | GigaChat-Max (1.0.26.20) | 0.824 | 0.878 | 0.715 | 0.879 |
  | Llama-3.1-405B | 0.777 | 0.836 | 0.684 | 0.837 |
+ | T-pro-it-1.0 | 0.787 | 0.838 | 0.644 | 0.842 |
+ | YaGPT-4-Pro (2024-10-23) | 0.814 | 0.866 | 0.738 | 0.867 |
+ | o1 (2024-12-17) | 0.814 | 0.885 | 0.643 | 0.882 |
+ | Avg. | 0.806 | 0.866 | 0.684 | 0.867 |

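For reference, the first two agreement metrics reported above can be computed as follows, assuming per-example judge scores and expert reference scores are available as arrays; Verdict Confidence is a POLLUX-specific measure and is not reproduced here.

```python
# Hedged illustration of the Spearman and MAE agreement metrics (example data invented).
import numpy as np
from scipy.stats import spearmanr

judge_scores = np.array([2.0, 1.0, 0.0, 2.0, 1.0])    # assumed judge outputs
expert_scores = np.array([2.0, 1.0, 1.0, 2.0, 0.0])   # assumed expert annotations

rho, _ = spearmanr(judge_scores, expert_scores)        # Spearman’s rank correlation
mae = np.abs(judge_scores - expert_scores).mean()      # mean absolute error

print(f"Spearman: {rho:.3f}, MAE: {mae:.3f}")
```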
  ## Technical Specifications [optional]