Commit a7ce10b by sariola (verified) · 1 Parent(s): cbe9380

Update README.md

Files changed (1)
  1. README.md +24 -23
README.md CHANGED
@@ -26,8 +26,6 @@ model_creator: Flow AI
26
  model_type: phi3.5
27
  quantized_by: Flow AI
28
  ---
29
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png)
30
-
31
  # Flow-Judge-v0.1-AWQ
32
  - Original model: [Flow-Judge-v0.1](https://huggingface.co/flowaicom/Flow-Judge-v0.1)
33
  - Model collection: [Flow-Judge-v0.1 models](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)
@@ -55,27 +53,18 @@ tokenizer.save_pretrained(quant_path)
55
 
56
  TBD
57
 
58
- # Original model card: Flow-Judge-v0.1
59
 
60
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/NgFJqVmUgrhOnphd47VEm.png)
61
 
62
- <div class="center-content">
63
- <div class="links">
64
- <a href="https://github.com/flowaicom/flow-judge">flow-judge library</a>
65
- |
66
- <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a>
67
- </div>
68
- </div>
69
 
70
  ## Model Summary
71
 
72
  Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits it's architecture from Phi-3.5-mini instruct model which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models in both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
73
 
74
- __More information__
75
- - [Flow Judge website](https://www.flow-ai.com/judge)
76
- - [Technical report](https://www.flow-ai.com/blog/flow-judge)
77
- - [Github repo](https://github.com/flowaicom/flow-judge)
78
-
79
  __Quantized weights__
80
  - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
81
  - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
@@ -94,7 +83,7 @@ Flow Judge is intended to be used on custom LLM system evaluation tasks.
94
  - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
95
 
96
  - Easy to interpret results:
97
- - Flow Judge produces structured evaluations with <feedback> and <score> tags.
98
  - Qualitative feedback: Flow Judge detects errors and grades outputs and provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric while highlighting problematic parts of the responses.
99
  - Score: Based on a grading rubric Flow Judge will return a numerical score on binary, likert-3 or likert-5 scale.
100
 
@@ -116,12 +105,12 @@ Flow-Judge-v0.1 has been trained on synthetically generated datasets. The constr
116
 
117
  This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
118
 
119
- Read more about the dataset construction from [here](https://www.flow-ai.com/blog/flow-judge)
120
 
121
 
122
  ### Fine-tuning
123
 
124
- For fine-tuning we used Axolotl's preprocessing to ensure input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRa. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge).
125
 
126
  ## Usage
127
 
@@ -406,7 +395,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
406
  </tbody>
407
  </table>
408
 
409
- \* _not suitable for 3 likert_
410
 
411
 
412
  ### RAGTruth
@@ -526,7 +515,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
526
  </tr>
527
  </table>
528
 
529
- \* _reported in Galileo luna paper_
530
 
531
 
532
  ### HaluEval, Covid-QA, PubMedQA
@@ -707,7 +696,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
707
  </tbody>
708
  </table>
709
 
710
- \* _reported in lynx paper_
711
  ### Feedback Bench
712
 
713
  <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
@@ -758,4 +747,16 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
758
  </tr>
759
  </table>
760
 
761
- \* _reported in prometheus paper using reference answer. Note the rest of the models have been evaluated without reference answer_

26
  model_type: phi3.5
27
  quantized_by: Flow AI
28
  ---
 
 
29
  # Flow-Judge-v0.1-AWQ
30
  - Original model: [Flow-Judge-v0.1](https://huggingface.co/flowaicom/Flow-Judge-v0.1)
31
  - Model collection: [Flow-Judge-v0.1 models](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)
 
53
 
54
  TBD
55
 
 
56
 
57
+ # Original model card: Flow-Judge-v0.1
58
 
59
+ <p align="center">
60
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png" alt="Centered image">
61
+ </p>
62
+ <p align="center">🚀 <a href="https://www.flow-ai.com/judge">Flow Judge</a> | 📄 <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a> | 💻 <a href="https://github.com/flowaicom/flow-judge">flow-judge</a></p>
 
 
 
63
 
64
  ## Model Summary
65
 
66
  Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits it's architecture from Phi-3.5-mini instruct model which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models in both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
67
 
 
 
 
 
 
68
  __Quantized weights__
69
  - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
70
  - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
 
83
  - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
84
 
85
  - Easy to interpret results:
86
+ - Flow Judge produces structured evaluations with `<feedback>` and `<score>` tags.
87
  - Qualitative feedback: Flow Judge detects errors and grades outputs and provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric while highlighting problematic parts of the responses.
88
  - Score: Based on a grading rubric Flow Judge will return a numerical score on binary, likert-3 or likert-5 scale.
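
The structured output described above can be consumed programmatically. The following is a minimal sketch, assuming the model's response wraps its explanation in `<feedback>...</feedback>` and an integer score in `<score>...</score>`. The function name `parse_judge_output`, the `JudgeResult` type, and the sample response are hypothetical illustrations, not part of the flow-judge library.

```python
import re
from typing import NamedTuple, Optional


class JudgeResult(NamedTuple):
    """Hypothetical container for a parsed evaluation."""
    feedback: str
    score: Optional[int]


def parse_judge_output(text: str) -> JudgeResult:
    """Extract feedback and score from a tagged response.

    Assumes <feedback>...</feedback> and <score>...</score> tags;
    a missing tag yields an empty feedback string or a None score.
    """
    # DOTALL lets the feedback span multiple lines.
    feedback_match = re.search(r"<feedback>(.*?)</feedback>", text, re.DOTALL)
    # The score is assumed to be a single integer, possibly padded by whitespace.
    score_match = re.search(r"<score>\s*(-?\d+)\s*</score>", text)

    feedback = feedback_match.group(1).strip() if feedback_match else ""
    score = int(score_match.group(1)) if score_match else None
    return JudgeResult(feedback, score)


# Invented sample response, for illustration only.
sample = (
    "<feedback>\nThe response omits the refund policy.\n</feedback>\n"
    "<score>\n2\n</score>"
)
result = parse_judge_output(sample)
print(result.feedback)
print(result.score)
```

Returning `None` for a missing score, rather than raising, lets a caller decide whether to retry the evaluation or discard the sample.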
89
 
 
105
 
106
  This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
107
 
108
+ Read more about the dataset construction from [here](https://www.flow-ai.com/blog/flow-judge#dataset-construction)
109
 
110
 
111
  ### Fine-tuning
112
 
113
+ For fine-tuning we used Axolotl's preprocessing to ensure input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRa. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge#fine-tuning).
114
 
115
  ## Usage
116
 
 
395
  </tbody>
396
  </table>
397
 
398
+ \* _Reported in model paper_
399
 
400
 
401
  ### RAGTruth
 
515
  </tr>
516
  </table>
517
 
518
+ \* _reported in model paper_
519
 
520
 
521
  ### HaluEval, Covid-QA, PubMedQA
 
696
  </tbody>
697
  </table>
698
 
699
+ \* _reported in model paper_
700
  ### Feedback Bench
701
 
702
  <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
 
747
  </tr>
748
  </table>
749
 
750
+ \* _reported in model paper using reference answers_
751
+
752
+ ## License
753
+ We opted for the Apache 2.0 license for Flow Judge to provide the community with an open, small yet powerful LM evaluator. Our goal is to support the wider adoption of rigorous evaluation techniques in LLM system development, making them more accessible to practitioners and researchers.
754
+
755
+ ## Limitations and future work
756
+ Multilingual evaluation: Flow Judge has been fine-tuned exclusively on English data. While the foundation model (Phi-3.5-mini-instruct [17]) may possess multilingual capabilities, we have not systematically evaluated Flow Judge performance in non-English contexts. We plan to explore multilingual LM evaluators in the future.
757
+
758
+ Long context and structured inputs: Our training dataset encompasses a wide range of custom metrics relevant to evaluating LLM systems. However, it does not include examples with long context inputs or structured data formats such as JSON, since these are harder to synthetically generate. This limitation may impact Flow Judge's performance when evaluating responses that require processing extensive context or parsing structured input. Extending our model’s capabilities to handle these input types represents an important area for future research.
759
+
760
+ Math and coding: The current version has not been trained on specific task domains such as arithmetic problems or code evaluation. As a result, its performance in these specialized areas may be limited. Future iterations of the model should address these gaps.
761
+
762
+ Domain-specific knowledge and complex multi-step evaluations: Flow Judge may struggle with highly specialized domain knowledge or proprietary data outside the training scope of its foundation model. Additionally, evaluation tasks requiring multi-step reasoning or complex logical processes may challenge the model's capabilities. We strongly recommend conducting meta-evaluations of the model performance before deploying it in specialized or highly complex evaluation scenarios.