Update README.md
Browse files
README.md
CHANGED
@@ -716,4 +716,16 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
|
|
716 |
</tr>
|
717 |
</table>
|
718 |
|
719 |
-
\* _reported in model paper using reference answers_
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
716 |
</tr>
|
717 |
</table>
|
718 |
|
719 |
+
\* _reported in model paper using reference answers_
|
720 |
+
|
721 |
+
## License
|
722 |
+
We opted for the Apache 2.0 license for Flow Judge to provide the community with an open, small yet powerful LM evaluator. Our goal is to support the wider adoption of rigorous evaluation techniques in LLM system development, making them more accessible to practitioners and researchers.
|
723 |
+
|
724 |
+
## Limitations and future work
|
725 |
+
Multilingual evaluation: Flow Judge has been fine-tuned exclusively on English data. While the foundation model (Phi-3.5-mini-instruct [17]) may possess multilingual capabilities, we have not systematically evaluated Flow Judge performance in non-English contexts. We plan to explore multi-lingual LM evaluators in the future.
|
726 |
+
|
727 |
+
Long context and structured Inputs: Our training dataset encompasses a wide range of custom metrics relevant to evaluating LLM systems. However, it does not include examples with long context inputs or structured data formats such as JSON, since these are harder to synthetically generate. This limitation may impact Flow Judge's performance when evaluating responses that require processing extensive context or parsing structured input. Extending our model’s capabilities to handle these input types represents an important area for future research.
|
728 |
+
|
729 |
+
Math and coding: The current version has not been trained on specific task domains such as arithmetic problems or code evaluation. As a result, its performance in these specialized areas may be limited. Future iterations of the model should address these gaps.
|
730 |
+
|
731 |
+
Domain-specific knowledge and complex multi-step evaluations: Flow Judge may struggle with highly specialized domain knowledge or proprietary data outside the training scope of its foundation model. Additionally, evaluation tasks requiring multi-step reasoning or complex logical processes may challenge the model's capabilities. We strongly recommend conducting meta-evaluations of the model performance before deploying it in specialized or highly complex evaluation scenarios.
|