Update README.md
## Evaluation

### Question Answering (SimpleQA)

For question-answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.2% accuracy.

| Model | SimpleQA Accuracy |
| :--- | :--- |

Jan-v1's strategic scaling has resulted in a notable performance uplift.

*The 91.2% SimpleQA accuracy represents a significant milestone in factual question answering for models of this scale, demonstrating the effectiveness of our scaling and fine-tuning approach.*

### Report Generation & Factuality

We evaluate factual report generation from web sources using an LLM-as-judge. The benchmark suite includes our proprietary `Jan Exam - Longform` and `DeepResearchBench`. A minimal sketch of this kind of judging setup follows the table below.

| Model | Average Overall Score |
| :--- | :--- |
| o4-mini | 7.30 |
| **Jan-v1-4B (Ours)** | **7.17** |
| gpt-4.1 | 6.90 |
| Qwen3-4B-Thinking-2507 | 6.84 |
| 4o-mini | 6.60 |
| Jan-nano-128k | 5.63 |
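
To make the setup concrete, here is a minimal sketch of an LLM-as-judge scoring loop. Everything in it is illustrative rather than the actual benchmark implementation: the `JUDGE_PROMPT` rubric, the `judge_report` and `average_overall_score` helpers, the choice of `gpt-4.1` as the judge model, and the assumption that scores lie on a 1-10 scale averaged across reports.

```python
import re
import statistics
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; the real benchmark prompts are not public.
JUDGE_PROMPT = """You are grading a research report for factual accuracy,
coverage of the question, and citation quality.
Question: {question}
Report: {report}
Reply with a single line: SCORE: <number from 1 to 10>."""


def judge_report(question: str, report: str, judge_model: str = "gpt-4.1") -> float:
    """Ask a judge LLM to score one generated report on a 1-10 scale."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, report=report),
        }],
        temperature=0.0,  # deterministic judging
    )
    text = response.choices[0].message.content
    match = re.search(r"SCORE:\s*([\d.]+)", text)
    if match is None:
        raise ValueError(f"Judge returned no parseable score: {text!r}")
    return float(match.group(1))


def average_overall_score(samples: list[dict]) -> float:
    """Average judge scores over (question, report) pairs,
    yielding a single number comparable across models."""
    return statistics.mean(
        judge_report(s["question"], s["report"]) for s in samples
    )
```

Averaging per-report scores this way produces one "Average Overall Score" per model, as in the table above; the actual benchmarks may additionally weight multiple rubric dimensions before averaging.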

## Quick Start

### Integration with Jan App