Jan-v1 leverages the newly released [Qwen3-4B-thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) as its base model.

## Evaluation

### Question Answering (SimpleQA)

For question answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.2% accuracy on SimpleQA; a sketch of the scoring loop appears after the table.

| Model | SimpleQA Accuracy |
| :--- | :--- |
| **Jan-v1-4B (Ours)** | **91.2%** |

Jan-v1's strategic scaling has resulted in a notable performance uplift.

*The 91.2% SimpleQA accuracy represents a significant milestone in factual question answering for models of this scale, demonstrating the effectiveness of our scaling and fine-tuning approach.*
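SimpleQA accuracy is the fraction of short factual questions whose answers a grader judges correct against the gold answer. Below is a minimal illustrative sketch of such a scoring loop; the endpoint, model ids, helper names, and data format are assumptions for illustration, not the harness behind the number above.

```python
# Illustrative SimpleQA-style scoring loop (not the actual evaluation harness).
# Assumes an OpenAI-compatible endpoint; all names below are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed")

def ask_model(question: str) -> str:
    """Get the candidate model's short factual answer."""
    resp = client.chat.completions.create(
        model="jan-v1-4b",  # placeholder model id
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def judge_answer(question: str, gold: str, predicted: str) -> bool:
    """Ask a judge model whether the prediction matches the gold answer."""
    prompt = (
        f"Question: {question}\nGold answer: {gold}\n"
        f"Predicted answer: {predicted}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model="judge-model",  # placeholder judge id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

# Each line of the (hypothetical) data file: {"question": ..., "answer": ...}
with open("simpleqa.jsonl") as f:
    items = [json.loads(line) for line in f]

correct = sum(
    judge_answer(it["question"], it["answer"], ask_model(it["question"]))
    for it in items
)
print(f"SimpleQA accuracy: {correct / len(items):.1%}")
```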
### Report Generation & Factuality

Jan-v1 was also evaluated on benchmarks that test factual report generation from web sources, scored by an LLM-as-judge: our proprietary `Jan Exam - Longform` and `DeepResearchBench`. A sketch of the judging step follows the table.

| Model | Average Overall Score |
| :--- | :--- |
| o4-mini | 7.30 |
| **Jan-v1-4B (Ours)** | **7.17** |
| gpt-4.1 | 6.90 |
| Qwen3-4B-Thinking-2507 | 6.84 |
| 4o-mini | 6.60 |
| Jan-nano-128k | 5.63 |
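The "Average Overall Score" is an LLM-as-judge rating averaged over benchmark items. The sketch below shows what that judging step can look like; the judge model, rubric wording, and 0-10 scale are assumptions for illustration, since the README does not specify them.

```python
# Illustrative LLM-as-judge scoring for generated reports (not the actual rubric).
import re
from statistics import mean
from openai import OpenAI

client = OpenAI()  # judge endpoint; credentials and model id are placeholders

def judge_report(task: str, sources: str, report: str) -> float:
    """Ask the judge model for an overall 0-10 score and parse it."""
    prompt = (
        "You are grading a research report for factuality and coverage.\n"
        f"Task: {task}\n\nSource material:\n{sources}\n\nReport:\n{report}\n\n"
        "Return only an overall score from 0 to 10 (decimals allowed)."
    )
    resp = client.chat.completions.create(
        model="judge-model",  # placeholder judge id
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\d+(?:\.\d+)?", resp.choices[0].message.content)
    return float(match.group())

# items: (task, sources, model_report) triples, e.g. drawn from DeepResearchBench
def average_overall_score(items) -> float:
    """Average the judge's overall score across all benchmark items."""
    return mean(judge_report(t, s, r) for t, s, r in items)
```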
## Quick Start

### Integration with Jan App
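The Jan app runs Jan-v1 locally and exposes an OpenAI-compatible server, so any OpenAI client can talk to the model once it is loaded. Below is a minimal sketch, assuming Jan's default local API address (`http://localhost:1337/v1`) and a placeholder model identifier; check your Jan installation for the actual values.

```python
# Minimal sketch: querying Jan-v1 through Jan's local OpenAI-compatible server.
# The base URL and model id below are assumptions; verify them in the Jan app.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1337/v1",  # Jan's default local API address (assumed)
    api_key="not-needed",                 # the local server does not require a key
)

resp = client.chat.completions.create(
    model="jan-v1-4b",  # placeholder: use the model id shown in the Jan app
    messages=[{"role": "user", "content": "Who wrote 'The Selfish Gene'?"}],
)
print(resp.choices[0].message.content)
```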