Update README.md
README.md CHANGED
@@ -399,7 +399,6 @@ You can finetune this model on your own dataset.
 ### Metrics
 
 #### Binary Classification
-* Dataset: `FineTuned_8`
 * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)
 
 | Metric | Value |
@@ -440,6 +439,15 @@ You can finetune this model on your own dataset.
 | max_recall | 0.3936 |
 | **max_ap** | **0.5012** |
 
+
+The following figure depicts F1, recall, and precision on the test data for different thresholds.
+
+
+
+The following figure depicts how well matches and mismatches in the test data are separated by the model. To keep false positives to a minimum, a threshold higher than 0.91 is recommended; the optimal F1 score is reached at a threshold of 0.9050.
+
+
+
 <!--
 ## Bias, Risks and Limitations
 
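At inference time, the recommended threshold amounts to a simple matching rule. The sketch below is only an illustration, not code from this commit: it assumes the repo id taken from the prompt link in the Training Dataset section below, uses plain cosine similarity over normalized embeddings, and the example posts are invented.

```python
from sentence_transformers import SentenceTransformer, util

# Repo id assumed from the prompt link below; adjust if the model lives elsewhere.
model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")

query_post = "Example Telegram post making a claim."           # illustrative text
candidate_post = "Another post possibly repeating the claim."  # illustrative text

# Encode both posts and compare them with cosine similarity.
embeddings = model.encode([query_post, candidate_post], normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Thresholds from the curves above: > 0.91 keeps false positives low,
# 0.9050 maximizes F1 on the test data.
is_match = score > 0.91
print(f"cosine similarity = {score:.4f}, match = {is_match}")
```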
@@ -455,6 +463,9 @@ You can finetune this model on your own dataset.
 ## Training Details
 
 ### Training Dataset
+The model was trained on a weakly annotated dataset. The data was taken from Telegram, more specifically from a set of about 200 channels that have been subject to a fact-check by Correctiv, dpa, Faktenfuchs, or AFP.
+
+Weak annotation was performed using GPT-4o. The model was prompted to find semantically identical posts using this [prompt](https://huggingface.co/Sami92/multiling-e5-large-instruct-claim-matching/blob/main/prompt.txt). For non-matches the cosine similarity was reduced by 1.2 for training, and for matches it was frozen to 0.98.
 
 #### Unnamed Dataset
 
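Read literally, the target construction described above could look like the following sketch. This is an assumption about the wording, not code from the card: "reduced by 1.2" is taken to mean subtracting 1.2 from the base model's cosine similarity, and the helper name is invented.

```python
def training_target(is_match: bool, base_cosine: float) -> float:
    """Turn a GPT-4o weak label into a similarity target for training.

    Hypothetical reading of the card: "reduced by 1.2" is interpreted as
    subtracting 1.2 from the cosine similarity of the base model.
    """
    if is_match:
        return 0.98               # matches are frozen to 0.98
    return base_cosine - 1.2      # non-matches are pushed well below the matches
```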
@@ -481,8 +492,8 @@ You can finetune this model on your own dataset.
 ```
 
 ### Evaluation Dataset
-
-####
+Evaluation was performed on a dataset from the same Telegram channels as the training data. Again, GPT-4o was used to identify matching claims; however, for the test data, trained annotators validated the results, and pairs that GPT-4o had classified as matches but were in fact mismatches were removed. A ratio of 1:30 was chosen, i.e. for every match there are 30 mismatches. This is supposed to reflect a realistic scenario in which many more posts are not identical to a query post.
+#### Manually checked Telegram Dataset
 
 
 * Size: 18,355 evaluation samples
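The metrics table above is produced with sentence-transformers' BinaryClassificationEvaluator. Below is a minimal sketch of running it on labelled pairs like these; the variable names and example pairs are invented, and only the evaluator call follows the library's documented API.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# posts1/posts2 are paired Telegram posts; labels are 1 for a match and 0 for a
# mismatch (roughly 30 mismatches per match, as described above).
posts1 = ["Example claim post.", "Another claim post."]
posts2 = ["A post repeating the first claim.", "An unrelated post."]
labels = [1, 0]

evaluator = BinaryClassificationEvaluator(
    sentences1=posts1,
    sentences2=posts2,
    labels=labels,
    name="telegram-eval",
)

model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")
results = evaluator(model)  # runs the evaluation; recent versions return a dict of metrics
print(results)
```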