Update README.md
README.md CHANGED
@@ -399,7 +399,6 @@ You can finetune this model on your own dataset.
 ### Metrics
 
 #### Binary Classification
-* Dataset: `FineTuned_8`
 * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)
 
 | Metric | Value |
@@ -440,6 +439,15 @@ You can finetune this model on your own dataset.
 | max_recall | 0.3936 |
 | **max_ap** | **0.5012** |
 
+
+The following figure depicts F1, recall, and precision on the test data for different thresholds.
+
+
+
+The following figure depicts how well matches and mismatches in the test data are separated by the model. To keep false positives to a minimum, a threshold higher than 0.91 is recommended; the optimal F1 score is reached at a threshold of 0.9050.
+
+
+
 <!--
 ## Bias, Risks and Limitations
 
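At inference time, the recommended threshold amounts to a simple matching rule. The sketch below is only an illustration, not code from this commit: it assumes the repo id taken from the prompt link in the Training Dataset section below, uses plain cosine similarity over normalized embeddings, and the example posts are invented.

```python
from sentence_transformers import SentenceTransformer, util

# Repo id assumed from the prompt link below; adjust if the model lives elsewhere.
model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")

query_post = "Example Telegram post making a claim."           # illustrative text
candidate_post = "Another post possibly repeating the claim."  # illustrative text

# Encode both posts and compare them with cosine similarity.
embeddings = model.encode([query_post, candidate_post], normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Thresholds from the curves above: > 0.91 keeps false positives low,
# 0.9050 maximizes F1 on the test data.
is_match = score > 0.91
print(f"cosine similarity = {score:.4f}, match = {is_match}")
```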
@@ -455,6 +463,9 @@ You can finetune this model on your own dataset.
 ## Training Details
 
 ### Training Dataset
+The model was trained on a weakly annotated dataset. The data was taken from Telegram, more specifically from a set of about 200 channels that have been subject to a fact-check by Correctiv, dpa, Faktenfuchs, or AFP.
+
+Weak annotation was performed using GPT-4o. The model was prompted to find semantically identical posts using this [prompt](https://huggingface.co/Sami92/multiling-e5-large-instruct-claim-matching/blob/main/prompt.txt). For non-matches the cosine similarity was reduced by 1.2 for training, and for matches it was frozen to 0.98.
 
 #### Unnamed Dataset
 
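Read literally, the target construction described above could look like the following sketch. This is an assumption about the wording, not code from the card: "reduced by 1.2" is taken to mean subtracting 1.2 from the base model's cosine similarity, and the helper name is invented.

```python
def training_target(is_match: bool, base_cosine: float) -> float:
    """Turn a GPT-4o weak label into a similarity target for training.

    Hypothetical reading of the card: "reduced by 1.2" is interpreted as
    subtracting 1.2 from the cosine similarity of the base model.
    """
    if is_match:
        return 0.98               # matches are frozen to 0.98
    return base_cosine - 1.2      # non-matches are pushed well below the matches
```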
@@ -481,8 +492,8 @@ You can finetune this model on your own dataset.
 ```
 
 ### Evaluation Dataset
-
-####
+Evaluation was performed on a dataset from the same Telegram channels as the training data. Again, GPT-4o was used to identify matching claims; however, for the test data, trained annotators validated the results, and pairs that GPT-4o had classified as matches but were in fact mismatches were removed. A ratio of 1:30 was chosen, i.e. for every match there are 30 mismatches. This is supposed to reflect a realistic scenario in which many more posts are not identical to a query post.
+#### Manually checked Telegram Dataset
 
 
 * Size: 18,355 evaluation samples
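The metrics table above is produced with sentence-transformers' BinaryClassificationEvaluator. Below is a minimal sketch of running it on labelled pairs like these; the variable names and example pairs are invented, and only the evaluator call follows the library's documented API.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# posts1/posts2 are paired Telegram posts; labels are 1 for a match and 0 for a
# mismatch (roughly 30 mismatches per match, as described above).
posts1 = ["Example claim post.", "Another claim post."]
posts2 = ["A post repeating the first claim.", "An unrelated post."]
labels = [1, 0]

evaluator = BinaryClassificationEvaluator(
    sentences1=posts1,
    sentences2=posts2,
    labels=labels,
    name="telegram-eval",
)

model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")
results = evaluator(model)  # runs the evaluation; recent versions return a dict of metrics
print(results)
```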