Taja Kuzman committed on

Update README.md
README.md CHANGED

@@ -136,7 +136,8 @@ The model can be used for classification into topic labels from the
 Based on a manually-annotated test set (in Croatian, Slovenian, Catalan and Greek),
 the model achieves micro-F1 score of 0.733, macro-F1 score of 0.745 and accuracy of 0.733,
 and outperforms the GPT-4o model (version `gpt-4o-2024-05-13`) used in a zero-shot setting.
-If we use only labels that are predicted with a confidence score equal or higher than 0.90,
+If we use only labels that are predicted with a confidence score equal or higher than 0.90,
+the model achieves micro-F1 and macro-F1 of 0.80.
 
 ## Intended use and limitations
 
@@ -216,7 +217,8 @@ and enriched with information which specific subtopics belong to the top-level t
 
 The model was fine-tuned on a training dataset consisting of 15,000 news in four languages (Croatian, Slovenian, Catalan and Greek).
 The news texts were extracted from the [MaCoCu web corpora](https://macocu.eu/) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
-The training dataset was automatically annotated with the IPTC Media Topic labels by
+The training dataset was automatically annotated with the IPTC Media Topic labels by
+the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (yielding 0.72 micro-F1 and 0.73 macro-F1 on the test dataset).
 
 Label distribution in the training dataset:
 
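The confidence filtering added in the first hunk (keeping only labels predicted with a score of at least 0.90) can be sketched as follows. This is a minimal illustration, not the model card's own code: the prediction dicts mimic the `{"label": ..., "score": ...}` output shape of a Hugging Face `text-classification` pipeline, and the labels and scores are made-up example values.

```python
# Keep a predicted topic label only when its confidence score is >= 0.90,
# as described in the README edit. Prediction dicts below are illustrative
# stand-ins for pipeline output, not real model predictions.

THRESHOLD = 0.90

def filter_confident(predictions, threshold=THRESHOLD):
    """Return only predictions whose confidence meets the threshold."""
    return [p for p in predictions if p["score"] >= threshold]

# Hypothetical predictions for three news texts:
predictions = [
    {"label": "sport", "score": 0.97},
    {"label": "politics", "score": 0.55},
    {"label": "economy, business and finance", "score": 0.91},
]

confident = filter_confident(predictions)
print([p["label"] for p in confident])
# → ['sport', 'economy, business and finance']
```

Low-confidence predictions can then be dropped or routed to manual review, trading coverage for the higher micro-F1 and macro-F1 (0.80) reported above.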