Taja Kuzman committed: Update README.md

README.md CHANGED
```diff
@@ -277,13 +277,17 @@ trained on all three datasets, outperforms classifiers that were trained on just
 
 Additionally, we evaluated the X-GENRE classifier on the multilingual X-GINCO dataset, which comprises samples
 of texts from the MaCoCu web corpora (http://hdl.handle.net/11356/1969).
-The X-GINCO dataset comprises 790 instances in 10 languages -
+The X-GINCO dataset comprises 790 manually-annotated instances in 10 languages -
 Albanian, Croatian, Catalan, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian.
 To evaluate performance across genre labels, the dataset is balanced by label,
 and the vague label "Other" is not included.
 Additionally, instances that were predicted with a confidence score below 0.80 were not included in the test dataset.
+
+
 The evaluation shows high cross-lingual performance of the model,
 even when applied to languages that are not related to the training languages (English and Slovenian) and when applied to texts in non-Latin scripts.
+
+
 The outlier is Maltese, on which the classifier does not perform well -
 we presume this is because Maltese is not included in the pretraining data of the XLM-RoBERTa model.
 
```
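The confidence-based filtering mentioned in the diff (dropping instances predicted with a score below 0.80) can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the `filter_confident` helper, and the example records are all assumptions, modelled loosely on the `{"label": ..., "score": ...}` output shape of a Hugging Face text-classification pipeline.

```python
# Sketch of the confidence filtering described in the README change.
# Assumption: each prediction is a dict with "text", "label", and "score"
# keys, as a text-classification pipeline would typically return.

CONFIDENCE_THRESHOLD = 0.80  # threshold stated in the README

def filter_confident(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Keep only instances whose predicted label reached the threshold."""
    return [p for p in predictions if p["score"] >= threshold]

# Illustrative records, not real X-GINCO data:
predictions = [
    {"text": "...", "label": "News", "score": 0.97},
    {"text": "...", "label": "Opinion/Argumentation", "score": 0.55},
]

kept = filter_confident(predictions)
print(len(kept))  # → 1: only the high-confidence instance remains
```

Applied during test-set construction, this keeps evaluation focused on instances the classifier labels with reasonable certainty.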