---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- IPTC
- news
- news topic
- IPTC topic
- IPTC NewsCode
- topic categorization
---
# Multilingual IPTC Media Topic Classifier
Text classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek) that was automatically annotated by OpenAI's GPT-4o
model with the [top-level IPTC
Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
The model classifies texts into topic labels from the
[IPTC NewsCodes schema](https://iptc.org/std/NewsCodes/guidelines/#_what_are_the_iptc_newscodes) and can be
applied to news text in any language supported by `xlm-roberta-large`.
## Intended use and limitations
For reliable results, apply the classifier to documents of sufficient length (as a rule of thumb, at least 75 words).
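The length guideline can be checked before any text is sent to the classifier. The following is a minimal sketch: the helper names `has_enough_words` and `split_by_length` are hypothetical, and counting whitespace-separated tokens is an assumption that will not work for languages written without spaces (for example Chinese or Japanese).

```python
MIN_WORDS = 75  # rule-of-thumb minimum length from the guideline


def has_enough_words(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the text has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


def split_by_length(texts, min_words: int = MIN_WORDS):
    """Separate texts into (long_enough, too_short) lists, preserving order."""
    long_enough, too_short = [], []
    for text in texts:
        if has_enough_words(text, min_words):
            long_enough.append(text)
        else:
            too_short.append(text)
    return long_enough, too_short
```

Texts in the `too_short` list can still be classified, but their predictions should be treated with more caution.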
Usage example:
```python
from simpletransformers.classification import ClassificationModel
model_args = {
    "num_train_epochs": 5,
    "learning_rate": 8e-06,
    "train_batch_size": 32,
    "max_seq_length": 512,
    "silent": True,
}

# Set use_cuda=False when no GPU is available
model = ClassificationModel(
    "xlmroberta", "classla/multilingual-IPTC-news-topic-classifier", use_cuda=True,
    args=model_args
)
predictions, logit_output = model.predict([
"Slovenian handball team makes it to Paris Olympics semifinal Lille, 8 August - Slovenia defeated Norway 33:28 in the Olympic men's handball tournament in Lille late on Wednesday to advance to the semifinal where they will face Denmark on Friday evening. This is the best result the team has so far achieved at the Olympic Games and one of the best performances in the history of Slovenia's team sports squads.",
"Second hottest July breaks 13-month record streak, EU scientists say BRUSSELS, Aug 8 (Reuters) - Last month was the second hottest July for the planet on record, breaking a 13-month period when each month was warmest, which had been in part fuelled by the warming El Nino weather pattern, the European Union's Copernicus Climate Change Service said on Thursday. The month was 1.48 degrees Celsius (2.7 degrees Fahrenheit) above the pre-industrial reference of 1850-1990, Copernicus said in a monthly report, while the last 12 months were 1.64 C above the pre-industrial average due to climate change."]
)
predictions
# Output: array([3, 14])
[model.config.id2label[i] for i in predictions]
# Output: ['sport', 'weather']
```
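The second value returned by `model.predict` can be turned into a confidence score per prediction. The sketch below assumes that `logit_output` holds one row of raw logits per input text, as its name suggests; the helpers `softmax` and `top_label` are hypothetical and use only the standard library, and the three-class logits are toy values, not model output.

```python
import math


def softmax(logits):
    """Convert one row of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Toy logits for a 3-class case; real rows would have 17 values.
example_id2label = {0: "education", 1: "human interest", 2: "society"}
label, prob = top_label([0.5, 2.0, -1.0], example_id2label)
print(label, round(prob, 3))
```

With the loaded model, the same helper could be applied as `[top_label(row, model.config.id2label) for row in logit_output]`, which makes it possible to discard predictions below a chosen probability threshold.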
## IPTC Media Topic categories
The classifier uses the top level of the IPTC Media Topic NewsCodes schema, consisting of 17 labels.
List of labels:
```python
labels_list = ['education', 'human interest', 'society', 'sport', 'crime, law and justice',
    'disaster, accident and emergency incident', 'arts, culture, entertainment and media', 'politics',
    'economy, business and finance', 'lifestyle and leisure', 'science and technology',
    'health', 'labour', 'religion', 'weather', 'environment', 'conflict, war and peace']

labels_map = {0: 'education', 1: 'human interest', 2: 'society', 3: 'sport', 4: 'crime, law and justice',
    5: 'disaster, accident and emergency incident', 6: 'arts, culture, entertainment and media',
    7: 'politics', 8: 'economy, business and finance', 9: 'lifestyle and leisure', 10: 'science and technology',
    11: 'health', 12: 'labour', 13: 'religion', 14: 'weather', 15: 'environment', 16: 'conflict, war and peace'}
```
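Both lookup directions can be derived from the order of the label list, which avoids maintaining the mapping by hand. This is a small sketch; the names `id2label`, `label2id` and `ids_to_labels` are chosen here for illustration.

```python
labels_list = [
    'education', 'human interest', 'society', 'sport', 'crime, law and justice',
    'disaster, accident and emergency incident', 'arts, culture, entertainment and media',
    'politics', 'economy, business and finance', 'lifestyle and leisure',
    'science and technology', 'health', 'labour', 'religion', 'weather',
    'environment', 'conflict, war and peace',
]

# The position of a label in the list is its integer id
id2label = dict(enumerate(labels_list))
label2id = {label: idx for idx, label in id2label.items()}


def ids_to_labels(ids):
    """Map predicted integer ids to their label names."""
    return [id2label[int(i)] for i in ids]


print(ids_to_labels([3, 14]))  # ['sport', 'weather']
```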
Description of labels:
The label descriptions are based on those provided in the [IPTC Media Topic NewsCodes schema](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html)
and are enriched with information on which specific subtopics belong to each top-level topic, based on the IPTC Media Topic hierarchy.
| Label | Description |
|:------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| disaster, accident and emergency incident | Man-made or natural events resulting in injuries, death or damage, e.g., explosions, transport accidents, famine, drowning, natural disasters, emergency planning and response. |
| human interest | News about life and behavior of royalty and celebrities, news about obtaining awards, ceremonies (graduation, wedding, funeral, celebration of launching something), birthdays and anniversaries, and news about silly or stupid human errors. |
| politics | News about local, regional, national and international exercise of power, including news about election, fundamental rights, government, non-governmental organisations, political crises, non-violent international relations, public employees, government policies. |
| education | All aspects of furthering knowledge, formally or informally, including news about schools, curricula, grading, remote learning, teachers and students. |
| crime, law and justice | News about committed crime and illegal activities, the system of courts, law and law enforcement (e.g., judges, lawyers, trials, punishments of offenders). |
| economy, business and finance | News about companies, products and services, any kind of industries, national economy, international trading, banks, (crypto)currency, business and trade societies, economic trends and indicators (inflation, employment statistics, GDP, mortgages, ...), international economic institutions, utilities (electricity, heating, waste management, water supply). |
| conflict, war and peace | News about terrorism, wars, wars victims, cyber warfare, civil unrest (demonstrations, riots, rebellions), peace talks and other peace activities. |
| arts, culture, entertainment and media | News about cinema, dance, fashion, hairstyle, jewellery, festivals, literature, music, theatre, TV shows, painting, photography, woodworking, art exhibitions, libraries and museums, language, cultural heritage, news media, radio and television, social media, influencers, and disinformation. |
| labour | News about employment, employment legislation, employees and employers, commuting, parental leave, volunteering, wages, social security, labour market, retirement, unemployment, unions. |
| weather | News about weather forecasts, weather phenomena and weather warning. |
| religion | News about religions, cults, religious conflicts, relations between religion and government, churches, religious holidays and festivals, religious leaders and rituals, and religious texts. |
| society | News about social interactions (e.g., networking), demographic analyses, population census, discrimination, efforts for inclusion and equity, emigration and immigration, communities of people and minorities (LGBTQ, older people, children, indigenous people, etc.), homelessness, poverty, societal problems (addictions, bullying), ethical issues (suicide, euthanasia, sexual behavior) and social services and charity, relationships (dating, divorce, marriage), family (family planning, adoption, abortion, contraception, pregnancy, parenting). |
| health | News about diseases, injuries, mental health problems, health treatments, diets, vaccines, drugs, government health care, hospitals, medical staff, health insurance. |
| environment | News about climate change, energy saving, sustainability, pollution, population growth, natural resources, forests, mountains, bodies of water, ecosystem, animals, flowers and plants. |
| lifestyle and leisure | News about hobbies, clubs and societies, games, lottery, enthusiasm about food or drinks, car/motorcycle lovers, public holidays, leisure venues (amusement parks, cafes, bars, restaurants, etc.), exercise and fitness, outdoor recreational activities (e.g., fishing, hunting), travel and tourism, mental well-being, parties, maintaining and decorating house and garden. |
| science and technology | News about natural sciences and social sciences, mathematics, technology and engineering, scientific institutions, scientific research, scientific publications and innovation. |
| sport | News about sports that can be executed in competitions, e.g., basketball, football, swimming, athletics, chess, dog racing, diving, golf, gymnastics, martial arts, climbing, etc.; sport achievements, sport events, sport organisation, sport venues (stadiums, gymnasiums, ...), referees, coaches, sport clubs, drug use in sport. |
## Training data
The model was fine-tuned on a training dataset consisting of 15,000 news texts in four languages (Croatian, Slovenian, Catalan and Greek).
The news texts were extracted from the [MaCoCu web corpora](https://macocu.eu/) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
The training dataset was automatically annotated with the IPTC Media Topic labels by the GPT-4o model (with a prediction accuracy of 0.78 and a macro-F1 score of 0.72).
Label distribution in the training dataset:
| labels | count | proportion |
|:------------------------------------------|--------:|-------------:|
| sport | 2300 | 0.153333 |
| arts, culture, entertainment and media | 2117 | 0.141133 |
| politics | 2018 | 0.134533 |
| economy, business and finance | 1670 | 0.111333 |
| human interest | 1152 | 0.0768 |
| education | 990 | 0.066 |
| crime, law and justice | 884 | 0.0589333 |
| health | 675 | 0.045 |
| disaster, accident and emergency incident | 610 | 0.0406667 |
| society | 481 | 0.0320667 |
| environment | 472 | 0.0314667 |
| lifestyle and leisure | 346 | 0.0230667 |
| science and technology | 340 | 0.0226667 |
| conflict, war and peace | 311 | 0.0207333 |
| labour | 288 | 0.0192 |
| religion | 258 | 0.0172 |
| weather | 88 | 0.00586667 |
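The proportions in the table are each label's count divided by the total number of training texts. The sketch below recomputes them from the counts in the table and also derives the ratio between the most and least frequent label, which gives a quick sense of the class imbalance; the variable names are chosen here for illustration.

```python
label_counts = {
    "sport": 2300,
    "arts, culture, entertainment and media": 2117,
    "politics": 2018,
    "economy, business and finance": 1670,
    "human interest": 1152,
    "education": 990,
    "crime, law and justice": 884,
    "health": 675,
    "disaster, accident and emergency incident": 610,
    "society": 481,
    "environment": 472,
    "lifestyle and leisure": 346,
    "science and technology": 340,
    "conflict, war and peace": 311,
    "labour": 288,
    "religion": 258,
    "weather": 88,
}

total = sum(label_counts.values())  # 15000 training texts
proportions = {label: count / total for label, count in label_counts.items()}

# Ratio between the most and the least frequent label (sport vs. weather)
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, share in proportions.items():
    print(f"{label:45s} {share:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```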
## Performance
The model was evaluated on a manually annotated test set in four languages (Croatian, Slovenian, Catalan and Greek), consisting of 1,130 instances.
The test set contains equal amounts of text from the four languages and is roughly balanced across labels.
The model achieved an accuracy of 0.78 and a macro-F1 score of 0.72.
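For readers unfamiliar with the two reported metrics, the sketch below shows their standard definitions using only the standard library. It is not the evaluation script used for this model, and the gold and predicted labels are toy values. Accuracy is the share of correct predictions; macro-F1 is the unweighted mean of the per-label F1 scores, so rare labels count as much as frequent ones.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one 'sport' text is misclassified as 'weather'
gold = ["sport", "sport", "weather", "health"]
pred = ["sport", "weather", "weather", "health"]
print(accuracy(gold, pred))            # 0.75
print(round(macro_f1(gold, pred), 4))  # 0.7778
```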
### Fine-tuning hyperparameters
Fine-tuning was performed with `simpletransformers`.
A brief hyperparameter search was performed beforehand; the presumed optimal hyperparameters are:
```python
# Plain dict of arguments, passed as `args=model_args` to ClassificationModel
model_args = {
    "num_train_epochs": 5,
    "learning_rate": 8e-06,
    "train_batch_size": 32,
    "max_seq_length": 512,
}
```
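The original training script is not included here, so the following is only a sketch of how these hyperparameters might be used with `simpletransformers`. The function names `build_training_rows` and `fine_tune` are hypothetical, and starting from the `xlm-roberta-large` checkpoint with 17 labels is an assumption based on the model description. The heavy imports sit inside `fine_tune`, so the data preparation step can be run without a GPU or the training libraries.

```python
def build_training_rows(texts, label_names, label2id):
    """Pair each text with the integer id of its label."""
    return [[text, label2id[name]] for text, name in zip(texts, label_names)]


def fine_tune(rows, num_labels=17, use_cuda=True):
    """Fine-tune xlm-roberta-large on `rows` of [text, label_id] pairs."""
    # Imported lazily so the data preparation above works without these packages
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    model_args = {
        "num_train_epochs": 5,
        "learning_rate": 8e-06,
        "train_batch_size": 32,
        "max_seq_length": 512,
    }
    # simpletransformers expects the columns "text" and "labels"
    train_df = pd.DataFrame(rows, columns=["text", "labels"])
    model = ClassificationModel(
        "xlmroberta", "xlm-roberta-large",
        num_labels=num_labels, use_cuda=use_cuda, args=model_args,
    )
    model.train_model(train_df)
    return model


# Data preparation only; calling fine_tune(rows) would start the actual training
rows = build_training_rows(
    ["Text about a handball match ...", "Text about a heatwave ..."],
    ["sport", "weather"],
    {"sport": 3, "weather": 14},
)
```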
## Citation
A paper with details on the model is currently in preparation. If you use the model, please cite this repository:
```
@misc{iptc_model,
author={Kuzman, Taja and Ljube{\v{s}}i{\'c}, Nikola},
title = {Multilingual IPTC Media Topic Classifier},
year = 2024,
url = {https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier},
publisher = {Hugging Face}
}
```