---
library_name: transformers
tags: []
---

# 🎯 ClassiCC-PT Classifiers

## 📖 Overview

The ClassiCC-PT classifiers are three BERTimbau-based neural classifiers designed for Portuguese web documents, trained on GPT-4o–annotated data.
They were created to support content-based filtering in large-scale Portuguese corpora and are part of the ClassiCC-PT dataset pipeline.

**This repository contains the STEM classifier.**

The classifiers provide document-level scores (0–5) for:

- **Educational Content** (ClassiCC-PT-edu)
- **STEM Content** (ClassiCC-PT-STEM)
- **Toxic Content** (ClassiCC-PT-toxic)


## 🏗 Training Setup

- **Base model:** BERTimbau Base
- **Head:** linear regression layer
- **Objective:** predict the discrete scores (0–5) assigned by GPT-4o
- **Optimizer:** AdamW (lr = 3e-4)
- **Scheduler:** cosine decay with 5% warmup
- **Epochs:** 20
- **Training hardware:** A100 GPUs
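
The schedule above (cosine decay with 5% linear warmup) can be sketched as a plain function. This is a minimal illustration of the learning-rate shape, not the actual training code; in practice a helper such as `transformers.get_cosine_schedule_with_warmup` produces the same curve.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 3e-4,
               warmup_frac: float = 0.05) -> float:
    """Cosine-decay learning rate with linear warmup over the first 5% of steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(50, 1000))  # peak learning rate, reached at the end of warmup
```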


## 📊 Performance

All classifiers are evaluated both as regressors and in binary classification mode (score ≥ 3 → positive).

| Classifier        | Task                    | Test Size | Train Size | F1 (Binary) |
| ----------------- | ----------------------- | --------- | ---------- | ----------- |
| ClassiCC-PT-edu   | Educational Content     | 10k       | 110k       | **0.77**    |
| ClassiCC-PT-STEM  | STEM Content            | 12k       | 100k       | **0.76**    |
| ClassiCC-PT-toxic | Toxic/Offensive Content | 20k       | 180k       | **0.78**    |

For comparison, the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) (trained only on English data) achieved only 0.48 F1 on Portuguese data, highlighting the need for language-specific models.
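
The binary mode is just a threshold on the regression output; a minimal sketch of that conversion, using the 3.0 cutoff from the evaluation above:

```python
def to_binary(score: float, threshold: float = 3.0) -> int:
    # Scores of 3 or higher count as positive, matching the binary
    # evaluation mode used for the F1 numbers above.
    return int(score >= threshold)

scores = [0.4, 2.9, 3.0, 4.7]           # example regression outputs
labels = [to_binary(s) for s in scores]
print(labels)  # [0, 0, 1, 1]
```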


## 💡 Intended Use

These classifiers were built for pretraining corpus filtering but can also be used for:

- Dataset annotation for educational, STEM, or toxic content
- Research in Portuguese NLP content classification
- Filtering user-generated content in applications targeting Portuguese speakers


## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ClassiCC-Corpus/ClassiCC-PT-STEM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# The regression head outputs a single logit: the predicted score in [0, 5].
score = outputs.logits.squeeze(-1).item()
print(f"Score: {score:.2f}")
```

## 📜 Citation

If you use these classifiers, please cite:
```
coming soon
```