---
library_name: transformers
tags: []
---
# 🎯 ClassiCC-PT Classifiers
## 📖 Overview
The ClassiCC-PT classifiers are three BERTimbau-based neural classifiers designed for Portuguese web documents, trained on GPT-4o–annotated data.
They were created to support content-based filtering in large-scale Portuguese corpora and are part of the ClassiCC-PT dataset pipeline.
**This repository contains the STEM classifier.**
The classifiers provide document-level scores (0–5) for:
- Educational content (ClassiCC-PT-edu)
- STEM content (ClassiCC-PT-STEM)
- Toxic content (ClassiCC-PT-toxic)
## 🏗 Training Setup
- Base model: BERTimbau Base
- Head: linear regression layer
- Objective: predict the discrete scores (0–5) assigned by GPT-4o
- Optimizer: AdamW (lr = 3e-4)
- Scheduler: cosine decay with 5% warmup
- Epochs: 20
- Hardware: NVIDIA A100 GPUs
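The optimization recipe above (AdamW with cosine decay and 5% warmup) can be sketched in plain PyTorch. This is a minimal illustration, not the released training code; the `torch.nn.Linear` module stands in for BERTimbau plus its regression head, and `total_steps` is an assumed value.

```python
import math
import torch

# Stand-in for BERTimbau Base + linear regression head (not the real model).
model = torch.nn.Linear(768, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 1000                       # assumed; depends on dataset/batch size
warmup_steps = int(0.05 * total_steps)   # 5% linear warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay to ~0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In the training loop, `optimizer.step()` is followed by `scheduler.step()` once per batch.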
## 📊 Performance
All classifiers are evaluated both as regressors and in binary classification mode (score ≥ 3 → positive).
| Classifier | Task | Test Size | Train Size | F1 (Binary) |
| ----------------- | ----------------------- | --------- | ---------- | ----------- |
| ClassiCC-PT-edu | Educational Content | 10k | 110k | **0.77** |
| ClassiCC-PT-STEM | STEM Content | 12k | 100k | **0.76** |
| ClassiCC-PT-toxic | Toxic/Offensive Content | 20k | 180k | **0.78** |
For comparison, the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) (trained only on English data) achieved only 0.48 F1 on Portuguese data, highlighting the need for language-specific models.
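The binary evaluation mode above is a fixed threshold on the regression output. A minimal sketch (the `to_binary` helper name is illustrative, not part of the released code):

```python
# Maps the 0-5 regression score to the card's binary label
# (score >= 3 -> positive). Helper name is illustrative only.
def to_binary(score: float, threshold: float = 3.0) -> int:
    return int(score >= threshold)

print(to_binary(3.4), to_binary(1.2))
```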
## 💡 Intended Use
These classifiers were built for pretraining-corpus filtering, but can also be used for:
- Annotating datasets for educational, STEM, or toxic content
- Research on content classification in Portuguese NLP
- Filtering user-generated content in applications targeting Portuguese speakers
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ClassiCC-Corpus/ClassiCC-PT-STEM"  # this repository's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
score = outputs.logits.squeeze().item()  # single regression score in [0, 5]
print(f"Score: {score:.2f}")
```
## 📜 Citation
If you use these classifiers, please cite:
```
coming soon
```