---
library_name: transformers
tags: []
---

# 🎯 ClassiCC-PT Classifiers

## 📖 Overview

The ClassiCC-PT classifiers are three BERTimbau-based neural classifiers designed for Portuguese web documents, trained on GPT-4o–annotated data.
They were created to support content-based filtering in large-scale Portuguese corpora and are part of the ClassiCC-PT dataset pipeline.

**This repository contains the Educational classifier.**

The classifiers provide document-level scores (0–5) for:

- Educational Content (ClassiCC-PT-edu)
- STEM Content (ClassiCC-PT-STEM)
- Toxic Content (ClassiCC-PT-toxic)

## 🏗 Training Setup

- Base model: BERTimbau Base
- Head: Linear regression layer
- Objective: Predict discrete scores (0–5) assigned by GPT-4o
- Optimizer: AdamW (lr = 3e-4)
- Scheduler: Cosine decay with 5% warmup
- Epochs: 20
- Training hardware: A100 GPUs
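
As a rough illustration of the scheduler listed above, the sketch below computes the learning rate at a given step for a cosine schedule with 5% warmup and a peak of 3e-4. The linear warmup shape, the decay to zero, and the function name `lr_at_step` are assumptions: the setup only states "cosine decay with 5% warmup", so the actual training run may differ in these details.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 3e-4,
               warmup_frac: float = 0.05) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr (end of warmup) down to 0 (at total_steps).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 25, 50, 525, 1000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

This has the same general shape as `get_cosine_schedule_with_warmup` in `transformers`, which may or may not be what was used here.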

## 📊 Performance

All classifiers are evaluated both as regressors and in binary classification mode (score ≥ 3 → positive).

| Classifier        | Task                    | Test Size | Train Size | F1 (Binary) |
| ----------------- | ----------------------- | --------- | ---------- | ----------- |
| ClassiCC-PT-edu   | Educational Content     | 10k       | 110k       | **0.77**    |
| ClassiCC-PT-STEM  | STEM Content            | 12k       | 100k       | **0.76**    |
| ClassiCC-PT-toxic | Toxic/Offensive Content | 20k       | 180k       | **0.78**    |

For comparison, the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) (trained on English data only) achieved just 0.48 F1 on Portuguese data, highlighting the need for language-specific models.
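
The binary evaluation mode described above (score ≥ 3 → positive) can be sketched in plain Python. This is a minimal illustration, not the evaluation script behind the table: the helper names are hypothetical, the scores are made up, and it assumes that F1 is computed for the positive class and that predictions are thresholded without rounding first.

```python
from typing import Sequence


def binarize(scores: Sequence[float], threshold: float = 3.0) -> list[int]:
    """Map 0-5 scores to binary labels: 1 if score >= threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up scores for illustration only (not taken from the evaluation set).
gold_scores = [0, 1, 3, 4, 5, 2]
predicted_scores = [0.4, 2.9, 3.2, 2.7, 4.8, 3.1]

f1 = binary_f1(binarize(gold_scores), binarize(predicted_scores))
print(f"Binary F1: {f1:.2f}")
```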

## 💡 Intended Use

These classifiers were built for pretraining corpus filtering but can also be used for:

- Dataset annotation for educational/STEM/toxic content
- Research in Portuguese NLP content classification
- Filtering user-generated content in applications targeting Portuguese speakers
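
For the corpus-filtering use case, one possible shape for a filtering loop is sketched below. It is deliberately model-agnostic: `filter_corpus`, `score_batch`, and `dummy_scorer` are hypothetical names, the threshold of 3 is borrowed from the binary evaluation mode rather than being a recommended setting, and the stand-in scorer exists only so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, Iterator, List


def _keep(batch: List[str], score_batch: Callable[[List[str]], List[float]],
          threshold: float) -> Iterator[str]:
    """Score one batch and yield the documents at or above the threshold."""
    scores = score_batch(batch)
    return (doc for doc, s in zip(batch, scores) if s >= threshold)


def filter_corpus(docs: Iterable[str],
                  score_batch: Callable[[List[str]], List[float]],
                  threshold: float = 3.0,
                  batch_size: int = 32) -> Iterator[str]:
    """Yield documents whose score is at least `threshold`, scoring in batches."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield from _keep(batch, score_batch, threshold)
            batch = []
    if batch:  # flush the final partial batch
        yield from _keep(batch, score_batch, threshold)


# Stand-in scorer for demonstration only: longer texts get higher scores.
def dummy_scorer(texts: List[str]) -> List[float]:
    return [min(5.0, len(t) / 10) for t in texts]


corpus = [
    "oi",
    "Um texto curto.",
    "Uma explicação detalhada sobre fotossíntese e energia química.",
]
kept = list(filter_corpus(corpus, dummy_scorer, threshold=3.0, batch_size=2))
print(kept)
```

In practice `score_batch` would wrap the tokenizer and model. For the toxic classifier the comparison would presumably be inverted, keeping documents below the threshold.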

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ClassiCC-Corpus/ClassiCC-PT-edu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Gradients are not needed at inference time.
with torch.no_grad():
    outputs = model(**inputs)

# The regression head emits one value per document; .item() converts it to a Python float.
score = outputs.logits.squeeze(-1).float().item()
print(f"Score: {score:.2f}")
```
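
Because the head is a regression layer, the raw output is a continuous value that can fall slightly outside the 0–5 range. If a discrete score is needed, one common convention is to clip and round, sketched below. The helper name `to_int_score` is hypothetical, and this post-processing step is an assumption rather than something the model card specifies.

```python
def to_int_score(raw_score: float, low: int = 0, high: int = 5) -> int:
    """Clip a raw regression output to [low, high] and round to the nearest integer."""
    clipped = max(low, min(raw_score, high))
    return int(round(clipped))


for raw in (-0.3, 2.4, 3.6, 5.7):
    print(raw, "->", to_int_score(raw))
```

Note that Python's `round` rounds halves to the nearest even integer, so a raw score of exactly 2.5 maps to 2.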

## 📜 Citation

If you use these classifiers, please cite:

```
coming soon
```