---
library_name: transformers
tags: []
---
# 🎯 ClassiCC-PT Classifiers
## 📖 Overview
The ClassiCC-PT classifiers are three BERTimbau-based neural classifiers designed for Portuguese web documents, trained on GPT-4o–annotated data.
They were created to support content-based filtering in large-scale Portuguese corpora and are part of the ClassiCC-PT dataset pipeline.
**This repository contains the Educational classifier.**
The classifiers provide document-level scores (0–5) for:
- Educational Content (ClassiCC-PT-edu)
- STEM Content (ClassiCC-PT-STEM)
- Toxic Content (ClassiCC-PT-toxic)
## 🏗 Training Setup
- Base model: BERTimbau Base
- Head: Linear regression layer
- Objective: Predict discrete scores (0–5) assigned by GPT-4o
- Optimizer: AdamW (lr = 3e-4)
- Scheduler: Cosine decay with 5% warmup
- Epochs: 20
- Training hardware: A100 GPUs
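As a rough illustration of the schedule listed above, the sketch below computes the learning rate at a given step for cosine decay with 5% warmup and a peak of 3e-4. The card does not say whether the warmup is linear or whether the decay ends at zero, so both are assumptions here, and `lr_at_step` and `total_steps` are hypothetical names used only for this example.

```python
import math


def lr_at_step(step: int, total_steps: int,
               base_lr: float = 3e-4, warmup_frac: float = 0.05) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumptions (not stated in the card): warmup is linear from 0,
    and the cosine decay ends at 0.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example with a hypothetical run of 1000 optimizer steps
schedule = [lr_at_step(s, total_steps=1000) for s in (0, 25, 50, 525, 1000)]
print(schedule)
```

With 1000 total steps, the rate peaks at step 50 (the end of the 5% warmup) and falls to half the peak midway through the decay phase.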
## 📊 Performance
All classifiers are evaluated both as regressors and in binary classification mode (score ≥ 3 → positive).
| Classifier | Task | Test Size | Train Size | F1 (Binary) |
| ----------------- | ----------------------- | --------- | ---------- | ----------- |
| ClassiCC-PT-edu | Educational Content | 10k | 110k | **0.77** |
| ClassiCC-PT-STEM | STEM Content | 12k | 100k | **0.76** |
| ClassiCC-PT-toxic | Toxic/Offensive Content | 20k | 180k | **0.78** |
For comparison, the [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), which was trained only on English data, reached an F1 of just 0.48 on Portuguese data, highlighting the need for language-specific models.
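The binary evaluation mode described above (score ≥ 3 → positive) can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation script used for the table: the helper names and the toy scores are made up, and it assumes the threshold is applied to the raw regression output without rounding.

```python
def binarize(scores, threshold=3.0):
    """Map 0-5 scores to booleans: True when score >= threshold."""
    return [s >= threshold for s in scores]


def f1_binary(y_true, y_pred):
    """F1 for the positive class, given two lists of booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): annotated labels and regressor outputs
gold_scores = [0, 1, 3, 4, 5, 2]
pred_scores = [0.4, 2.9, 3.2, 2.5, 4.8, 3.1]

f1 = f1_binary(binarize(gold_scores), binarize(pred_scores))
print(f"F1 (binary): {f1:.2f}")
```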
## 💡 Intended Use
These classifiers were built for pretraining corpus filtering but can also be used for:
- Dataset annotation for educational/STEM/toxic content
- Research in Portuguese NLP content classification
- Filtering user-generated content in applications targeting Portuguese speakers
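For the corpus-filtering use case, one possible pattern is to wrap the classifier in a scoring function and keep only documents at or above a cutoff. The sketch below is an assumption-laden illustration: `filter_by_score` and `toy_score` are hypothetical names, the cutoff of 3 mirrors the binary threshold used in the evaluation, and `toy_score` is a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Iterable, Iterator


def filter_by_score(docs: Iterable[str],
                    score_fn: Callable[[str], float],
                    min_score: float = 3.0) -> Iterator[str]:
    """Yield only the documents whose score is at least `min_score`."""
    for doc in docs:
        if score_fn(doc) >= min_score:
            yield doc


def toy_score(text: str) -> float:
    """Stand-in scorer for illustration only; a real pipeline would call the model."""
    return 4.0 if "fotossíntese" in text.lower() else 1.0


docs = [
    "A fotossíntese converte luz em energia química.",
    "Compre agora com desconto!",
]
kept = list(filter_by_score(docs, toy_score))
print(kept)
```

For the toxic classifier the comparison would presumably be inverted, keeping documents whose score falls below the cutoff.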
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ClassiCC-Corpus/ClassiCC-PT-edu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "A fotossíntese é o processo pelo qual as plantas convertem energia luminosa em energia química."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Gradients are not needed for scoring
with torch.no_grad():
    outputs = model(**inputs)

# The regression head returns one value per document; .item() turns it into a Python float
score = outputs.logits.squeeze(-1).float().item()
print(f"Score: {score:.2f}")
```
## 📜 Citation
If you use these classifiers, please cite:
```
coming soon
```