---
library_name: transformers
tags: []
---
# 🐦 Curió 1.1B (intermediate checkpoint)
## 📖 Checkpoint details
This is an intermediate checkpoint of Curió 1.1B. Training started from [TinyLlama 1T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T) and continued for 100B tokens of ClassiCC-PT data.
The final Curió 1.1B model is available [here](https://huggingface.co/ClassiCC-Corpus/Curio-1.1b).
The ClassiCC corpus is available [here](https://huggingface.co/datasets/ClassiCC-Corpus/ClassiCC-PT).
## 📖 Overview
Curió 1.1B is a Portuguese-adapted language model created via continued pretraining of TinyLlama 1.1B (1T checkpoint), a model originally trained on 1 trillion English tokens, using 150B Portuguese tokens from the ClassiCC-PT corpus.
This model was designed to explore the impact of language-specific corpora on adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch.
## 🏗 Training Setup
- Base model: TinyLlama 1.1B (LLaMA-2 architecture)
- Parameters: 1.1B
- Continued pretraining tokens: 150B (ClassiCC-PT)
- Sequence length: 4096 tokens (with packing; see the sketch after this list)
- Hardware: TPU v2-128 (thanks to Google TRC program)
- Framework: T5X
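
For illustration only, a minimal sketch of the sequence-packing idea referenced above: tokenized documents are concatenated with an EOS separator and sliced into fixed-length 4096-token blocks. This is an assumption about the general technique, not the actual T5X data pipeline used for training.

```python
# Illustrative sequence packing: concatenate tokenized documents (EOS-separated)
# and slice into fixed-length blocks. This is a sketch of the general technique,
# NOT the actual T5X pipeline used to train Curió.
from typing import Iterable


def pack_sequences(token_streams: Iterable[list[int]], eos_id: int, seq_len: int = 4096) -> list[list[int]]:
    buffer: list[int] = []
    blocks: list[list[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= seq_len:
            blocks.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return blocks  # any trailing partial block is dropped in this sketch
```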
## 📊 Evaluation
Evaluated on the Poeta benchmark — 14 diverse Portuguese tasks (RTE, STS, MCQ exams, sentiment analysis, QA, etc.) — using the Normalized Preferred Metric (NPM).
| Model | Training Regimen | Poeta v2 NPM |
| ----------------- | -------------------------------------------- | ------------ |
| TinyLlama 1T (EN) | – | 17.4 |
| TinyLlama 2T (EN) | +1T EN continued pretraining | 20.9 |
| TinyLlama + mC4-PT | +150B PT (mC4-PT) continued pretraining | ~20 |
| TinyLlama + ClueWeb22-PT | +150B PT (ClueWeb22-PT) continued pretraining | ~27 |
| **Curió 1.1B** | +150B PT (ClassiCC-PT) continued pretraining | **27.1** |
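
For reference, NPM-style aggregation typically rescales each task's score so that a random-chance baseline maps to 0 and a perfect score to 100, then averages across tasks. Below is a minimal sketch of that normalization; the baseline values are placeholders, not the actual Poeta per-task baselines.

```python
# Sketch of a Normalized Preferred Metric (NPM) style aggregation.
# Each task score is rescaled so random-chance performance maps to 0 and a
# perfect score to 100, then the rescaled scores are averaged over tasks.
# Baseline values passed in are placeholders, not the official Poeta ones.
def npm(task_scores: dict[str, float], random_baselines: dict[str, float]) -> float:
    normalized = [
        100.0 * (score - random_baselines[task]) / (100.0 - random_baselines[task])
        for task, score in task_scores.items()
    ]
    return sum(normalized) / len(normalized)
```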
## 📥 Usage
Please note that **Curió 1.1B has not been trained to be used as a chat model**.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ClassiCC-Corpus/Curio-1.1B"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
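
A minimal completion example follows; the prompt and generation settings are illustrative assumptions, not values prescribed for this model.

```python
# Illustrative text-completion usage; prompt and generation parameters are
# assumptions chosen for demonstration only.
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```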
## 📜 Citation
If you use Curió 1.1B, please cite:
```
Coming soon
```