---
library_name: transformers
tags: []
---
# 🐦 Curió 1.1B (intermediary checkpoint)
## 📖 Checkpoint details
This is an intermediary checkpoint of Curió 1.1B. It was initialized from [TinyLlama 1T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T) and trained on 100B tokens from ClassiCC-PT.
The final Curió 1.1B model is available [here](https://huggingface.co/ClassiCC-Corpus/Curio-1.1b).
The ClassiCC-PT corpus is available [here](https://huggingface.co/datasets/ClassiCC-Corpus/ClassiCC-PT).
## 📖 Overview
Curió 1.1B is a Portuguese-adapted language model created by continuing the pretraining of TinyLlama 1.1B (originally trained on 1 trillion English tokens) on 150B Portuguese tokens from the ClassiCC-PT corpus.
This model was designed to explore the impact of language-specific corpora on adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch.
## 🏗 Training Setup
- Base model: TinyLlama 1.1B (LLaMA-2 architecture)
- Parameters: 1.1B
- Continued pretraining tokens: 150B (ClassiCC-PT)
- Sequence length: 4096 tokens (with packing)
- Hardware: TPU v2-128 (thanks to the Google TRC program)
- Framework: T5X
## 📊 Evaluation
Evaluated on the Poeta benchmark — 14 diverse Portuguese tasks (RTE, STS, MCQ exams, sentiment analysis, QA, etc.) — using the Normalized Preferred Metric (NPM).
| Model | Training Regimen | Poeta v2 NPM |
| --------------------------- | ----------------------------------------------- | ------------ |
| TinyLlama 1T (EN) | – | 17.4 |
| TinyLlama 2T (EN) | +1T EN continued pretraining | 20.9 |
| TinyLlama + mC4-PT | +150B PT (mC4-PT) continued pretraining | ~20 |
| TinyLlama + ClueWeb22-PT | +150B PT (ClueWeb22-PT) continued pretraining | ~27 |
| **Curió 1.1B** | +150B PT (ClassiCC-PT) continued pretraining | **27.1** |
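
As a rough reference, NPM rescales each task's score so that the random-guessing baseline maps to 0 and a perfect score to 100, then averages across tasks. The helper below is an illustrative sketch under that assumption, not code from the Poeta benchmark:

```
def poeta_npm(task_scores, random_baselines):
    # Hypothetical helper: rescale each task score so that the random-chance
    # baseline maps to 0 and a perfect score maps to 100, then average.
    normalized = [
        100.0 * (score - baseline) / (100.0 - baseline)
        for score, baseline in zip(task_scores, random_baselines)
    ]
    return sum(normalized) / len(normalized)
```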
## 📥 Usage
Please note that **Curió 1.1B has not been trained to be used as a chat model**; it is a base model intended for text completion and further fine-tuning.
```
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub
model_name = "ClassiCC-Corpus/Curio-1.1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
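
For example, a simple text-completion call, continuing from the snippet above (the Portuguese prompt and decoding settings are illustrative, not from the original card):

```
# Curió is a base model: prompt it with text to continue, not with chat instructions.
prompt = "O carnaval é uma festa popular que"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```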
## 📜 Citation
If you use Curió 1.1B, please cite:
```
Coming soon
```