---
library_name: transformers
tags: []
---


# 🐦 Curió 1.1B (intermediate checkpoint)

## 📖 Checkpoint details

This is an intermediate checkpoint of Curió 1.1B. It started from [TinyLlama 1T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T) and was trained on 100B tokens from ClassiCC-PT.

The final Curió 1.1B model is available [here](https://huggingface.co/ClassiCC-Corpus/Curio-1.1b).

The ClassiCC-PT corpus is available [here](https://huggingface.co/datasets/ClassiCC-Corpus/ClassiCC-PT).

## 📖 Overview

Curió 1.1B is a Portuguese-adapted language model. It was created by taking TinyLlama 1.1B (1T), originally trained on 1 trillion English tokens, and continuing its pretraining on 150B Portuguese tokens from the ClassiCC-PT corpus.

This model was designed to explore the impact of language-specific corpora on adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch.


## 🏗 Training Setup

- Base model: TinyLlama 1.1B (LLaMA-2 architecture)

- Parameters: 1.1B

- Continued pretraining tokens: 150B (ClassiCC-PT)

- Sequence length: 4096 tokens (with packing; see the sketch after this list)

- Hardware: TPU v2-128 (thanks to Google TRC program)

- Frameworks: T5X
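
The run itself used T5X on TPUs. Purely as an illustration of the recipe above (continued causal-LM pretraining on packed 4096-token sequences), a rough PyTorch/transformers sketch could look like the following; the `text` field name, batch size, and learning rate are placeholders rather than the values used for Curió.

```python
# Illustrative sketch only -- the actual run used T5X on TPU v2-128.
import torch
from torch.utils.data import DataLoader, IterableDataset
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

SEQ_LEN = 4096
BASE = "TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE).cuda().train()

class PackedStream(IterableDataset):
    """Concatenate tokenized documents and emit fixed-length 4096-token blocks."""
    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        buffer = []
        for doc in self.documents:
            # "text" is an assumed field name for the raw documents.
            buffer.extend(tokenizer(doc["text"]).input_ids + [tokenizer.eos_token_id])
            while len(buffer) >= SEQ_LEN:
                chunk, buffer = buffer[:SEQ_LEN], buffer[SEQ_LEN:]
                yield torch.tensor(chunk)

corpus = load_dataset("ClassiCC-Corpus/ClassiCC-PT", split="train", streaming=True)
loader = DataLoader(PackedStream(corpus), batch_size=8)      # placeholder batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # placeholder learning rate

for input_ids in loader:
    input_ids = input_ids.cuda()
    # Standard causal-LM objective: labels are the inputs (shifted inside the model).
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```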


## 📊 Evaluation

Curió was evaluated on the Poeta benchmark, a suite of 14 diverse Portuguese tasks (RTE, STS, multiple-choice exams, sentiment analysis, QA, etc.), using the Normalized Preferred Metric (NPM).


| Model                       | Training Regimen                               | Poeta v2 NPM |
| --------------------------- | ---------------------------------------------- | ------------ |
| TinyLlama 1T (EN)           | –                                              | 17.4         |
| TinyLlama 2T (EN)           | +1T EN continued pretraining                   | 20.9         |
| Training with mC4-PT        | +150B PT (mC4-PT) continued pretraining        | ~20          |
| Training with ClueWeb-22-PT | +150B PT (ClueWeb-22-PT) continued pretraining | ~27          |
| **Curió 1.1B**              | +150B PT (ClassiCC-PT) continued pretraining   | **27.1**     |
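
For reference, the sketch below assumes the usual reading of NPM: each task's preferred metric is rescaled so that random-chance performance maps to 0 and a perfect score to 100, and the rescaled scores are averaged across tasks. The numbers are made up.

```python
def normalized_preferred_metric(task_scores, random_baselines, max_score=100.0):
    """Assumed NPM definition: rescale each task so that random chance -> 0
    and a perfect score -> 100, then average across tasks."""
    normalized = [
        100.0 * (score - baseline) / (max_score - baseline)
        for score, baseline in zip(task_scores, random_baselines)
    ]
    return sum(normalized) / len(normalized)

# Made-up scores for three hypothetical tasks (accuracy-style preferred metrics).
print(normalized_preferred_metric(
    task_scores=[55.0, 70.0, 40.0],
    random_baselines=[50.0, 25.0, 33.3],
))  # -> roughly 26.7
```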




## 📥 Usage

Please note that **Curió 1.1B has not been trained to be used as a chat model**; it is a base model for plain text completion.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ClassiCC-Corpus/Curio-1.1B"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
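
Continuing from the snippet above, a minimal text-completion example (the prompt and sampling parameters are arbitrary placeholders):

```python
import torch

prompt = "A capital do Brasil é"   # placeholder Portuguese prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```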

## 📜 Citation

If you use Curió 1.1B, please cite:
```
Coming soon
```