|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
|
|
# 🐦 Curió 1.1B (intermediate checkpoint)
|
|
|
## 📖 Checkpoint details |
|
|
|
This is an intermediate checkpoint of Curió 1.1B. Training started from [TinyLlama 1T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T), and this checkpoint was taken after 100B tokens of continued pretraining on ClassiCC-PT.
|
|
|
The final Curió 1.1B model is available [here](https://huggingface.co/ClassiCC-Corpus/Curio-1.1b).
|
|
|
The ClassiCC-PT corpus is available [here](https://huggingface.co/datasets/ClassiCC-Corpus/ClassiCC-PT).
|
|
|
## 📖 Overview |
|
|
|
Curió 1.1B is a Portuguese-adapted language model created by continuing the pretraining of TinyLlama 1.1B (the 1T checkpoint, originally trained on 1 trillion English tokens) on 150B Portuguese tokens from the ClassiCC-PT corpus.
|
|
|
This model was designed to explore the impact of language-specific corpora on adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch. |
|
|
|
|
|
## 🏗 Training Setup |
|
|
|
- Base model: TinyLlama 1.1B (LLaMA-2 architecture) |
|
|
|
- Parameters: 1.1B |
|
|
|
- Continued pretraining tokens: 150B (ClassiCC-PT) |
|
|
|
- Sequence length: 4096 tokens (with sequence packing; see the illustrative sketch after this list)
|
|
|
- Hardware: TPU v2-128 (thanks to Google TRC program) |
|
|
|
- Framework: T5X
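
For illustration only, below is a minimal sketch of what packing documents into fixed 4096-token sequences could look like in plain Python. The actual training run used T5X on TPUs; the `pack_documents` helper is hypothetical and not part of the training code.

```python
from transformers import AutoTokenizer

# Tokenizer of the base model used as the starting checkpoint.
tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T"
)

SEQ_LEN = 4096  # training sequence length reported above


def pack_documents(texts):
    """Concatenate tokenized documents and cut them into fixed-length blocks."""
    ids = []
    for text in texts:
        ids.extend(tokenizer(text).input_ids)
        ids.append(tokenizer.eos_token_id)  # separate documents with EOS
    # Any trailing remainder that does not fill a full block is dropped.
    n_blocks = len(ids) // SEQ_LEN
    return [ids[i * SEQ_LEN : (i + 1) * SEQ_LEN] for i in range(n_blocks)]
```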
|
|
|
|
|
## 📊 Evaluation |
|
|
|
Evaluated on the Poeta benchmark, a suite of 14 diverse Portuguese tasks (RTE, STS, multiple-choice exams, sentiment analysis, QA, etc.), using the Normalized Preferred Metric (NPM).
|
|
|
|
|
| Model | Training Regimen | Poeta v2 NPM |
| ----------------- | ---------------------------------------------- | ------------ |
| TinyLlama 1T (EN) | – | 17.4 |
| TinyLlama 2T (EN) | +1T EN continued pretraining | 20.9 |
| TinyLlama 1T + mC4-PT | +150B PT (mC4-PT) continued pretraining | ~20 |
| TinyLlama 1T + ClueWeb22-PT | +150B PT (ClueWeb22-PT) continued pretraining | ~27 |
| **Curió 1.1B** | +150B PT (ClassiCC-PT) continued pretraining | **27.1** |
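
As a rough reference, an NPM of this kind is typically computed by rescaling each task's score between its random-guessing baseline and a perfect score, then macro-averaging across tasks. The snippet below is a sketch under that assumption and may not match the exact Poeta NPM definition; the `npm` function and its arguments are illustrative.

```python
def npm(task_scores, random_baselines):
    """Sketch of a Normalized Preferred Metric.

    Rescale each task score between its random-guessing baseline and 100,
    then macro-average across tasks. Both arguments map task name -> percent.
    """
    normalized = [
        100.0 * (task_scores[t] - random_baselines[t]) / (100.0 - random_baselines[t])
        for t in task_scores
    ]
    return sum(normalized) / len(normalized)
```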
|
|
|
|
|
|
|
|
|
## 📥 Usage |
|
|
|
Please note that **Curió 1.1B has not been trained as a chat model**; it is a base (completion) model.
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ClassiCC-Corpus/Curio-1.1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
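
A minimal completion example following the snippet above; the prompt and generation settings are illustrative only, not tuned recommendations.

```python
import torch

# Greedy completion of a Portuguese prompt (illustrative prompt and settings).
inputs = tokenizer("A capital do Brasil é", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```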
|
|
|
## 📜 Citation |
|
|
|
If you use Curió 1.1B, please cite: |
|
```
Coming soon
```
|
|
|
|
|
|
|
|