---
language: en
license: mit
tags:
- gpt2
- causal-lm
- from-scratch
- tinystories
datasets:
- roneneldan/TinyStories
library_name: transformers
pipeline_tag: text-generation
---

# GPT-2-Style TinyStories Model (From Scratch)

## Overview

This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset, using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.

The goal was to build a small, educational, and easily reproducible transformer LM for story generation.

**This model is designed for:**

- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.

---

## Hardware & Environment

- **Platform**: Google Colab Pro+
- **GPU**: NVIDIA A100 (40 GB VRAM)
- **CPU RAM**: 83.5 GB
- **Disk**: 235.7 GB
- **Python**: 3.x (Colab default)
- **Frameworks**:
  - `transformers` (latest from pip)
  - `datasets`
  - `accelerate`
  - `huggingface_hub`

---

## Dataset

**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.

- **Language**: English
- **Cleanliness**: High — minimal preprocessing needed
- **Structure**: Each sample contains a single `text` field with a complete story

**Why this dataset?**

- High signal-to-noise ratio.
- Ideal for small models — the vocabulary is modest and sentence structures are simple.
- Useful for quick iteration and visible training convergence.

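
Loading the dataset is a one-liner with `datasets`. A minimal sketch, assuming the dataset's standard layout with a single `text` field per example (as used in the preprocessing code further down):

```python
from datasets import load_dataset

# Pull TinyStories from the Hugging Face Hub.
raw = load_dataset("roneneldan/TinyStories")

print(raw)                            # available splits and row counts
print(raw["train"][0]["text"][:200])  # peek at the start of the first story
```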

---

## Model Architecture

A small GPT-2–like causal language model:

| Hyperparameter | Value |
|-----------------|---------|
| Layers (n_layer) | 8 |
| Attention Heads (n_head) | 8 |
| Embedding Dim (n_embd) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (block_size) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | ForCausalLMLoss (auto-selected) |

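
For reference, a configuration with these hyperparameters can be written with the stock `GPT2Config`. This is an illustrative sketch only, not the exact config object behind the published checkpoint:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Approximate GPT-2 configuration matching the table above (illustrative).
config = GPT2Config(
    vocab_size=16_384,
    n_positions=512,   # block_size
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.0,   # dropout disabled
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # lands in the ~10–12M range
```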

---

## Training Setup

```python
TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 128,
    per_device_eval_batch_size = 128,
    gradient_accumulation_steps = 1,
    learning_rate = 3e-4,
    weight_decay = 0.1,
    warmup_ratio = 0.03,
    logging_steps = 50,
    save_steps = 500,
    save_total_limit = 3,
    bf16 = True,  # Mixed precision
    fp16 = False,
    evaluation_strategy = "steps",
    eval_steps = 500,
)
```

- **Optimizer**: AdamW (default in HF `Trainer`)
- **Data Loading**: tokenization and chunking with `datasets.map` at `block_size=512`
- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False` (see the sketch below)

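
A hedged sketch of how these pieces wire together. The variable names (`model`, `training_args`, `lm_datasets`) and the `validation` split name follow the other snippets in this card; this is not the exact training script:

```python
from transformers import DataCollatorForLanguageModeling, Trainer

# mlm=False configures plain causal-LM batches: labels mirror input_ids
# (padding, if any, masked to -100) and the model applies the usual
# one-token shift internally when computing the loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,                            # the GPT-2-style model
    args=training_args,                     # the TrainingArguments shown above
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=collator,
)

trainer.train()
```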

---

## Tokenization & Preprocessing

```python
from itertools import chain

# Tokenize the raw stories; no special tokens are added at this stage.
def tokenize_fn(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw['train'].column_names)

# Concatenate all token ids and cut them into fixed-length blocks of
# CFG.block_size (512) tokens; a leftover tail shorter than one block is dropped.
def group_texts(examples):
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [concatenated[i:i+CFG.block_size] for i in range(0, total_length, CFG.block_size)]
    }
    # For causal LM training the labels are the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
```
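
A quick sanity check on the chunking result, using the variable names from the snippet above:

```python
# Every example is now a fixed-length block of 512 token ids with matching labels.
example = lm_datasets["train"][0]
print(len(example["input_ids"]), len(example["labels"]))  # 512 512
```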

---

## Tokens

- **Number of sequences in train set**: 899,394
- **Tokens per step**: 65,536 (128 sequences × 512 tokens)
- **Steps per epoch**: 7,026 full batches
- **Total steps**: 21,078 (full batches over 3 epochs)
- **Total tokens processed**: 1,381,367,808

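
These figures can be re-derived from the batch size and block size; the short check below also explains the 21,081 steps reported by the trainer in the next section (each epoch ends with one partial batch):

```python
import math

num_sequences = 899_394
batch_size = 128
block_size = 512
epochs = 3

print(batch_size * block_size)                                           # 65536 tokens per step
print((num_sequences // batch_size) * epochs)                            # 21078 full-batch steps
print(math.ceil(num_sequences / batch_size) * epochs)                    # 21081 steps incl. partial batches
print((num_sequences // batch_size) * epochs * batch_size * block_size)  # 1381367808 tokens
```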

---

## Training Run & Metrics

- **Total steps**: 21,081 (21,078 full batches plus one partial batch per epoch)
- **Total FLOPs**: 5.24 × 10^16
- **Runtime**: ~1h 44m on an A100 (Colab)
- **Final Train Loss**: 1.8054 (the Trainer’s average over the whole run; the last logged step loss was ≈ 1.51)

Loss curve snapshot (selected steps):

```yaml
Step     Loss
50       9.2160
100      8.2987
500      3.6695
1000     2.6862
5000     1.7699
10000    1.6385
15000    1.5620
21000    1.5140
```

**Interpretation**:
The rapid drop in loss over the early steps indicates effective learning.
A final logged loss of ≈ 1.51 suggests the model has learned coherent structure and vocabulary use for TinyStories-style text.
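
For intuition, cross-entropy loss converts to perplexity via exp(loss); a quick check on the final logged value:

```python
import math

final_logged_loss = 1.514
print(math.exp(final_logged_loss))  # ≈ 4.5, i.e. roughly choosing among 4–5 plausible tokens per step
```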

---

## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Lessons & Recommendations for Newcomers

- **Start Small** — Begin with a small dataset and a small model. You’ll see results quickly without burning GPU time.
- **Mixed Precision (bf16/fp16)** — Saves VRAM and speeds up training.
- **Clean Data** — High-quality datasets like TinyStories make it easier to reach good results.
- **Checkpoints** — Save regularly (`save_steps`) in case Colab disconnects.
- **Colab Session Stability** — Keep the browser tab active and use a stable internet connection.
- **Publishing Early** — Push checkpoints to Hugging Face to avoid accidental data loss (see the sketch below).
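
A hedged sketch of pushing checkpoints to the Hub during training. The repository name is illustrative, and it assumes you log in with a write-scoped token first:

```python
from huggingface_hub import login
from transformers import TrainingArguments

login()  # paste a write-scoped access token when prompted

# With push_to_hub=True, Trainer uploads checkpoints as it saves them,
# so a Colab disconnect does not wipe out your progress.
training_args = TrainingArguments(
    output_dir="gpt2-tinystories-from-scratch",                  # illustrative name
    push_to_hub=True,
    hub_model_id="your-username/gpt2-tinystories-from-scratch",  # illustrative repo id
    save_steps=500,
    save_total_limit=3,
)

# ...build the Trainer with these args as shown earlier, then at the end:
# trainer.push_to_hub()  # uploads the final model and a model card stub
```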

---

## Limitations

- Short context length (512 tokens).
- Limited generalization beyond TinyStories style/content.
- Not suitable for factual QA or long-context reasoning.