GPT-2-Style TinyStories Model (From Scratch)
Overview
This repository contains a GPT-2–style language model trained from scratch on the roneneldan/TinyStories dataset using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.
The goal was to build a small, educational, and easily reproducible transformer LM for story generation.
This model is designed for:
- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.
Hardware & Environment
- Platform: Google Colab Pro+
- GPU: NVIDIA A100 (40 GB VRAM)
- CPU RAM: 83.5 GB
- Disk: 235.7 GB
- Python: 3.x (Colab default)
- Frameworks (latest from pip):
  - transformers
  - datasets
  - accelerate
  - huggingface_hub
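The stack can be installed in a fresh Colab runtime with a single cell; versions were not pinned for the original run, so the latest pip releases are assumed:

```python
# Colab cell: install the libraries listed above (unpinned, as in the original run)
!pip install -q transformers datasets accelerate huggingface_hub
```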
Dataset
Dataset: roneneldan/TinyStories — a curated synthetic dataset of short children’s stories.
- Language: English
- Cleanliness: High — minimal preprocessing needed
- Structure: Each sample contains a single text field with a complete story
Why this dataset?
- High signal-to-noise ratio.
- Ideal for small models — vocabulary is modest, sentence structures are simple.
- Useful for quick iterations and visible training convergence.
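As a quick check of the structure described above, the dataset can be loaded and inspected directly (a minimal sketch; split and field names follow the description above):

```python
from datasets import load_dataset

# Load TinyStories; each sample exposes a single "text" field containing a complete story.
raw = load_dataset("roneneldan/TinyStories")

print(raw)                        # available splits and their sizes
print(raw["train"][0]["text"])    # one short story
```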
Model Architecture
A small GPT-2–like causal language model:
| Hyperparameter | Value |
|---|---|
| Layers (n_layer) | 8 |
| Attention Heads (n_head) | 8 |
| Embedding Dim (n_embd) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (block_size) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | ForCausalLMLoss (auto-selected) |
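The table maps naturally onto Hugging Face’s GPT2Config; the sketch below is an assumed reconstruction (the exact config used in training may differ in minor details such as activation or initializer settings):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Assumed mapping of the hyperparameter table onto GPT2Config fields.
config = GPT2Config(
    vocab_size=16_384,     # Vocabulary Size
    n_positions=512,       # Sequence Length (block_size)
    n_embd=256,            # Embedding Dim
    n_layer=8,             # Layers
    n_head=8,              # Attention Heads
    resid_pdrop=0.0,       # Dropout disabled everywhere
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```

With tied input/output embeddings this configuration comes out to roughly 10.6M parameters, consistent with the ~10–12M figure above.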
Training Setup
```python
TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    weight_decay=0.1,
    warmup_ratio=0.03,
    logging_steps=50,
    save_steps=500,
    save_total_limit=3,
    bf16=True,   # Mixed precision
    fp16=False,
    evaluation_strategy="steps",
    eval_steps=500,
)
```
- Optimizer: AdamW (default in HF Trainer)
- Data Loading: `datasets` streaming & tokenization with `block_size=512`
- Collator: `DataCollatorForLanguageModeling` with `mlm=False`
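Putting these pieces together, the training loop is the standard Trainer wiring sketched below (variable names are assumptions: `model` and `training_args` are the objects defined above, and `lm_datasets` is the grouped dataset built in the next section):

```python
from transformers import DataCollatorForLanguageModeling, Trainer

# Causal-LM collator: with mlm=False the labels stay copies of the inputs,
# and the model shifts them internally to compute next-token loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,                             # GPT-2-style model described above
    args=training_args,                      # the TrainingArguments shown above
    train_dataset=lm_datasets["train"],      # 512-token blocks (see next section)
    eval_dataset=lm_datasets["validation"],  # TinyStories also ships a validation split
    data_collator=collator,
)
trainer.train()
```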
Tokenization & Preprocessing
```python
from itertools import chain

# `raw` is the TinyStories DatasetDict and `tokenizer` the model's tokenizer.
def tokenize_fn(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

def group_texts(examples):
    # Concatenate all tokenized stories, then cut into fixed-length blocks of CFG.block_size (512).
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [concatenated[i:i + CFG.block_size] for i in range(0, total_length, CFG.block_size)]
    }
    # For causal LM training, labels are simply a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
```
Tokens
- Number of sequences in train set: 899,394
- Tokens per step: 65,536
- Steps per epoch: 7,026
- Total steps: 21,078
- Total tokens processed: 1,381,367,808
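These figures follow directly from the batch and sequence settings, as the quick check below shows. (The Trainer itself reports 21,081 steps in the next section, most likely because the final partial batch of each epoch adds one extra step.)

```python
# Sanity-check the reported counts from the settings above.
batch_size, block_size, epochs = 128, 512, 3
train_sequences = 899_394

tokens_per_step = batch_size * block_size         # 65,536
steps_per_epoch = train_sequences // batch_size   # 7,026
total_steps = steps_per_epoch * epochs            # 21,078
total_tokens = total_steps * tokens_per_step      # 1,381,367,808
```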
Training Run & Metrics
- Total steps: 21,081
- Total FLOPs: 5.24 × 10^16
- Runtime: ~1h 44m on A100 (Colab)
- Final Train Loss: 1.8054
Loss curve snapshot (selected steps):
| Step | Loss |
|---|---|
| 50 | 9.2160 |
| 100 | 8.2987 |
| 500 | 3.6695 |
| 1000 | 2.6862 |
| 5000 | 1.7699 |
| 10000 | 1.6385 |
| 15000 | 1.5620 |
| 21000 | 1.5140 |
Interpretation:
The rapid drop in loss during the early steps indicates effective learning.
The final loss of ≈ 1.51 suggests the model has learned coherent structure and vocabulary use for TinyStories-style text.
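For intuition, cross-entropy loss converts to perplexity via exp(loss), so the final snapshot loss of about 1.51 corresponds to a perplexity of roughly 4.5:

```python
import math

# Perplexity = exp(cross-entropy loss); ~4.5 means the model is, on average,
# about as uncertain as a uniform choice over 4-5 tokens at each position.
print(math.exp(1.5140))  # ≈ 4.54
```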
Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,            # sample instead of greedy decoding
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Lessons & Recommendations for Newcomers
- Start Small — Begin with a small dataset and small model. You’ll see results quickly without burning GPU time.
- Mixed Precision (bf16/fp16) — Saves VRAM and speeds up training.
- Clean Data — High-quality datasets like TinyStories make it easier to reach good results.
- Checkpoints — Save regularly (`save_steps`) in case Colab disconnects.
- Colab Session Stability — Keep your browser awake and use a stable internet connection.
- Publishing Early — Push checkpoints to Hugging Face to avoid accidental data loss (see the sketch below).
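A minimal sketch of that last point, assuming a write-enabled Hugging Face token and an illustrative repo name:

```python
from huggingface_hub import login

login()  # paste a write-enabled Hugging Face token when prompted

# Push the trained model and tokenizer to the Hub so checkpoints survive a lost Colab session.
repo_id = "your-username/gpt2-tinystories-from-scratch-10m"  # illustrative repo name
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```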
Limitations
- Short context length (512 tokens).
- Limited generalization beyond TinyStories style/content.
- Not suitable for factual QA or large-context reasoning.