GPT-2-Style TinyStories Model (From Scratch)

Overview

This repository contains a GPT-2–style language model trained from scratch on the roneneldan/TinyStories dataset using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.
The goal was to build a small, educational, and easily reproducible transformer LM for story generation.

This model is designed for:

  • Researchers exploring end-to-end LLM training workflows.
  • Beginners who want a hands-on example of training a transformer from scratch.
  • Educators demonstrating modern NLP model development without huge compute budgets.

Hardware & Environment

  • Platform: Google Colab Pro+
  • GPU: NVIDIA A100 (40 GB VRAM)
  • CPU RAM: 83.5 GB
  • Disk: 235.7 GB
  • Python: 3.x (Colab default)
  • Frameworks:
    • transformers (latest from pip)
    • datasets
    • accelerate
    • huggingface_hub
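
On a fresh Colab runtime, this stack can be installed with a single cell (exact versions were not pinned for this run):

!pip install -U transformers datasets accelerate huggingface_hub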

Dataset

Dataset: roneneldan/TinyStories — a curated synthetic dataset of short children’s stories.

  • Language: English
  • Cleanliness: High — minimal preprocessing needed
  • Structure: Each sample contains a single text field with a complete story
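
For reference, the dataset can be pulled straight from the Hub; the raw variable name matches the preprocessing snippet later in this card:

from datasets import load_dataset

# Download TinyStories from the Hugging Face Hub.
raw = load_dataset("roneneldan/TinyStories")

print(raw)                       # available splits and row counts
print(raw["train"][0]["text"])   # one complete story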

Why this dataset?

  • High signal-to-noise ratio.
  • Ideal for small models — vocabulary is modest, sentence structures are simple.
  • Useful for quick iterations and visible training convergence.

Model Architecture

A small GPT-2–like causal language model:

Hyperparameter                  Value
Layers (n_layer)                8
Attention Heads (n_head)        8
Embedding Dim (n_embd)          256
Vocabulary Size                 16,384
Sequence Length (block_size)    512
Params (approx.)                ~10.6M
Rotary Positional Embeddings    Disabled
Dropout                         0.0
Loss Function                   Causal LM cross-entropy (ForCausalLMLoss, auto-selected)
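
For readers who want to reproduce the setup, the table above maps onto the standard GPT2Config fields roughly as follows (a sketch; the exact configuration of the published checkpoint lives in its config.json):

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=16_384,   # Vocabulary Size
    n_positions=512,     # Sequence Length (block_size)
    n_embd=256,          # Embedding Dim
    n_layer=8,           # Layers
    n_head=8,            # Attention Heads
    embd_pdrop=0.0,      # Dropout disabled everywhere
    attn_pdrop=0.0,
    resid_pdrop=0.0,
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ≈ 10.6M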

Training Setup

TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 128,
    per_device_eval_batch_size = 128,
    gradient_accumulation_steps = 1,
    learning_rate = 3e-4,
    weight_decay = 0.1,
    warmup_ratio = 0.03,
    logging_steps = 50,
    save_steps = 500,
    save_total_limit = 3,
    bf16 = True,     # Mixed precision
    fp16 = False,
    evaluation_strategy = "steps",
    eval_steps = 500,
)
  • Optimizer: AdamW (default in HF Trainer)
  • Data Loading: Hugging Face datasets, tokenized and packed into blocks of block_size=512
  • Collator: DataCollatorForLanguageModeling with mlm=False
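
Wiring everything into the Trainer then looks roughly like this (a sketch: training_args is the TrainingArguments object above, and model / tokenizer / lm_datasets come from the other sections of this card):

from transformers import Trainer, DataCollatorForLanguageModeling

# mlm=False -> plain causal language modeling: the collator batches the
# fixed-length blocks and supplies labels (a copy of input_ids).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=collator,
)
trainer.train()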

Tokenization & Preprocessing

from itertools import chain

def tokenize_fn(batch):
    # Tokenize the raw stories without adding special tokens; packing happens below.
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

def group_texts(examples):
    # Concatenate all token ids in the batch, then split them into fixed-size
    # blocks, dropping the trailing remainder that doesn't fill a full block.
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [concatenated[i:i+CFG.block_size] for i in range(0, total_length, CFG.block_size)]
    }
    # For causal LM training the labels are a copy of the inputs;
    # the model shifts them internally when computing the loss.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
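
A quick, optional sanity check confirms that packing produced fixed-length blocks:

# Every example should now be exactly one block of block_size tokens,
# with labels identical to input_ids.
sample = lm_datasets["train"][0]
assert len(sample["input_ids"]) == CFG.block_size == 512
assert sample["labels"] == sample["input_ids"]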

Tokens

  • Number of sequences in train set: 899,394
  • Tokens per step: 65,536
  • Steps per epoch: 7,026
  • Total steps: 21,078
  • Total tokens processed: 1,381,367,808
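
These figures can be reproduced with a few lines of arithmetic; the small gap between the 21,078 computed here and the 21,081 steps reported by the Trainer below most likely comes from the final partial batch of each epoch, which the Trainer keeps by default:

batch_size  = 128
block_size  = 512
n_sequences = 899_394
epochs      = 3

tokens_per_step = batch_size * block_size         # 65,536
steps_per_epoch = n_sequences // batch_size       # 7,026 (full batches only)
total_steps     = steps_per_epoch * epochs        # 21,078
total_tokens    = total_steps * tokens_per_step   # 1,381,367,808

# Keeping the last partial batch gives ceil(899_394 / 128) = 7,027 steps
# per epoch, i.e. 21,081 optimizer steps over 3 epochs.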

Training Run & Metrics

  • Total steps: 21,081
  • Total FLOPs: 5.24 × 10^16
  • Runtime: ~1h 44m on A100 (Colab)
  • Train Loss (run average reported by the Trainer): 1.8054

Loss curve snapshot (selected steps):

Step     Loss
50       9.2160
100      8.2987
500      3.6695
1000     2.6862
5000     1.7699
10000    1.6385
15000    1.5620
21000    1.5140

Interpretation:
The rapid drop in loss over the first few hundred steps indicates that the model quickly picks up the basic token statistics of the corpus.
The final logged loss of ≈ 1.51 suggests the model has learned coherent sentence structure and vocabulary use for TinyStories-style text.
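
Because the objective is per-token cross-entropy, the last logged loss converts directly into a perplexity:

import math

final_logged_loss = 1.5140
print(f"perplexity ≈ {math.exp(final_logged_loss):.2f}")  # ≈ 4.54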


Inference Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
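
The sampling settings above are a reasonable default for this model: temperature 0.7 and top_p 0.9 keep the stories varied without drifting into nonsense, while repetition_penalty 1.1 discourages the looping that small models are prone to. Tune them for more or less adventurous output.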

Lessons & Recommendations for Newcomers

  • Start Small — Begin with a small dataset and small model. You’ll see results quickly without burning GPU time.
  • Mixed Precision (bf16/fp16) — Saves VRAM and speeds up training.
  • Clean Data — High-quality datasets like TinyStories make it easier to reach good results.
  • Checkpoints — Save regularly (save_steps) in case Colab disconnects.
  • Colab Session Stability — Keep the browser tab active and use a stable internet connection.
  • Publish Early — Push checkpoints to the Hugging Face Hub as you go so a lost runtime doesn't mean lost work (see the sketch below).
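
A minimal sketch of pushing the trained model to the Hub (the repo_id below is a placeholder; replace it with your own namespace):

from huggingface_hub import login

login()  # paste a write-enabled Hugging Face access token when prompted

repo_id = "your-username/gpt2-tinystories-from-scratch-10m"  # placeholder repo name
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)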

Limitations

  • Short context length (512 tokens).
  • Limited generalization beyond TinyStories style/content.
  • Not suitable for factual QA or large-context reasoning.