---
language: en
license: mit
tags:
- gpt2
- causal-lm
- from-scratch
- tinystories
datasets:
- roneneldan/TinyStories
library_name: transformers
pipeline_tag: text-generation
---

# GPT-2-Style TinyStories Model (From Scratch)

## Overview

This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU. The training objective was to build a small, educational, and easily reproducible transformer LM for story generation.

**This model is designed for:**

- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.

---

## Hardware & Environment

- **Platform**: Google Colab Pro+
- **GPU**: NVIDIA A100 (40 GB VRAM)
- **CPU RAM**: 83.5 GB
- **Disk**: 235.7 GB
- **Python**: 3.x (Colab default)
- **Frameworks**:
  - `transformers` (latest from pip)
  - `datasets`
  - `accelerate`
  - `huggingface_hub`

---

## Dataset

**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.

- **Language**: English
- **Cleanliness**: High — minimal preprocessing needed
- **Structure**: Each sample contains a single `text` field with a complete story

**Why this dataset?**

- High signal-to-noise ratio.
- Ideal for small models — vocabulary is modest, sentence structures are simple.
- Useful for quick iterations and visible training convergence.

---

## Model Architecture

A small GPT-2–like causal language model:

| Hyperparameter | Value |
|-----------------|---------|
| Layers (n_layer) | 8 |
| Attention Heads (n_head) | 8 |
| Embedding Dim (n_embd) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (block_size) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | ForCausalLMLoss (auto-selected) |
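The table above maps onto a standard Hugging Face `GPT2Config`. The sketch below shows one way to instantiate such a model; it is an approximation based on the table, not the exact configuration object from the training notebook (the dropout fields are inferred from the "Dropout: 0.0" row).

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch of a config matching the table above; the original notebook's
# exact settings may differ in minor details.
config = GPT2Config(
    vocab_size=16_384,
    n_positions=512,   # block_size
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.0,   # dropout disabled, per the table
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # roughly 10–11M with tied embeddings
```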
---

## Training Setup

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-tinystories-from-scratch",  # directory name assumed; not specified in the original card
    num_train_epochs=3,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    weight_decay=0.1,
    warmup_ratio=0.03,
    logging_steps=50,
    save_steps=500,
    save_total_limit=3,
    bf16=True,   # mixed precision
    fp16=False,
    evaluation_strategy="steps",
    eval_steps=500,
)
```

- **Optimizer**: AdamW (default in HF Trainer)
- **Data Loading**: `datasets` streaming & tokenization with `block_size=512`
- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False`

---

## Tokenization & Preprocessing

```python
from itertools import chain

def tokenize_fn(batch):
    # Tokenize raw stories without adding special tokens; documents are
    # concatenated and re-chunked into fixed-length blocks below.
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(
    tokenize_fn,
    batched=True,
    remove_columns=raw["train"].column_names,
)

def group_texts(examples):
    # Concatenate all token ids in the batch, then split them into
    # blocks of CFG.block_size tokens, dropping the trailing remainder.
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [
            concatenated[i : i + CFG.block_size]
            for i in range(0, total_length, CFG.block_size)
        ]
    }
    # For causal LM training the labels equal the inputs; the model
    # shifts them internally when computing the loss.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
```

---

## Tokens

- **Number of sequences in train set**: 899,394
- **Tokens per step**: 65,536 (128 sequences × 512 tokens)
- **Steps per epoch**: 7,026
- **Total steps**: 21,078 (7,026 steps × 3 epochs)
- **Total tokens processed**: 1,381,367,808 (21,078 steps × 65,536 tokens)

---

## Training Run & Metrics

- **Total steps**: 21,081
- **Total FLOPs**: 5.24 × 10^16
- **Runtime**: ~1h 44m on A100 (Colab)
- **Final Train Loss (Trainer-reported average over the run)**: 1.8054

Loss curve snapshot (selected steps):

```text
Step     Loss
50       9.2160
100      8.2987
500      3.6695
1000     2.6862
5000     1.7699
10000    1.6385
15000    1.5620
21000    1.5140
```

**Interpretation**: The rapid drop in loss during the early steps indicates effective learning. A final step loss of ≈1.51 suggests the model has learned coherent structure and vocabulary use for TinyStories-style text.

---

## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Lessons & Recommendations for Newcomers

- **Start Small** — Begin with a small dataset and a small model; you’ll see results quickly without burning GPU time.
- **Mixed Precision (bf16/fp16)** — Saves VRAM and speeds up training.
- **Clean Data** — High-quality datasets like TinyStories make it easier to reach good results.
- **Checkpoints** — Save regularly (`save_steps`) in case Colab disconnects.
- **Colab Session Stability** — Keep your browser awake and use a stable internet connection.
- **Publish Early** — Push checkpoints to Hugging Face to avoid accidental data loss (see the sketch below).
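To make the checkpointing and publishing advice concrete, here is a minimal, hypothetical sketch of how the objects defined in the sections above (`model`, `tokenizer`, `training_args`, and `lm_datasets`) could be wired into a `Trainer` and pushed to the Hub. It is not the exact notebook code used for this run; the validation split name and Hub settings are assumptions.

```python
from transformers import Trainer, DataCollatorForLanguageModeling

# Assumes `model`, `tokenizer`, `training_args`, and `lm_datasets` are the
# objects built in the sections above (assumption, not the original script).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],  # assumes a "validation" split exists
    data_collator=data_collator,
)

trainer.train()

# Save the final checkpoint and push model + tokenizer to the Hugging Face Hub
# (requires a logged-in token) so a dropped Colab session cannot wipe out the run.
trainer.save_model()
trainer.push_to_hub()
```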
---

## Limitations

- Short context length (512 tokens).
- Limited generalization beyond TinyStories style/content.
- Not suitable for factual QA or long-context reasoning.