+# GPT-2-Style TinyStories Model (From Scratch)
+## Overview
+This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset with Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.
+The goal was to build a small, educational, and easily reproducible transformer language model for story generation.
+**This model is designed for:**
+- Researchers exploring end-to-end LLM training workflows.
+- Beginners who want a hands-on example of training a transformer from scratch.
+- Educators demonstrating modern NLP model development without huge compute budgets.
+---
+## Hardware & Environment
+- **Platform**: Google Colab Pro+
+- **GPU**: NVIDIA A100 (40 GB VRAM)
+- **CPU RAM**: 83.5 GB
+- **Disk**: 235.7 GB
+- **Python**: 3.x (Colab default)
+- **Frameworks**:
+  - `transformers` (latest from pip)
+  - `datasets`
+  - `accelerate`
+  - `huggingface_hub`
+---
+## Dataset
+**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.
+- **Language**: English
+- **Cleanliness**: High — minimal preprocessing needed
+- **Structure**: Each sample contains a single text field with a complete story
+**Why this dataset?**
+- High signal-to-noise ratio.
+- Ideal for small models — vocabulary is modest, sentence structures are simple.
+- Useful for quick iterations and visible training convergence.
+---
+## Model Architecture
+A small GPT-2–like causal language model:
+| Hyperparameter  | Value   |
+|-----------------|---------|
+| Layers (n_layer) | 8 |
+| Attention Heads (n_head) | 8 |
+| Embedding Dim (n_embd) | 256 |
+| Vocabulary Size | 16,384 |
+| Sequence Length (block_size) | 512 |
+| Params (approx.) | ~10–12M |
+| Rotary Positional Embeddings | Disabled |
+| Dropout | 0.0 |
+| Loss Function | ForCausalLMLoss (auto-selected) |
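The parameter estimate in the table can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, biases, learned position embeddings) and an output head tied to the token embeddings. Neither detail is stated in the table, and `gpt2_param_count` is a hypothetical helper, not part of the training code.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, block_size, tie_lm_head=True):
    """Count parameters of a GPT-2-style decoder (assumed standard layout)."""
    d = n_embd
    token_emb = vocab_size * d                   # token embedding matrix
    pos_emb = block_size * d                     # learned position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)     # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4d, project back
    layer_norms = 2 * (2 * d)                    # two LayerNorms per block
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d                             # final LayerNorm
    lm_head = 0 if tie_lm_head else vocab_size * d
    return token_emb + pos_emb + n_layer * per_layer + final_ln + lm_head


total = gpt2_param_count(n_layer=8, n_embd=256, vocab_size=16_384, block_size=512)
untied = gpt2_param_count(8, 256, 16_384, 512, tie_lm_head=False)
print(f"tied head:   {total:,} parameters")
print(f"untied head: {untied:,} parameters")
```

Under these assumptions the tied-head count comes to roughly 10.6M, inside the range given in the table, while an untied head would push it to about 14.8M.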
+---
+## Training Setup
+```python
+from transformers import TrainingArguments
+
+TrainingArguments(
+    num_train_epochs=3,
+    per_device_train_batch_size=128,
+    per_device_eval_batch_size=128,
+    gradient_accumulation_steps=1,
+    learning_rate=3e-4,
+    weight_decay=0.1,
+    warmup_ratio=0.03,
+    logging_steps=50,
+    save_steps=500,
+    save_total_limit=3,
+    bf16=True,   # bfloat16 mixed precision
+    fp16=False,
+    evaluation_strategy="steps",  # named `eval_strategy` in recent transformers releases
+    eval_steps=500,
+)
+```
+- **Optimizer**: AdamW (default in HF Trainer)
+- **Data Loading**: `datasets` streaming & tokenization with `block_size=512`
+- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False`
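To make these settings concrete, the following back-of-the-envelope sketch derives the per-step token budget and the warmup length. It assumes a single GPU, that every batch is full, and that the warmup step count is rounded up from `warmup_ratio`. The total of 21,081 steps is the figure reported for this run; the variable names are illustrative only.

```python
import math

# Values taken from the TrainingArguments and model settings in this card.
per_device_train_batch_size = 128
gradient_accumulation_steps = 1
block_size = 512
warmup_ratio = 0.03
num_train_epochs = 3
total_steps = 21_081  # reported for this run

# Tokens consumed by one optimizer step (single GPU assumed).
tokens_per_step = per_device_train_batch_size * gradient_accumulation_steps * block_size

# Warmup length implied by warmup_ratio (assumed to be rounded up).
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Steps per epoch, and an upper bound on 512-token sequences per epoch.
steps_per_epoch = total_steps // num_train_epochs
max_sequences_per_epoch = steps_per_epoch * per_device_train_batch_size

print(f"tokens per step:      {tokens_per_step:,}")
print(f"warmup steps:         {warmup_steps}")
print(f"steps per epoch:      {steps_per_epoch:,}")
print(f"sequences per epoch: <= {max_sequences_per_epoch:,}")
```

Under these assumptions the learning rate ramps up over the first few hundred steps, and each epoch covers a little under 900k packed sequences.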
+---
+## Tokenization & Preprocessing
+```python
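+from itertools import chain
+
+# `raw`, `tokenizer`, and `CFG` are defined elsewhere in the training code.
+def tokenize_fn(batch):
+    # No special tokens are added, so stories are concatenated back to back.
+    return tokenizer(batch["text"], add_special_tokens=False)
+
+tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)
+
+def group_texts(examples):
+    # Concatenate every tokenized story in the batch into one long stream.
+    concatenated = list(chain(*examples["input_ids"]))
+    # Drop the tail that does not fill a whole block.
+    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
+    result = {
+        "input_ids": [
+            concatenated[i : i + CFG.block_size]
+            for i in range(0, total_length, CFG.block_size)
+        ]
+    }
+    # Labels mirror the inputs; Hugging Face causal LM models shift them internally.
+    result["labels"] = result["input_ids"].copy()
+    return result
+
+# Remove the old columns (e.g. attention_mask) so row counts stay consistent.
+lm_datasets = tokenized.map(group_texts, batched=True, remove_columns=tokenized["train"].column_names)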
+```
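The packing step is easier to follow on a toy input. The sketch below re-implements the same chunking logic as a standalone function with `block_size` passed as an argument (a small change made here so the example runs without `CFG`) and applies it to three short, made-up token sequences.

```python
from itertools import chain


def group_texts(examples, block_size):
    """Concatenate token lists and split them into fixed-size blocks."""
    concatenated = list(chain(*examples["input_ids"]))
    # Keep only as many tokens as fill whole blocks; the remainder is dropped.
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    return {"input_ids": blocks, "labels": [block[:] for block in blocks]}


# Three "stories" with 3, 5, and 2 tokens (10 tokens in total).
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
packed = group_texts(toy_batch, block_size=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Story boundaries are not preserved: the first block mixes tokens from two stories, and the trailing tokens `9, 10` are discarded because they do not fill a block.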
+---
+## Training Run & Metrics
+- **Total steps**: 21,081
+- **Total FLOPs**: 5.24 × 10^16
+- **Runtime**: ~1h 44m on A100 (Colab)
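+- **Final Train Loss**: 1.8054 (reported at the end of training; the Hugging Face Trainer averages this value over the whole run, so it sits above the last logged step loss of 1.5140)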
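The reported FLOPs figure can be cross-checked against the common `6 × parameters × tokens` rule of thumb. The sketch below assumes the count uses non-embedding parameters only (which is, to the best of our knowledge, how Transformers estimates it), a standard GPT-2 block layout, and full batches at every step.

```python
# Hyperparameters from this card.
n_layer, d = 8, 256
total_steps, batch_size, block_size = 21_081, 128, 512

# Non-embedding parameters of an assumed standard GPT-2 layout:
# 12*d^2 + 13*d per block (attention, MLP, two LayerNorms) plus the final LayerNorm.
non_embedding_params = n_layer * (12 * d * d + 13 * d) + 2 * d

# Upper bound on tokens processed (the last batch of an epoch may be partial).
tokens_seen = total_steps * batch_size * block_size

# Forward + backward cost is roughly 6 FLOPs per parameter per token.
approx_flops = 6 * non_embedding_params * tokens_seen
print(f"non-embedding params: {non_embedding_params:,}")
print(f"tokens seen:          {tokens_seen:,}")
print(f"approx. FLOPs:        {approx_flops:.2e}")
```

The estimate lands at about 5.24 × 10^16, in line with the reported total.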
+Loss curve snapshot (selected steps):
+```text
+Step     Loss
+50       9.2160
+100      8.2987
+500      3.6695
+1000     2.6862
+5000     1.7699
+10000    1.6385
+15000    1.5620
+21000    1.5140
+```
+**Interpretation**:
+The rapid drop in loss over the first 1,000 steps (9.22 to 2.69) indicates effective learning.
+The last logged loss of ≈ 1.51 (step 21,000) suggests the model has learned coherent structure and vocabulary use for TinyStories-style text.
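Two quick conversions help put these numbers in context. This sketch assumes the logged loss is the mean cross-entropy per token in nats, which is the usual convention for PyTorch-based causal LM training.

```python
import math

vocab_size = 16_384
first_logged_loss = 9.2160   # step 50
last_logged_loss = 1.5140    # step 21,000

# Loss of a model that assigns equal probability to every token.
chance_loss = math.log(vocab_size)

# Perplexity is the exponential of the per-token cross-entropy.
final_perplexity = math.exp(last_logged_loss)

print(f"chance-level loss:  {chance_loss:.3f}")
print(f"first logged loss:  {first_logged_loss:.3f}")
print(f"final perplexity:   {final_perplexity:.2f}")
```

The first logged loss sits just below the chance level of about 9.70, as expected for a freshly initialized model, and the final perplexity of roughly 4.5 means the model is, on average, about as uncertain as a choice among four or five tokens.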
+---
+## Inference Example
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"
+tokenizer = AutoTokenizer.from_pretrained(repo_id)
+if tokenizer.pad_token is None:
+    tokenizer.pad_token = tokenizer.eos_token
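+# Use fp16 only on GPU; fall back to fp32 on CPU, where half precision is slow or unsupported.
+device = "cuda" if torch.cuda.is_available() else "cpu"
+dtype = torch.float16 if device == "cuda" else torch.float32
+model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=dtype).to(device)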
+prompt = "One day, a little girl named Lily found a needle in her"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+with torch.inference_mode():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=100,
+        do_sample=True,
+        temperature=0.7,
+        top_p=0.9,
+        repetition_penalty=1.1,
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.pad_token_id
+    )
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+---
+## Lessons & Recommendations for Newcomers
+- **Start Small**: Begin with a small dataset and a small model. You’ll see results quickly without burning GPU time.
+- **Mixed Precision (bf16/fp16)**: Saves VRAM and speeds up training on GPUs that support it, such as the A100.
+- **Clean Data**: High-quality datasets like TinyStories make it easier to reach good results.
+- **Checkpoints**: Save regularly (`save_steps`) in case Colab disconnects.
+- **Colab Session Stability**: Keep your browser awake and use a stable internet connection.
+- **Publish Early**: Push checkpoints to Hugging Face to avoid accidental data loss.
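For the checkpointing advice, a small helper can locate the most recent checkpoint after a Colab disconnect. The sketch below relies on the Trainer's `checkpoint-<step>` folder naming; `latest_checkpoint` is a hypothetical helper written for illustration, and the output directory you pass to it is whatever you configured for your own run.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the newest `checkpoint-<step>` folder in output_dir, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            # Compare by numeric step so checkpoint-10000 beats checkpoint-9500.
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None
```

The returned path can be passed to `trainer.train(resume_from_checkpoint=str(path))` to continue a run. Transformers also ships a similar utility, `transformers.trainer_utils.get_last_checkpoint`, which may be preferable in practice.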
+---
+## Limitations
+- Short context length (512 tokens).
+- Limited generalization beyond TinyStories style/content.
+- Not suitable for factual QA or long-context reasoning.