---
language: en
license: mit
tags:
- gpt2
- causal-lm
- from-scratch
- tinystories
datasets:
- roneneldan/TinyStories
library_name: transformers
pipeline_tag: text-generation
---
# GPT-2-Style TinyStories Model (From Scratch)
## Overview
This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset, using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.
The goal was to build a small, educational, and easily reproducible transformer LM for story generation.
**This model is designed for:**
- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.
---
## Hardware & Environment
- **Platform**: Google Colab Pro+
- **GPU**: NVIDIA A100 (40 GB VRAM)
- **CPU RAM**: 83.5 GB
- **Disk**: 235.7 GB
- **Python**: 3.x (Colab default)
- **Frameworks**:
- `transformers` (latest from pip)
- `datasets`
- `accelerate`
- `huggingface_hub`
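
For reproducibility, the stack can be installed in a Colab cell along the lines below; versions are not pinned in this card, so the latest pip releases are assumed.
```python
# Run in a Colab cell; the leading "!" hands the line to the shell (Colab/IPython syntax).
!pip install -q transformers datasets accelerate huggingface_hub
```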
---
## Dataset
**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.
- **Language**: English
- **Cleanliness**: High — minimal preprocessing needed
- **Structure**: Each sample contains a single text field with a complete story
**Why this dataset?**
- High signal-to-noise ratio.
- Ideal for small models — vocabulary is modest, sentence structures are simple.
- Useful for quick iterations and visible training convergence.
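
To get a feel for the data, the dataset can be pulled straight from the Hub and inspected; the sketch below assumes only the `datasets` library listed above.
```python
from datasets import load_dataset

# Download TinyStories from the Hugging Face Hub (train + validation splits).
raw = load_dataset("roneneldan/TinyStories")

print(raw)                       # split names and sizes
print(raw["train"][0]["text"])   # one complete short story per sample
```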
---
## Model Architecture
A small GPT-2–like causal language model:
| Hyperparameter | Value |
|-----------------|---------|
| Layers (n_layer) | 8 |
| Attention Heads (n_head) | 8 |
| Embedding Dim (n_embd) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (block_size) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | ForCausalLMLoss (auto-selected) |
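
The exact model-construction code is not included in this card, but a standard `GPT2Config` with the hyperparameters above can be sketched as follows; treat it as illustrative rather than the original training script.
```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative config matching the table above (not the original notebook cell).
config = GPT2Config(
    vocab_size=16_384,
    n_positions=512,   # block_size / maximum sequence length
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.0,   # dropout disabled, as in the table
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ~10-11M with tied embeddings
```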
---
## Training Setup
```python
TrainingArguments(
num_train_epochs = 3,
per_device_train_batch_size = 128,
per_device_eval_batch_size = 128,
gradient_accumulation_steps = 1,
learning_rate = 3e-4,
weight_decay = 0.1,
warmup_ratio = 0.03,
logging_steps = 50,
save_steps = 500,
save_total_limit = 3,
bf16 = True, # Mixed precision
fp16 = False,
evaluation_strategy = "steps",
eval_steps = 500,
)
```
- **Optimizer**: AdamW (default in HF Trainer)
- **Data Loading**: `datasets` streaming & tokenization with `block_size=512`
- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False` (wired into the `Trainer` in the sketch below)
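
Putting the pieces together, the `Trainer` wiring looks roughly like this; variable names such as `training_args` and `lm_datasets` follow the other snippets in this card, and the block is a sketch rather than the exact notebook cell.
```python
from transformers import DataCollatorForLanguageModeling, Trainer

# mlm=False -> plain causal-LM batches (no masked-token corruption).
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,                      # the TrainingArguments shown above
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

trainer.train()
```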
---
## Tokenization & Preprocessing
```python
from itertools import chain

# `tokenizer`, `raw` (the loaded TinyStories DatasetDict), and `CFG.block_size`
# (512) are defined earlier in the notebook.

def tokenize_fn(batch):
    # Tokenize raw stories without adding special tokens; packing happens below.
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

def group_texts(examples):
    # Concatenate all tokenized stories and split them into fixed-size blocks.
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [concatenated[i:i + CFG.block_size] for i in range(0, total_length, CFG.block_size)]
    }
    # For causal LM, labels are the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

# remove_columns drops leftover tokenizer outputs (e.g. attention_mask) whose
# row count would no longer match after grouping.
lm_datasets = tokenized.map(group_texts, batched=True, remove_columns=tokenized["train"].column_names)
```
---
## Tokens
- **Number of sequences in train set**: 899,394
- **Tokens per step**: 65,536
- **Steps per epoch**: 7,026 (full batches; the final partial batch adds one more optimizer step)
- **Total steps**: 21,078 (the logged run shows 21,081 once the three partial batches are counted)
- **Total tokens processed**: 1,381,367,808
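
These figures follow directly from the batch size (128) and block size (512) in the Training Setup section; the arithmetic is simply:
```python
# Back-of-the-envelope arithmetic behind the token counts above.
batch_size, block_size, epochs = 128, 512, 3
train_sequences = 899_394

tokens_per_step = batch_size * block_size        # 128 * 512 = 65,536
steps_per_epoch = train_sequences // batch_size  # 7,026 full batches
total_steps = steps_per_epoch * epochs           # 21,078
total_tokens = total_steps * tokens_per_step     # 1,381,367,808
```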
---
## Training Run & Metrics
- **Total steps**: 21,081
- **Total FLOPs**: 5.24 × 10^16
- **Runtime**: ~1h 44m on A100 (Colab)
- **Final Train Loss**: 1.8054
Loss curve snapshot (selected steps):

| Step | Loss |
|-------|--------|
| 50 | 9.2160 |
| 100 | 8.2987 |
| 500 | 3.6695 |
| 1000 | 2.6862 |
| 5000 | 1.7699 |
| 10000 | 1.6385 |
| 15000 | 1.5620 |
| 21000 | 1.5140 |

**Interpretation**:
The rapid drop in loss during the early steps indicates effective learning.
A final loss of ≈ 1.51 suggests the model has learned coherent structure and vocabulary use for TinyStories-style text.
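
For intuition, cross-entropy loss converts to perplexity via `exp(loss)`, so the final loss implies a per-token perplexity of roughly 4.5:
```python
import math

# Perplexity implied by the final training loss reported above.
print(math.exp(1.5140))  # ≈ 4.5
```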
---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)
if torch.cuda.is_available():
model.to("cuda")
prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
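
Alternatively, since the model is tagged for `text-generation`, the high-level `pipeline` API works for quick experiments:
```python
from transformers import pipeline

# One-liner sampling via the text-generation pipeline.
generator = pipeline("text-generation", model="vijaymohan/gpt2-tinystories-from-scratch-10m")
result = generator("One day, a little girl named Lily", max_new_tokens=100, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])
```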
---
## Lessons & Recommendations for Newcomers
- **Start Small** — Begin with a small dataset and small model. You’ll see results quickly without burning GPU time.
- **Mixed Precision (bf16/fp16)** — Saves VRAM and speeds up training.
- **Clean Data** — High-quality datasets like TinyStories make it easier to reach good results.
- **Checkpoints** — Save regularly (`save_steps`) in case Colab disconnects.
- **Colab Session Stability** — Keep your browser awake and use a stable internet connection.
- **Publishing Early** — Push checkpoints to Hugging Face to avoid accidental data loss (see the sketch below).
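
A minimal sketch of the last point, pushing directly from the notebook (the repo name below is a placeholder, and this is not necessarily how this particular repository was uploaded):
```python
from huggingface_hub import login

# Authenticate once per session with a write token, then push model + tokenizer.
login()  # or login(token="hf_...")
model.push_to_hub("your-username/gpt2-tinystories-from-scratch")      # placeholder repo id
tokenizer.push_to_hub("your-username/gpt2-tinystories-from-scratch")  # placeholder repo id
```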
---
## Limitations
- Short context length (512 tokens).
- Limited generalization beyond TinyStories style/content.
- Not suitable for factual QA or large-context reasoning.