---
language: en
license: mit
tags:
- gpt2
- causal-lm
- from-scratch
- tinystories
datasets:
- roneneldan/TinyStories
library_name: transformers
pipeline_tag: text-generation
---


# GPT-2-Style TinyStories Model (From Scratch)

## Overview
This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset, using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.  
The goal was to build a small, educational, and easily reproducible transformer LM for story generation.

**This model is designed for:**
- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.

---
## Hardware & Environment
- **Platform**: Google Colab Pro+
- **GPU**: NVIDIA A100 (40 GB VRAM)
- **CPU RAM**: 83.5 GB
- **Disk**: 235.7 GB
- **Python**: 3.x (Colab default)
- **Frameworks**:
  - `transformers` (latest from pip)
  - `datasets`
  - `accelerate`
  - `huggingface_hub`

---
## Dataset
**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.  
- **Language**: English  
- **Cleanliness**: High — minimal preprocessing needed  
- **Structure**: Each sample contains a single text field with a complete story  

**Why this dataset?**
- High signal-to-noise ratio.
- Ideal for small models — vocabulary is modest, sentence structures are simple.
- Useful for quick iterations and visible training convergence.
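
As a quick sanity check, the dataset can be loaded and inspected in a few lines. This is a minimal sketch; it assumes only the `datasets` package from the environment above, and `raw` is the same dataset object used in the preprocessing section further down.

```python
from datasets import load_dataset

# Each split exposes a single "text" column holding one complete story per row.
raw = load_dataset("roneneldan/TinyStories")
print(raw)                              # splits and row counts
print(raw["train"][0]["text"][:200])    # first 200 characters of the first story
```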

---
## Model Architecture
A small GPT-2–like causal language model:

| Hyperparameter  | Value   |
|-----------------|---------|
| Layers (n_layer) | 8 |
| Attention Heads (n_head) | 8 |
| Embedding Dim (n_embd) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (block_size) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | ForCausalLMLoss (auto-selected) |
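
For reference, a configuration matching this table can be written with `transformers`’ `GPT2Config`. This is a sketch under the assumption that a standard GPT-2 block (learned positional embeddings, no rotary embeddings) was used; the exact training script may differ.

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=16_384,   # matches the tokenizer size
    n_positions=512,     # block_size / maximum sequence length
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.0,     # dropout disabled everywhere
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # roughly 10-12M
```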

---
## Training Setup
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-tinystories-from-scratch",  # checkpoint directory (name is illustrative)
    num_train_epochs=3,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    weight_decay=0.1,
    warmup_ratio=0.03,
    logging_steps=50,
    save_steps=500,
    save_total_limit=3,
    bf16=True,     # mixed precision (bf16 is well supported on the A100)
    fp16=False,
    evaluation_strategy="steps",
    eval_steps=500,
)
```
- **Optimizer**: AdamW (default in HF Trainer)  
- **Data Loading**: `datasets` streaming & tokenization with `block_size=512`  
- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False` (see the wiring sketch below)  
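
Below is a minimal sketch of how these pieces plug into the HF `Trainer`. It assumes `model`, `tokenizer`, the `training_args` above, and the `lm_datasets` produced in the preprocessing section below; the dataset's validation split is used for the periodic evaluation.

```python
from transformers import Trainer, DataCollatorForLanguageModeling

# mlm=False -> causal-LM batches: labels are the input ids, with padding ignored in the loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=collator,
)

trainer.train()
trainer.save_model()  # writes the final weights to output_dir
```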

---
## Tokenization & Preprocessing
```python
from itertools import chain

# `raw` is the TinyStories DatasetDict loaded earlier, `tokenizer` is the model's
# tokenizer, and CFG.block_size is the 512-token sequence length from the config.

def tokenize_fn(batch):
    # Tokenize whole stories without special tokens; sequences are packed below.
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

def group_texts(examples):
    # Concatenate all token ids in the batch, then split into fixed-length blocks.
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size  # drop the ragged tail
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [concatenated[i:i + CFG.block_size] for i in range(0, total_length, CFG.block_size)]
    }
    # For causal LM training the labels are the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
```

---

## Tokens
- **Number of sequences in train set**: 899,394
- **Tokens per step**: 65,536
- **Steps per epoch**: 7,026
- **Total steps**: 21,078
- **Total tokens processed**: 1,381,367,808
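
These figures follow directly from the batch geometry above. A quick check (the only assumption is that steps per epoch are counted with floor division; the Trainer's own step count in the metrics below also includes the final partial batch of each epoch, hence 21,081 there):

```python
sequences  = 899_394   # fixed-length 512-token blocks in the train set
batch_size = 128
block_size = 512
epochs     = 3

tokens_per_step = batch_size * block_size        # 65,536
steps_per_epoch = sequences // batch_size        # 7,026 full batches per epoch
total_steps     = epochs * steps_per_epoch       # 21,078
total_tokens    = total_steps * tokens_per_step  # 1,381,367,808

print(tokens_per_step, steps_per_epoch, total_steps, total_tokens)
```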

---
## Training Run & Metrics
- **Total steps**: 21,081  
- **Total FLOPs**: 5.24 × 10^16  
- **Runtime**: ~1h 44m on A100 (Colab)  
- **Final Train Loss**: 1.8054 (Trainer-reported average over the whole run; the last logged step loss was ≈ 1.51, see below)  

Loss curve snapshot (selected steps):
| Step | Loss |
|-------|--------|
| 50 | 9.2160 |
| 100 | 8.2987 |
| 500 | 3.6695 |
| 1000 | 2.6862 |
| 5000 | 1.7699 |
| 10000 | 1.6385 |
| 15000 | 1.5620 |
| 21000 | 1.5140 |
**Interpretation**:  
The rapid drop in loss over the early steps indicates the model is learning effectively.  
A final loss of ≈ 1.51 (per-token cross-entropy) corresponds to a perplexity of roughly e^1.51 ≈ 4.5, suggesting the model has learned coherent structure and vocabulary use for TinyStories-style text.

---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Use fp16 on GPU; fall back to fp32 on CPU, where half precision is slow or unsupported.
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=dtype)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---
## Lessons & Recommendations for Newcomers
- **Start Small** — Begin with a small dataset and small model. You’ll see results quickly without burning GPU time.
- **Mixed Precision (bf16/fp16)** — Saves VRAM and speeds up training.
- **Clean Data** — High-quality datasets like TinyStories make it easier to reach good results.
- **Checkpoints** — Save regularly (`save_steps`) in case Colab disconnects.
- **Colab Session Stability** — Keep your browser awake and use a stable internet connection.
- **Publish Early** — Push checkpoints to the Hugging Face Hub as you train to avoid losing work (see the sketch below).
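
One lightweight way to do this is sketched below. The repo id is illustrative, and you need to be logged in first (`huggingface-cli login` or an `HF_TOKEN` environment variable).

```python
# Option 1: let the Trainer upload checkpoints as it saves them, by adding to TrainingArguments:
#   push_to_hub=True,
#   hub_model_id="your-username/gpt2-tinystories-from-scratch",
#   hub_strategy="checkpoint",   # also pushes the latest checkpoint so training can resume after a disconnect

# Option 2: push manually at any point.
model.push_to_hub("your-username/gpt2-tinystories-from-scratch")
tokenizer.push_to_hub("your-username/gpt2-tinystories-from-scratch")
```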

---
## Limitations
- Short context length (512 tokens).
- Limited generalization beyond TinyStories style/content.
- Not suitable for factual QA or large-context reasoning.