---
language: en
---

Fork of sosier's nanoGPT - Character-Level Shakespeare

This is a fork of sosier/nanoGPT-shakespear-char-tied-weights for demonstration purposes.

Quickstart

Load the model:

from transformers import AutoModel
model = AutoModel.from_pretrained("n8cha/nanoGPT-shakespeare-char", trust_remote_code=True)

Set up inference:

import torch

class CharTokenizer:
    """Minimal character-level tokenizer for the model's 65-character vocabulary."""

    def __init__(self):
        self.token_map = {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
        self.rev_map = {v: k for k, v in self.token_map.items()}

    def encode(self, text):
        try:
            return [self.token_map[c] for c in text]
        except KeyError as e:
            raise ValueError(f"Character not in vocabulary: {e.args[0]}")

    def decode(self, tokens):
        try:
            return ''.join(self.rev_map[t] for t in tokens)
        except KeyError as e:
            raise ValueError(f"Token not in vocabulary: {e.args[0]}")

tokenizer = CharTokenizer()

def generate(prompt):
    # Encode the prompt and add a batch dimension
    prompt_encoded = tokenizer.encode(prompt)
    x = torch.tensor(prompt_encoded, dtype=torch.long, device="cpu")[None, ...]
    # Sample up to 1000 new characters without tracking gradients
    with torch.no_grad():
        y = model.generate(
            x,
            max_new_tokens=1000,
            temperature=0.8,
            top_k=200
        )
    return tokenizer.decode(y[0].tolist())

Run inference:

response = generate("O Romeo, Romeo, ")
print(response)

Below is the original README.


nanoGPT - Character-Level Shakespeare - Tied Weights

A small character-level, GPT-style language model trained on the works of Shakespeare using Andrej Karpathy's nanoGPT repo, built as part of my project LLMs Universally Learn a Feature Representing Token Frequency / Rarity.

Versions

This model has two versions:

  1. With tied embedding / unembedding weights (in true GPT fashion) - THIS PAGE
  2. Without tied embedding / unembedding weights

Usage

The model can be loaded using AutoModel from Hugging Face's transformers package:

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("n8cha/nanoGPT-shakespeare-char", trust_remote_code=True)
>>> model
number of parameters: 10.65M

NanoGPT(
  (transformer): ModuleDict(
    (wte): Embedding(65, 384)
    (wpe): Embedding(256, 384)
    (drop): Dropout(p=0.2, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=384, out_features=1152, bias=False)
          (c_proj): Linear(in_features=384, out_features=384, bias=False)
          (attn_dropout): Dropout(p=0.2, inplace=False)
          (resid_dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=384, out_features=1536, bias=False)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=1536, out_features=384, bias=False)
          (dropout): Dropout(p=0.2, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=384, out_features=65, bias=False)
)
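
Since this is the tied-weights version, the token embedding (wte) and the output projection (lm_head) share their weights. A quick sanity check (a minimal sketch, assuming the attribute names shown in the printout above):

# Verify that the embedding and unembedding matrices share the same storage
print(model.transformer.wte.weight.data_ptr() == model.lm_head.weight.data_ptr())
# Expected: True for this (tied-weights) version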

Training Data / Token Counts

The training data token counts can be found on my GitHub repo here and can be loaded using the instructions here.

Tokenizer

As a character-level model, the tokenizer is simply a mapping from each character to its token ID, as given in the token counts (see the section above).
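
Concretely, the vocabulary is the sorted set of unique characters in the training text, numbered in order, which yields the 65-entry map used in the Quickstart above. A minimal sketch of building such a mapping (the file name input.txt is an assumption for illustration):

# Build the character-level vocabulary from the training text
text = open("input.txt", encoding="utf-8").read()   # hypothetical path to the Shakespeare text
chars = sorted(set(text))                            # 65 unique characters for this dataset
stoi = {ch: i for i, ch in enumerate(chars)}         # character -> token id
itos = {i: ch for ch, i in stoi.items()}             # token id -> character

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)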