Fork of sosier's nanoGPT - Character-Level Shakespeare

This is a fork of sosier/nanoGPT-shakespeare-char-tied-weights, created for demonstration purposes.

Quickstart

Load model:

from transformers import AutoModel
model = AutoModel.from_pretrained("n8cha/nanoGPT-shakespeare-char", trust_remote_code=True)

Setup inference:

import torch

class CharTokenizer:
    """Minimal character-level tokenizer matching the model's 65-character vocabulary."""

    def __init__(self):
        self.token_map = {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
        self.rev_map = {v: k for k, v in self.token_map.items()}

    def encode(self, text):
        try:
            return [self.token_map[c] for c in text]
        except KeyError as e:
            raise ValueError(f"Character not in vocabulary: {e.args[0]}")

    def decode(self, tokens):
        try:
            return ''.join(self.rev_map[t] for t in tokens)
        except KeyError as e:
            raise ValueError(f"Token not in vocabulary: {e.args[0]}")

tokenizer = CharTokenizer()

def generate(prompt):
    # Encode the prompt and add a batch dimension
    prompt_encoded = tokenizer.encode(prompt)
    x = torch.tensor(prompt_encoded, dtype=torch.long, device="cpu")[None, ...]
    with torch.no_grad():
        # Sample up to 1000 new tokens, then decode back to text
        y = model.generate(
            x,
            max_new_tokens=1000,
            temperature=0.8,
            top_k=200
        )
        return tokenizer.decode(y[0].tolist())

Run inference:

response = generate("O Romeo, Romeo, ")
print(response)

Below is the original README.


nanoGPT - Character-Level Shakespeare - Tied Weights

A small character-level, GPT-style language model trained on the works of Shakespeare using Andrej Karpathy's nanoGPT repo, from my project LLMs Universally Learn a Feature Representing Token Frequency / Rarity.

Versions

This model has two versions:

  1. With tied embedding / unembedding weights (in true GPT fashion) - THIS PAGE (a quick check is sketched below)
  2. Without tied embedding / unembedding weights
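
Once the model is loaded (see Usage below), weight tying can be verified directly. The following is a minimal sketch; the attribute names transformer.wte and lm_head follow the architecture printed in the Usage section:

# True for this (tied) version; the untied version keeps separate embedding / unembedding tensors
print(model.transformer.wte.weight is model.lm_head.weight)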

Usage

The model can be loaded using AutoModel from Hugging Face's transformers package:

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("n8cha/nanoGPT-shakespeare-char", trust_remote_code=True)
>>> model
number of parameters: 10.65M

NanoGPT(
  (transformer): ModuleDict(
    (wte): Embedding(65, 384)
    (wpe): Embedding(256, 384)
    (drop): Dropout(p=0.2, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=384, out_features=1152, bias=False)
          (c_proj): Linear(in_features=384, out_features=384, bias=False)
          (attn_dropout): Dropout(p=0.2, inplace=False)
          (resid_dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=384, out_features=1536, bias=False)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=1536, out_features=384, bias=False)
          (dropout): Dropout(p=0.2, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=384, out_features=65, bias=False)
)
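
The printed figure of 10.65M follows nanoGPT's convention of reporting the parameter count excluding the position embeddings. A rough sanity check (a sketch, assuming the attribute names shown above):

# Count all parameters, then subtract the position embedding table
total = sum(p.numel() for p in model.parameters())
non_embedding = total - model.transformer.wpe.weight.numel()
print(f"{total:,} total parameters, {non_embedding:,} excluding position embeddings")  # ~10.65M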

Training Data / Token Counts

The training data token counts can be found on my GitHub repo here and can be loaded using the instructions here.
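
For illustration only (this is not the author's exact pipeline), per-character counts over the raw training text can be computed in a few lines, assuming a local input.txt containing the tiny-shakespeare corpus:

from collections import Counter

# input.txt is a hypothetical local copy of the training text
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

counts = Counter(text)            # how often each character (token) appears
print(counts.most_common(10))     # the most frequent characters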

Tokenizer

Because this is a character-level model, the tokenizer is simply a mapping from each character to its token ID, as given in the token counts (see the section above).
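
A minimal sketch of how such a mapping can be built (mirroring nanoGPT's shakespeare_char prepare script), assuming text holds the raw training corpus as read in the sketch above:

chars = sorted(set(text))                      # 65 unique characters for tiny shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character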
