RobbiePasquale committed
Commit cbda8b4 · verified · 1 Parent(s): 916381a

Update README.md

Files changed (1)
  1. README.md +64 -0
README.md CHANGED
@@ -24,6 +24,67 @@ This model is suitable for tasks that require complex decision-making and optimi
 
  The model is constructed with several primary components:
  1. **Transformer**: The transformer has encoder and decoder layers with rotary positional encoding and a Mixture of Experts (MoE), which improves generalization and reduces computational cost by routing each input only to a subset of experts. The experts alternate between GELU and SwiGLU activation functions.
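
Item 1 mentions MoE experts that alternate GELU and SwiGLU activations. Below is a minimal, illustrative sketch of such experts with a simple top-1 router in PyTorch; the class names, sizes, and routing scheme are assumptions for clarity, not the repository's actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GELUExpert(nn.Module):
    """Illustrative feed-forward expert with a GELU activation."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class SwiGLUExpert(nn.Module):
    """Illustrative SwiGLU feed-forward expert: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TinyMoE(nn.Module):
    """Top-1 routing over experts that alternate GELU and SwiGLU blocks."""
    def __init__(self, d_model: int = 64, d_hidden: int = 256, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(d_model, d_hidden) if i % 2 else GELUExpert(d_model, d_hidden)
             for i in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        top1 = self.router(x).argmax(dim=-1)              # chosen expert per token
        out = torch.zeros_like(x)
        # For clarity every expert runs on the full input and is masked afterwards;
        # a real MoE dispatches each token only to its routed expert to save compute.
        for i, expert in enumerate(self.experts):
            out = out + expert(x) * (top1 == i).unsqueeze(-1)
        return out
```

In a full encoder or decoder layer, such an MoE block would sit where a dense feed-forward sublayer normally does.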
+
+ ### Multi-Token Prediction with Beam Search
+
+ Multi-token prediction in a language model involves generating several tokens of a continuation rather than committing greedily to a single token at each step. This can improve the fluency and coherence of generated text by allowing the model to "look ahead" and weigh multiple possible continuations at each step.
+
+ **Beam Search** is a popular decoding algorithm for multi-token prediction that lets the model explore several candidate sequences in parallel and choose the most likely one based on overall probability. Here's how it works:
+
+ 1. **Initialization**:
+    - Start with a single "beam" (sequence) containing the initial token, typically the beginning-of-sequence (`<sos>`) token.
+
+ 2. **Expansion**:
+    - At each time step, the model produces a probability distribution over the vocabulary for every sequence in the beam.
+    - Each sequence is expanded with its candidate next tokens, creating a new sequence per candidate.
+
+ 3. **Scoring**:
+    - Score each expanded sequence as the sum (or length-normalized average) of the log probabilities of its tokens. Log probabilities are used to avoid numerical underflow and keep the computation stable.
+
+ 4. **Selection**:
+    - Keep only the k highest-scoring sequences, where k is the "beam width" (or "beam size"), and discard the rest. This bounds the number of sequences tracked at each step, focusing the search on the most promising candidates.
+
+ 5. **Repeat**:
+    - Continue expanding, scoring, and pruning until the sequences reach the desired length or emit the end-of-sequence (`<eos>`) token.
+
+ 6. **Final Output**:
+    - After a set number of steps, or once all sequences end with `<eos>`, return the highest-scoring sequence as the final output.
+
+ This process lets the model generate more fluent and accurate text by considering multiple potential continuations at each step and selecting the best overall sequence.
+
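
As a concrete illustration of steps 1-6, here is a minimal sketch of beam search over token IDs. It assumes a hypothetical `model_step(prefix)` callable that returns log-probabilities over the vocabulary for the next token; the function name and interface are placeholders, not this repository's API.

```python
from typing import Callable, List, Sequence, Tuple

def beam_search(
    model_step: Callable[[Sequence[int]], List[float]],  # prefix -> log-probs over the vocabulary
    sos_id: int,
    eos_id: int,
    beam_width: int = 4,
    max_len: int = 32,
) -> List[int]:
    """Return the highest-scoring token sequence found by a simple beam search."""
    beams: List[Tuple[List[int], float]] = [([sos_id], 0.0)]   # 1. start from <sos> with score 0
    finished: List[Tuple[List[int], float]] = []

    for _ in range(max_len):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:                # sequence already ended with <eos>
                finished.append((tokens, score))
                continue
            log_probs = model_step(tokens)          # 2. expand: distribution over next tokens
            for tok, lp in enumerate(log_probs):
                candidates.append((tokens + [tok], score + lp))   # 3. score by summed log-probs
        if not candidates:                          # every beam has finished
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]             # 4. keep only the top-k sequences

    finished.extend(beams)                          # 6. pick the best overall sequence
    return max(finished, key=lambda c: c[1] / max(len(c[0]), 1))[0]   # length-normalized
```

In practice, `model_step` would run the decoder on the prefix and return the log-softmax of its final logits, and only the few most probable tokens per beam would be expanded rather than the whole vocabulary.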
+ ---
+
+ ### Brief Overview of the Transformer Architecture
+
+ The Transformer architecture, introduced in the paper "Attention Is All You Need," is a powerful neural network design for handling sequential data, especially in natural language processing tasks. Transformers are known for their parallelism and their ability to capture long-range dependencies in data.
+
+ #### Key Components of the Transformer
+
+ 1. **Embeddings and Positional Encoding**:
+    - The input tokens are embedded into dense vectors. Since Transformers, unlike RNNs, do not inherently encode sequence order, they require **positional encodings**. These encodings are added to the embeddings to provide information about each token's position in the sequence.
+
+ 2. **Multi-Head Self-Attention**:
+    - Each token in a sequence attends to every other token, capturing dependencies regardless of distance. Multiple attention heads allow the model to focus on different parts of the sequence and extract varied features.
+    - In self-attention, the model computes **query**, **key**, and **value** vectors for each token. The output is a weighted sum of the values, where the weights are determined by the similarity between the query and key vectors (a minimal sketch of this computation follows the list).
+
+ 3. **Feedforward Neural Networks**:
+    - After self-attention, a position-wise feedforward network is applied to each token independently. It consists of two linear layers with a ReLU or GELU activation in between.
+
+ 4. **Layer Normalization and Residual Connections**:
+    - **Layer normalization** is applied to improve learning stability. Residual connections add a layer's input to its output, allowing gradients to flow more easily during backpropagation.
+
+ 5. **Stacking of Layers**:
+    - The Transformer consists of **multiple encoder and decoder layers**. Each encoder layer has the same structure, combining self-attention and feedforward sublayers. The decoder layers include an additional cross-attention mechanism to attend to the encoder's output.
+
+ 6. **Final Linear and Softmax Layer**:
+    - The final output of the decoder is passed through a linear layer that projects it onto the vocabulary size. A **softmax** function then converts this output into a probability distribution over the vocabulary, from which the next token is selected or sampled.
+
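
A minimal sketch of the attention computation and a single encoder block in plain PyTorch; the dimensions and class names here are illustrative assumptions, not the layers used by this model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # similarity between queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention weights sum to 1 per query
    return weights @ v                                   # weighted sum of the values

class EncoderBlock(nn.Module):
    """One encoder layer: self-attention and feedforward, each with residual + LayerNorm."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                 # every token attends to every other token
        x = self.norm1(x + attn_out)                     # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))                   # position-wise feedforward sublayer
        return x
```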
+ #### Encoder-Decoder Structure
+
+ - **Encoder**: The encoder processes the input sequence into a contextualized representation that captures relationships between tokens. It consists of multiple layers of self-attention and feedforward networks.
+ - **Decoder**: The decoder generates the output sequence by attending to both the encoded input representation (using cross-attention) and previously generated tokens (using self-attention). The decoder's output is used to predict the next token in the sequence.
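
A hedged sketch of how the encoder, decoder, and final linear + softmax fit together, using PyTorch's built-in Transformer layers; the sizes and the `TinySeq2Seq` name are assumptions for illustration, not this model's configuration.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Encoder-decoder wiring with a final linear projection and softmax over the vocabulary."""
    def __init__(self, vocab_size: int = 1000, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.lm_head = nn.Linear(d_model, vocab_size)     # projects onto the vocabulary size

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(self.embed(src_ids))        # contextualized input representation
        t = tgt_ids.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # block future positions
        hidden = self.decoder(self.embed(tgt_ids), memory, tgt_mask=causal)  # self- and cross-attention
        return torch.softmax(self.lm_head(hidden), dim=-1)  # next-token probability distribution

# Usage: probabilities over the vocabulary for each target position.
model = TinySeq2Seq()
probs = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
```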
+
+
  2. **Representation Network**: This module encodes the Transformer output to generate a state representation, reducing dimensionality and making it suitable for further processing.
  3. **Dynamics Network**: This module predicts the next state given a current state and an action. It uses layer normalization and a GELU activation function.
  4. **Prediction Network**: Predicts both the policy logits and value estimates for a given state. It outputs the probabilities of different actions as well as a single scalar value.
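
The dynamics and prediction networks described in items 3 and 4 can be pictured as small heads on top of the state representation. The sketch below is an illustration under assumed hidden sizes, action encoding, and names; it is not the checkpoint's actual modules.

```python
import torch
import torch.nn as nn

class DynamicsNetwork(nn.Module):
    """Predicts the next latent state from the current state and an action embedding."""
    def __init__(self, state_dim: int = 128, action_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, state_dim),
            nn.LayerNorm(state_dim),   # layer normalization, as described above
            nn.GELU(),                 # GELU activation, as described above
            nn.Linear(state_dim, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

class PredictionNetwork(nn.Module):
    """Outputs policy logits over actions and a scalar value estimate for a state."""
    def __init__(self, state_dim: int = 128, num_actions: int = 16):
        super().__init__()
        self.policy_head = nn.Linear(state_dim, num_actions)  # action probabilities after softmax
        self.value_head = nn.Linear(state_dim, 1)             # single scalar value

    def forward(self, state: torch.Tensor):
        return self.policy_head(state), self.value_head(state).squeeze(-1)
```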
 
@@ -46,6 +107,9 @@ thought_1 = {P1, ... , PN}
 
  The model explores and exploits thoughts, policies, actions, and tokens, and learning happens at each level of granularity.
 
+
+
+
  ## Training Details
 
  The model is trained with the following components and techniques: