What do we know about the architecture so far?
Hi,
Has anyone got any info about the architecture?
I suppose it's a MoE? What's the number of total and active params?
Does it support audio or vision input?
Also, this is the chat/instruct version, right?
~250B total params, since it's bf16 and ~500 GB of total disk space
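Quick sanity check of that estimate (assuming essentially the whole ~500 GB checkpoint is bf16 weights at 2 bytes per parameter, ignoring metadata):

```python
checkpoint_bytes = 500e9   # ~500 GB of shards on disk (rough figure)
bytes_per_param = 2        # bf16 stores 2 bytes per weight
print(f"~{checkpoint_bytes / bytes_per_param / 1e9:.0f}B params")  # -> ~250B
```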
Seems to have shared/common layers like DeepSeek and Llama 4.
MoE, no vision support, from my rough calculations, it's something like ~260B-A30B?
About 270B params MoE, ~115B active (2 experts out of 8). The shared FFN layers are very large. The tensors seem to be pre-sharded for 8-way tensor parallelism.
The Grok-2 model has a total of ~270B parameters (269,515,497,472).
Of these, the activated parameters per forward pass (only 2 of the 8 MoE experts are selected in each layer) come to ~115B (114,896,674,816).
- Total parameters: ~270B
- Activated parameters: ~115B
The following is a breakdown of the parameter counts for each component of the model.
1. Embedding layer
The embedding parameters consist of `embed_tokens` and `lm_head`, each of size 131,072 × 8,192.
2 × 131,072 × 8,192 = 2,147,483,648
2. Normalization layers
Each of the 64 transformer blocks contains four norms (`pre_attn_norm`, `post_attn_norm`, `pre_moe_norm`, `post_moe_norm`), plus one final `model.norm`.
(8,192 × 4 × 64) + 8,192 = 2,105,344
3. Attention layers
Each block has four projection matrices:
- `q_proj`, `o_proj`: 8,192 × 8,192
- `k_proj`, `v_proj`: 1,024 × 8,192
Per layer:
(8,192 × 8,192 × 2) + (1,024 × 8,192 × 2) = 150,994,944
Across 64 layers:
150,994,944 × 64 = 9,663,676,416
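As a side note, here is a small Python sketch of that per-layer attention count; the head-count interpretation at the end is my own assumption (a head dim of 128 would imply 64 query heads and 8 KV heads, i.e. grouped-query attention), and only the projection shapes are taken from the tensors:

```python
hidden, layers = 8192, 64
q_out, kv_out = 8192, 1024    # widths of q_proj/o_proj and k_proj/v_proj

per_layer = 2 * hidden * q_out + 2 * hidden * kv_out
print(f"{per_layer:,}")            # 150,994,944
print(f"{per_layer * layers:,}")   # 9,663,676,416

# Assumed interpretation (not confirmed): head_dim = 128 would give
# 64 query heads and 8 key/value heads (grouped-query attention).
print(q_out // 128, kv_out // 128)  # 64 8
```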
4. Shared Feed-Forward (FFN)
Each layer has three shared projections (`gate_proj`, `down_proj`, `up_proj`), each of size 32,768 × 8,192.
32,768 × 8,192 × 3 = 805,306,368
Across 64 layers:
805,306,368 × 64 = 51,539,607,552
5. MoE experts
In Grok-2, each Mixture-of-Experts (MoE) layer contains 8 experts.
Each expert has three weight matrices (`w1`, `w2`, `w3`), which correspond to the feed-forward projections inside the expert.
Because Grok-2 uses Tensor Parallelism (TP = 8), each expert is split into 8 shards, and each shard holds one eighth of its parameters.
- Parameters per expert shard (including w1, w2, w3):
2,048 × 8,192 × 3 = 50,331,648
- Per layer (8 experts, each split into 8 shards under TP=8):
50,331,648 × 8 × 8 = 3,221,225,472
- Across all 64 layers:
3,221,225,472 × 64 = 206,158,430,208
However, during a forward pass only 2 of the 8 experts in each layer are activated.
Thus the effective MoE parameters per forward pass are:
206,158,430,208 × (2/8) = 51,539,607,552
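A minimal sketch of the expert arithmetic under the assumptions above (shard shapes as listed, TP = 8, top-2 routing); variable names are illustrative, not taken from the checkpoint:

```python
shard = 2048 * 8192 * 3            # w1 + w2 + w3 slices of one expert shard
experts, tp, layers = 8, 8, 64

per_expert = shard * tp            # 402,653,184 params for a full (unsharded) expert
per_layer = per_expert * experts   # 3,221,225,472
total_moe = per_layer * layers
print(f"{total_moe:,}")            # 206,158,430,208
print(f"{total_moe * 2 // 8:,}")   # 51,539,607,552 activated (top-2 of 8)
```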
6. MoE router gates
Each MoE layer also has a small router (gate) matrix of size 8,192 × 8 that produces the scores used to pick the top-2 experts.
8,192 × 8 × 64 = 4,194,304

Parameter Breakdown Table
| Component | Formula | Parameter Count |
|---|---|---|
| Embedding layer | 2 × 131,072 × 8,192 | 2,147,483,648 |
| Norm layers | (8,192 × 4 × 64) + 8,192 | 2,105,344 |
| Attention (64 layers) | [(8,192 × 8,192 × 2) + (1,024 × 8,192 × 2)] × 64 | 9,663,676,416 |
| Shared FFN (64 layers) | (32,768 × 8,192 × 3) × 64 | 51,539,607,552 |
| MoE router gates (64 layers) | (8,192 × 8) × 64 | 4,194,304 |
| MoE experts (64 layers, full) | (2,048 × 8,192 × 3 × 8 × 8) × 64 | 206,158,430,208 |
| ↳ Activated MoE (2 of 8 experts) | [(2,048 × 8,192 × 3 × 8 × 8) × 64] × (2/8) | 51,539,607,552 |
| Total parameters | — | 269,515,497,472 |
| Total activated parameters | — | 114,896,674,816 |
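For anyone who wants to re-check the table, here is a short back-of-the-envelope Python script that reproduces it from the tensor shapes above (my own sketch, not code from the repo; the router term is the 8,192 × 8 gate per layer):

```python
# Reconstruct the Grok-2 parameter breakdown from the shapes listed above.
hidden, vocab, layers = 8192, 131072, 64
experts, active_experts, tp = 8, 2, 8

embedding  = 2 * vocab * hidden                                # embed_tokens + lm_head
norms      = hidden * 4 * layers + hidden                      # 4 norms per block + final norm
attention  = (2 * hidden * 8192 + 2 * hidden * 1024) * layers  # q/o and k/v projections
shared_ffn = 32768 * hidden * 3 * layers                       # gate/up/down projections
router     = hidden * experts * layers                         # MoE gate per layer
moe        = 2048 * hidden * 3 * tp * experts * layers         # expert shards x TP x experts

total = embedding + norms + attention + shared_ffn + router + moe
activated = total - moe + moe * active_experts // experts

print(f"total:     {total:,}")        # 269,515,497,472
print(f"activated: {activated:,}")    # 114,896,674,816
```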