What do we know about the architecture so far?

#6 opened by amgadhasan

Hi,

Has anyone got any info about the architecture?

I suppose it's an MoE? What's the number of total and active params?

Does it support audio or vision input?

Also, this is the chat/instruct version, right?

~250B total params, since it's bf16 and ~500 GB of files in total.
It seems to have shared/common layers like DeepSeek and Llama 4.
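
A quick sanity check on that estimate (assuming the repo really is ~500 GB of bf16 weights, i.e. 2 bytes per parameter):

```python
# Rough parameter count from checkpoint size, assuming bf16 (2 bytes per param)
checkpoint_bytes = 500e9   # ~500 GB on disk (approximate)
bytes_per_param = 2        # bf16
print(f"~{checkpoint_bytes / bytes_per_param / 1e9:.0f}B params")  # -> ~250B params
```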

MoE, no vision support. From my rough calculations, it's something like ~260B-A30B?

Reading the config attached to the model repo, it's the same architecture as Grok-1: a 314B MoE with 8 experts, 2 active at inference. No additional capabilities (same as Grok-1).

About 270B params MoE, ~115B active (2 experts out of 8). The shared FFN layers are very large, and the tensors appear to be pre-sharded for 8-way tensor parallelism.

The Grok-2 model has a total of ~270B parameters (269,515,497,472).
Of these, the parameters activated per forward pass (since only 2 of the 8 MoE experts are selected) come to ~115B (115,019,056,768).

  • Total parameters: ~270B
  • Activated parameters: ~115B
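
These exact counts can be reproduced by summing tensor shapes straight from the checkpoint. A minimal sketch (assuming the weights are plain `.safetensors` shards; the local directory name below is hypothetical):

```python
import glob
from safetensors import safe_open

total = 0
for path in sorted(glob.glob("grok-2/*.safetensors")):   # hypothetical local path
    with safe_open(path, framework="np") as f:            # read shapes without loading tensors
        for name in f.keys():
            shape = f.get_slice(name).get_shape()
            n = 1
            for dim in shape:
                n *= dim
            total += n

print(f"total parameters: {total:,}")
```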

The following is a breakdown of the parameter counts for each component of the model.

1. Embedding layer

The embedding parameters consist of embed_tokens and lm_head, each with a size of 131,072 × 8,192.

2 × 131,072 × 8,192 = 2,147,483,648

2. Normalization layers

Each of the 64 transformer blocks contains four norms (pre_attn_norm, post_attn_norm, pre_moe_norm, post_moe_norm), plus one final model.norm.

(8,192 × 4 × 64) + 8,192 = 2,105,344

3. Attention layers

Each block has four projection matrices:

  • q_proj, o_proj: 8,192 × 8,192
  • k_proj, v_proj: 1,024 × 8,192

Per layer:

(8,192 × 8,192 × 2) + (1,024 × 8,192 × 2) = 150,994,944

Across 64 layers:

150,994,944 × 64 = 9,663,676,416
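
As an aside: if the head dimension is 128 (an assumption, not something stated in this thread), those projection widths correspond to 64 query heads and 8 KV heads, i.e. grouped-query attention:

```python
hidden_size = 8192
q_out, kv_out = 8192, 1024          # output widths of q_proj and k_proj/v_proj
head_dim = 128                      # assumed head dimension

num_q_heads = q_out // head_dim     # 64
num_kv_heads = kv_out // head_dim   # 8 -> GQA, 8 query heads per KV head

per_layer = 2 * hidden_size * q_out + 2 * hidden_size * kv_out
print(num_q_heads, num_kv_heads, f"{per_layer:,}")   # 64 8 150,994,944
```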

4. Shared Feed-Forward (FFN)

Each layer has three shared projections (gate_proj, down_proj, up_proj), each of size 32,768 × 8,192.

32,768 × 8,192 × 3 = 805,306,368

Across 64 layers:

805,306,368 × 64 = 51,539,607,552

5. MoE experts

In Grok-2, each Mixture-of-Experts (MoE) layer contains 8 experts.
Each expert has three weight matrices (w1, w2, w3), which correspond to the feed-forward projections inside the expert.

Because Grok-2 uses Tensor Parallelism (TP = 8), each expert is split into 8 shards, and each shard holds a fraction of the parameters.

  • Parameters per expert shard (including w1, w2, w3):
2,048 × 8,192 × 3 = 50,331,648
  • Per layer (8 experts, each split into 8 shards under TP=8):
50,331,648 × 8 × 8 = 3,221,225,472
  • Across all 64 layers:
3,221,225,472 × 64 = 206,158,430,208

However, during a forward pass only 2 of the 8 experts in each layer are activated.
Thus the effective MoE parameters per forward pass are:

206,158,430,208 × (2/8) = 51,539,607,552

In addition, each MoE layer has a small router/gate of size 8,192 × 8 (one logit per expert). Across 64 layers that is another 8,192 × 8 × 64 = 4,194,304 parameters, which accounts for the remainder of the stated total.
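
The same MoE arithmetic in a few lines of Python (the router term is the one noted just above):

```python
hidden, expert_ffn_shard, n_experts, tp, n_layers, top_k = 8192, 2048, 8, 8, 64, 2

per_expert = 3 * expert_ffn_shard * hidden * tp    # w1, w2, w3 across all TP shards
moe_full   = per_expert * n_experts * n_layers     # 206,158,430,208
moe_active = moe_full * top_k // n_experts         # 51,539,607,552
router     = hidden * n_experts * n_layers         # 4,194,304 (one logit per expert)

print(f"{moe_full:,}  {moe_active:,}  {router:,}")
```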

Parameter Breakdown Table

| Component | Formula | Parameter Count |
| --- | --- | --- |
| Embedding layer | 2 × 131,072 × 8,192 | 2,147,483,648 |
| Norm layers | (8,192 × 4 × 64) + 8,192 | 2,105,344 |
| Attention (64 layers) | [(8,192 × 8,192 × 2) + (1,024 × 8,192 × 2)] × 64 | 9,663,676,416 |
| Shared FFN (64 layers) | (32,768 × 8,192 × 3) × 64 | 51,539,607,552 |
| MoE routers (64 layers) | (8,192 × 8) × 64 | 4,194,304 |
| MoE experts (64 layers, full) | (2,048 × 8,192 × 3 × 8 × 8) × 64 | 206,158,430,208 |
| ↳ Activated MoE (2 of 8 experts) | [(2,048 × 8,192 × 3 × 8 × 8) × 64] × (2/8) | 51,539,607,552 |
| **Total parameters** | | 269,515,497,472 |
| **Total activated parameters** | | 115,019,056,768 |
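
Putting it all together, a short script that recomputes the whole table from the shapes listed above:

```python
# Recompute the parameter breakdown from the listed tensor shapes
vocab, hidden, layers = 131_072, 8_192, 64
kv_dim, shared_ffn = 1_024, 32_768
expert_ffn_shard, n_experts, tp, top_k = 2_048, 8, 8, 2

embedding = 2 * vocab * hidden                                    # embed_tokens + lm_head
norms     = 4 * hidden * layers + hidden                          # 4 norms per block + final norm
attention = (2 * hidden * hidden + 2 * kv_dim * hidden) * layers  # q/o + k/v projections
shared    = 3 * shared_ffn * hidden * layers                      # gate/down/up projections
router    = hidden * n_experts * layers                           # MoE gate, one logit per expert
experts   = 3 * expert_ffn_shard * hidden * tp * n_experts * layers

total  = embedding + norms + attention + shared + router + experts
active = total - experts * (n_experts - top_k) // n_experts        # keep only 2 of 8 experts

print(f"total:     {total:,}")    # 269,515,497,472
print(f"activated: {active:,}")   # 114,896,674,816 (~115B)
```

Note that this simple "2 of 8 experts" count lands within about 0.1% of the 115,019,056,768 quoted above; the small gap presumably comes down to which tensors are attributed to the activated set.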
