Is the vocab size different from the embedding size?

#1
by zhichao-geng - opened

Thanks for your great work!

I found that tokenizer.vocab_size is 30522 and the tokenizer contains 30522 tokens.
However, the embeddings and the cls decoder have a shape of (768, 30528). Should we just ignore the last 6 coordinates of the embedding?
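
For reference, a minimal sketch to reproduce the shapes I'm seeing (the model ID below is a placeholder, not the actual checkpoint name):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder ID; substitute the actual checkpoint from this repo.
model_id = "Alibaba-NLP/<this-model>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

print(tokenizer.vocab_size)                       # 30522
print(model.get_input_embeddings().weight.shape)  # torch.Size([30528, 768])
```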

Alibaba-NLP org

Yes.
We pad the embedding to 30528 so that its size is divisible by 64, which is said to improve computational efficiency on GPUs.
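
In other words, valid token IDs are always below 30522, so the padding rows are never looked up; if you work with the MLM logits you can simply slice them off. A minimal sketch (the random tensor stands in for real model output):

```python
import torch

vocab_size = 30522   # tokens the tokenizer can actually produce
padded_size = 30528  # 64 * 477, padded for GPU-friendly matrix shapes

# Dummy (batch, seq, padded_vocab) logits standing in for real MLM output.
logits = torch.randn(1, 16, padded_size)

# Token IDs are always < vocab_size, so the padding rows are never indexed;
# for logits, just drop the 6 padding columns:
logits = logits[..., :vocab_size]
print(logits.shape)  # torch.Size([1, 16, 30522])
```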
