Causal masking used, but EmbeddingGemma is supposed to use bidirectional attention

#36
by tltl123 - opened

Hi, just to check: I read that EmbeddingGemma uses bidirectional attention.

But from what I can see in the transformers code, it seems that a causal mask is used.

This would produce different results from actual bidirectional attention.

Is this intended/correct?

Hi there 👋 A bidirectional mask is indeed necessary for this model (see https://huggingface.co/google/embeddinggemma-300m/blob/main/config.json#L57).

Make sure you're using the latest version of transformers, where this is indeed taken into account: https://github.com/huggingface/transformers/blob/a7f29523361b2cc12e51c1f5133d95f122f6f45c/src/transformers/models/gemma3/modeling_gemma3.py#L565
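If you want to verify this yourself, here is a minimal sketch that loads the published config and prints the attention flag. The attribute name `use_bidirectional_attention` is an assumption based on the config.json line linked above; if it isn't present, check that file directly for the exact field name.

```python
from transformers import AutoConfig

# Load the published config for EmbeddingGemma and inspect the attention flag.
# The attribute name below is assumed from the linked config.json; if it is
# missing on your transformers version, look up the exact field in that file.
config = AutoConfig.from_pretrained("google/embeddinggemma-300m")
print(getattr(config, "use_bidirectional_attention", "<flag not found>"))
```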

Google org

Hi @tltl123
You've correctly identified that the model inherits its architecture from a causal decoder, which by default is wired to prevent tokens from attending to later positions. To adapt this for embeddings, the config enables a flag, use_bidirectional_attention: true (see the config.json line linked above). This flag overrides the standard triangular causal mask, enabling an "all-to-all" attention pattern where every token can attend to every other token in the sequence.
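To make the difference concrete, here is a standalone sketch in plain PyTorch (not the actual transformers implementation) contrasting the triangular causal mask with the all-to-all pattern described above:

```python
import torch

seq_len = 4

# Standard causal mask: token i may only attend to tokens j <= i
# (lower-triangular matrix of allowed positions).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional override: every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
print(bidirectional_mask)  # all True: the "all-to-all" pattern
```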
Additionally, as mentioned by Xenova, please make sure your transformers installation is recent enough to include EmbeddingGemma support, since only the newer releases contain the logic that performs this bidirectional switching (see the modeling_gemma3.py line linked above).
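A quick way to check what you have installed (a trivial sketch):

```python
import transformers

# Print the installed version so you can compare it against the minimum
# version listed on the model card.
print(transformers.__version__)
```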
Thanks
