Causal masking used, but EmbeddingGemma is supposed to use bidirectional attention
Hi, just to check: I read that EmbeddingGemma uses bidirectional attention.
However, from what I can see in the transformers code, a causal mask is used.
This would produce different results from actual bidirectional attention.
Is this intended/correct?
Hi there 👋 A bidirectional mask is indeed necessary for this model (see https://huggingface.co/google/embeddinggemma-300m/blob/main/config.json#L57).
Make sure you're using the latest version of transformers, where this is handled correctly: https://github.com/huggingface/transformers/blob/a7f29523361b2cc12e51c1f5133d95f122f6f45c/src/transformers/models/gemma3/modeling_gemma3.py#L565
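For a quick sanity check, here is a minimal sketch (it assumes you have accepted the model's terms on the Hub so the config can be downloaded, and that the config exposes the bidirectional-attention flag shown in the config.json linked above):

```python
# Minimal sketch: print the installed transformers version and the
# bidirectional-attention flag read from the model's config.json.
import transformers
from transformers import AutoConfig

print(transformers.__version__)

config = AutoConfig.from_pretrained("google/embeddinggemma-300m")
# Expected to be True for this model; getattr is used in case the
# attribute name differs in your installed version.
print(getattr(config, "use_bidirectional_attention", None))
```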
Hi @tltl123,
You’ve correctly identified that the model inherits its architecture from a causal decoder, which by default is "wired" to prevent tokens from looking ahead. To adapt it for embeddings, the model's config sets `use_bidirectional_attention: true`. This flag acts as a global override that bypasses the standard triangular causal mask, enabling an "all-to-all" attention pattern where every token can attend to every other token in the sequence.
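For intuition, here is a tiny sketch (plain PyTorch, not the actual transformers mask code) contrasting the two mask shapes for a 4-token sequence:

```python
import torch

seq_len = 4

# Causal (lower-triangular) mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional "all-to-all" mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask)
print(bidirectional_mask)
```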
Additionally, as Xenova mentioned, please make sure your transformers installation is on version 4.56 or higher, as earlier releases lack the logic required for the bidirectional switching.
Thanks