Causal masking used, but EmbeddingGemma is supposed to use bidirectional attention

#36
by tltl123 - opened

Hi, just to check: I read that EmbeddingGemma uses bidirectional attention.

But from what I can see in the transformers code, it seems that a causal mask is used.

This would produce different results from actual bidirectional attention.

Is this intended/correct?

Hi there 👋 A bidirectional mask is indeed necessary for this model (see https://huggingface.co/google/embeddinggemma-300m/blob/main/config.json#L57).

Make sure you're using the latest version of transformers, where this is indeed taken into account: https://github.com/huggingface/transformers/blob/a7f29523361b2cc12e51c1f5133d95f122f6f45c/src/transformers/models/gemma3/modeling_gemma3.py#L565
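If you want to verify this yourself, here is a minimal sketch that loads the published config and prints the attention flag. The attribute name `use_bidirectional_attention` is an assumption based on the config.json line linked above; if it isn't present, check that file directly for the exact field name.

```python
from transformers import AutoConfig

# Load the published config for EmbeddingGemma and inspect the attention flag.
# The attribute name below is assumed from the linked config.json; if it is
# missing on your transformers version, look up the exact field in that file.
config = AutoConfig.from_pretrained("google/embeddinggemma-300m")
print(getattr(config, "use_bidirectional_attention", "<flag not found>"))
```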

Google org

Hi @tltl123
You've correctly identified that the model inherits its architecture from a causal decoder, which by default is wired to prevent tokens from attending to later positions. To adapt this for embeddings, the config enables a flag, use_bidirectional_attention: true (see the config.json line linked above). This flag overrides the standard triangular causal mask, enabling an "all-to-all" attention pattern where every token can attend to every other token in the sequence.
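To make the difference concrete, here is a standalone sketch in plain PyTorch (not the actual transformers implementation) contrasting the triangular causal mask with the all-to-all pattern described above:

```python
import torch

seq_len = 4

# Standard causal mask: token i may only attend to tokens j <= i
# (lower-triangular matrix of allowed positions).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional override: every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
print(bidirectional_mask)  # all True: the "all-to-all" pattern
```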
Additionally, as mentioned by Xenova, please make sure your transformers installation is recent enough to include EmbeddingGemma support, since only the newer releases contain the logic that performs this bidirectional switching (see the modeling_gemma3.py line linked above).
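A quick way to check what you have installed (a trivial sketch):

```python
import transformers

# Print the installed version so you can compare it against the minimum
# version listed on the model card.
print(transformers.__version__)
```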
Thanks
