nasselt48 committed f9ddfe4 (verified) · 1 parent: 0bceb1f

Upload 13 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,529 @@
1
- ---
2
- license: gemma
3
- ---
1
+ ---
2
+ license: gemma
3
+ library_name: transformers
4
+ pipeline_tag: image-text-to-text
5
+ extra_gated_heading: Access Gemma on Hugging Face
6
+ extra_gated_prompt: >-
7
+ To access Gemma on Hugging Face, you’re required to review and agree to
8
+ Google’s usage license. To do this, please ensure you’re logged in to Hugging
9
+ Face and click below. Requests are processed immediately.
10
+ extra_gated_button_content: Acknowledge license
11
+ base_model: google/gemma-3n-e4b
12
+ tags:
13
+ - automatic-speech-recognition
14
+ - automatic-speech-translation
15
+ - audio-text-to-text
16
+ - video-text-to-text
17
+ ---
18
+
19
+ > [!Note]
20
+ > This repository corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face `transformers`,
21
+ > supporting text, audio, and vision (image and video) inputs.
22
+ >
23
+ > Gemma 3n models have multiple architecture innovations:
24
+ > * They are available in two sizes based on [effective parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters). While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
25
+ > * They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an [E2B](https://huggingface.co/google/gemma-3n-E2B-it)), or you can access a spectrum of custom-sized models using the [Mix-and-Match method](https://goo.gle/gemma3n-matformer-lab).
26
+ >
27
+ > Learn more about these techniques in the [technical blog post](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide)
28
+ > and the [Gemma documentation](https://ai.google.dev/gemma/docs/gemma-3n).
29
+
30
+ # Gemma 3n model card
31
+
32
+ **Model Page**: [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)
33
+
34
+ **Resources and Technical Documentation**:
35
+
36
+ - [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
37
+ - [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma-3n)
38
+ - [Gemma on HuggingFace](https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4)
39
+ - [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3n)
40
+
41
+ **Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)\
42
+ **Authors**: Google DeepMind
43
+
44
+ ## Model Information
45
+
46
+ Summary description and brief definition of inputs and outputs.
47
+
48
+ ### Description
49
+
50
+ Gemma is a family of lightweight, state-of-the-art open models from Google,
51
+ built from the same research and technology used to create the Gemini models.
52
+ Gemma 3n models are designed for efficient execution on low-resource devices.
53
+ They are capable of multimodal input, handling text, image, video, and audio
54
+ input, and generating text outputs, with open weights for pre-trained and
55
+ instruction-tuned variants. These models were trained with data in over 140
56
+ spoken languages.
57
+
58
+ Gemma 3n models use selective parameter activation technology to reduce resource
59
+ requirements. This technique allows the models to operate at an effective size
60
+ of 2B or 4B parameters, which is lower than the total number of parameters they
61
+ contain. For more information on Gemma 3n's efficient parameter management
62
+ technology, see the
63
+ [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters)
64
+ page.
65
+
66
+ ### Inputs and outputs
67
+
68
+ - **Input:**
69
+ - Text string, such as a question, a prompt, or a document to be
70
+ summarized
71
+ - Images, normalized to 256x256, 512x512, or 768x768 resolution
72
+ and encoded to 256 tokens each
73
+ - Audio data encoded to 6.25 tokens per second from a single channel
74
+ - Total input context of 32K tokens (see the token-budget sketch after this list)
75
+ - **Output:**
76
+ - Generated text in response to the input, such as an answer to a
77
+ question, analysis of image content, or a summary of a document
78
+ - Total output length up to 32K tokens, subtracting the request
79
+ input tokens
80
+
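+ As a rough illustration of how these rates add up against the 32K context
+ window, here is a back-of-the-envelope sketch using only the figures listed
+ above (the exact counts are determined by the processor at runtime):
+
+ ```python
+ # Approximate context-budget arithmetic from the rates above (illustrative only).
+ context_limit = 32_768            # 32K-token input context
+ image_tokens = 2 * 256            # two images at 256 tokens each
+ audio_tokens = int(30 * 6.25)     # a 30-second clip at 6.25 tokens/second
+ text_budget = context_limit - image_tokens - audio_tokens
+ print(text_budget)                # 32768 - 512 - 187 = 32069 tokens left for text
+ ```
+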
81
+ ### Usage
82
+
83
+ Below are some code snippets to help you get started quickly with running
84
+ the model. First, install the Transformers library. Gemma 3n is supported
85
+ starting from transformers 4.53.0.
86
+
87
+ ```sh
88
+ $ pip install -U transformers
89
+ ```
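+
+ If you want to confirm that the installed version meets this requirement, a
+ quick check is:
+
+ ```python
+ import transformers
+
+ # Gemma 3n support starts in transformers 4.53.0.
+ print(transformers.__version__)
+ ```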
90
+
91
+ Then, copy the snippet from the section that is relevant for your use case.
92
+
93
+ #### Running with the `pipeline` API
94
+
95
+ You can initialize the model and processor for inference with `pipeline` as
96
+ follows.
97
+
98
+ ```python
99
+ from transformers import pipeline
100
+ import torch
101
+
102
+ pipe = pipeline(
103
+ "image-text-to-text",
104
+ model="google/gemma-3n-e4b-it",
105
+ device="cuda",
106
+ torch_dtype=torch.bfloat16,
107
+ )
108
+ ```
109
+
110
+ With instruction-tuned models, you need to use chat templates to process your
111
+ inputs first. Then, you can pass them to the pipeline.
112
+
113
+ ```python
114
+ messages = [
115
+ {
116
+ "role": "system",
117
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
118
+ },
119
+ {
120
+ "role": "user",
121
+ "content": [
122
+ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
123
+ {"type": "text", "text": "What animal is on the candy?"}
124
+ ]
125
+ }
126
+ ]
127
+
128
+ output = pipe(text=messages, max_new_tokens=200)
129
+ print(output[0]["generated_text"][-1]["content"])
130
+ # Okay, let's take a look!
131
+ # Based on the image, the animal on the candy is a **turtle**.
132
+ # You can see the shell shape and the head and legs.
133
+ ```
134
+
135
+ #### Running the model on a single GPU
136
+
137
+ ```python
138
+ from transformers import AutoProcessor, Gemma3nForConditionalGeneration
139
+ from PIL import Image
140
+ import requests
141
+ import torch
142
+
143
+ model_id = "google/gemma-3n-e4b-it"
144
+
145
+ model = Gemma3nForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16).eval()
146
+
147
+ processor = AutoProcessor.from_pretrained(model_id)
148
+
149
+ messages = [
150
+ {
151
+ "role": "system",
152
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
153
+ },
154
+ {
155
+ "role": "user",
156
+ "content": [
157
+ {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
158
+ {"type": "text", "text": "Describe this image in detail."}
159
+ ]
160
+ }
161
+ ]
162
+
163
+ inputs = processor.apply_chat_template(
164
+ messages,
165
+ add_generation_prompt=True,
166
+ tokenize=True,
167
+ return_dict=True,
168
+ return_tensors="pt",
169
+ ).to(model.device)
170
+
171
+ input_len = inputs["input_ids"].shape[-1]
172
+
173
+ with torch.inference_mode():
174
+ generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
175
+ generation = generation[0][input_len:]
176
+
177
+ decoded = processor.decode(generation, skip_special_tokens=True)
178
+ print(decoded)
179
+
180
+ # **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
181
+ # focusing on a cluster of pink cosmos flowers and a busy bumblebee.
182
+ # It has a slightly soft, natural feel, likely captured in daylight.
183
+ ```
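+
+ Gemma 3n also accepts audio in the same chat-template format. Below is a
+ minimal sketch of audio transcription, reusing `model`, `processor`, and the
+ imports from the previous snippet; it assumes the processor accepts content
+ items of type `audio` (as declared in the chat template) keyed by a local file
+ path or URL, shown here as a placeholder path.
+
+ ```python
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "audio", "audio": "/path/to/speech.wav"},  # placeholder path
+             {"type": "text", "text": "Transcribe this audio clip."}
+         ]
+     }
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.inference_mode():
+     generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
+
+ input_len = inputs["input_ids"].shape[-1]
+ print(processor.decode(generation[0][input_len:], skip_special_tokens=True))
+ ```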
184
+
185
+ ### Citation
186
+
187
+ ```
188
+ @article{gemma_3n_2025,
189
+ title={Gemma 3n},
190
+ url={https://ai.google.dev/gemma/docs/gemma-3n},
191
+ publisher={Google DeepMind},
192
+ author={Gemma Team},
193
+ year={2025}
194
+ }
195
+ ```
196
+
197
+ ## Model Data
198
+
199
+ Data used for model training and how the data was processed.
200
+
201
+ ### Training Dataset
202
+
203
+ These models were trained on a dataset that includes a wide variety of sources
204
+ totalling approximately 11 trillion tokens. The knowledge cutoff date for the
205
+ training data was June 2024. Here are the key components:
206
+
207
+ - **Web Documents**: A diverse collection of web text ensures the model
208
+ is exposed to a broad range of linguistic styles, topics, and vocabulary.
209
+ The training dataset includes content in over 140 languages.
210
+ - **Code**: Exposing the model to code helps it to learn the syntax and
211
+ patterns of programming languages, which improves its ability to generate
212
+ code and understand code-related questions.
213
+ - **Mathematics**: Training on mathematical text helps the model learn
214
+ logical reasoning and symbolic representation, and to address mathematical queries.
215
+ - **Images**: A wide range of images enables the model to perform image
216
+ analysis and visual data extraction tasks.
217
+ - **Audio**: A diverse set of sound samples enables the model to recognize
218
+ speech, transcribe text from recordings, and identify information in audio data.
219
+
220
+ The combination of these diverse data sources is crucial for training a
221
+ powerful multimodal model that can handle a wide variety of tasks and
222
+ data formats.
223
+
224
+ ### Data Preprocessing
225
+
226
+ Here are the key data cleaning and filtering methods applied to the training
227
+ data:
228
+
229
+ - **CSAM Filtering**: Rigorous CSAM (Child Sexual Abuse Material)
230
+ filtering was applied at multiple stages in the data preparation process to
231
+ ensure the exclusion of harmful and illegal content.
232
+ - **Sensitive Data Filtering**: As part of making Gemma pre-trained models
233
+ safe and reliable, automated techniques were used to filter out certain
234
+ personal information and other sensitive data from training sets.
235
+ - **Additional methods**: Filtering based on content quality and safety in
236
+ line with
237
+ [our policies](https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf).
238
+
239
+ ## Implementation Information
240
+
241
+ Details about the model internals.
242
+
243
+ ### Hardware
244
+
245
+ Gemma was trained using [Tensor Processing Unit
246
+ (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu) hardware (TPUv4p, TPUv5p
247
+ and TPUv5e). Training generative models requires significant computational
248
+ power. TPUs, designed specifically for matrix operations common in machine
249
+ learning, offer several advantages in this domain:
250
+
251
+ - **Performance**: TPUs are specifically designed to handle the massive
252
+ computations involved in training generative models. They can speed up
253
+ training considerably compared to CPUs.
254
+ - **Memory**: TPUs often come with large amounts of high-bandwidth memory,
255
+ allowing for the handling of large models and batch sizes during training.
256
+ This can lead to better model quality.
257
+ - **Scalability**: TPU Pods (large clusters of TPUs) provide a scalable
258
+ solution for handling the growing complexity of large foundation models.
259
+ You can distribute training across multiple TPU devices for faster and more
260
+ efficient processing.
261
+ - **Cost-effectiveness**: In many scenarios, TPUs can provide a more
262
+ cost-effective solution for training large models compared to CPU-based
263
+ infrastructure, especially when considering the time and resources saved
264
+ due to faster training.
265
+
266
+ These advantages are aligned with
267
+ [Google's commitments to operate sustainably](https://sustainability.google/operating-sustainably/).
268
+
269
+ ### Software
270
+
271
+ Training was done using [JAX](https://github.com/jax-ml/jax) and
272
+ [ML Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).
273
+ JAX allows researchers to take advantage of the latest generation of hardware,
274
+ including TPUs, for faster and more efficient training of large models. ML
275
+ Pathways is Google's latest effort to build artificially intelligent systems
276
+ capable of generalizing across multiple tasks. This is especially suitable for
277
+ foundation models, including large language models like these.
278
+
279
+ Together, JAX and ML Pathways are used as described in the
280
+ [paper about the Gemini family of models](https://goo.gle/gemma2report):
281
+ *"the 'single controller' programming model of Jax and Pathways allows a single
282
+ Python process to orchestrate the entire training run, dramatically simplifying
283
+ the development workflow."*
284
+
285
+ ## Evaluation
286
+
287
+ Model evaluation metrics and results.
288
+
289
+ ### Benchmark Results
290
+
291
+ These models were evaluated at full precision (float32) against a large
292
+ collection of different datasets and metrics to cover different aspects of
293
+ content generation. Evaluation results marked with **IT** are for
294
+ instruction-tuned models. Evaluation results marked with **PT** are for
295
+ pre-trained models.
296
+
297
+ #### Reasoning and factuality
298
+
299
+ | Benchmark | Metric | n-shot | E2B PT | E4B PT |
300
+ | ------------------------------ |----------------|----------|:--------:|:--------:|
301
+ | [HellaSwag][hellaswag] | Accuracy | 10-shot | 72.2 | 78.6 |
302
+ | [BoolQ][boolq] | Accuracy | 0-shot | 76.4 | 81.6 |
303
+ | [PIQA][piqa] | Accuracy | 0-shot | 78.9 | 81.0 |
304
+ | [SocialIQA][socialiqa] | Accuracy | 0-shot | 48.8 | 50.0 |
305
+ | [TriviaQA][triviaqa] | Accuracy | 5-shot | 60.8 | 70.2 |
306
+ | [Natural Questions][naturalq] | Accuracy | 5-shot | 15.5 | 20.9 |
307
+ | [ARC-c][arc] | Accuracy | 25-shot | 51.7 | 61.6 |
308
+ | [ARC-e][arc] | Accuracy | 0-shot | 75.8 | 81.6 |
309
+ | [WinoGrande][winogrande] | Accuracy | 5-shot | 66.8 | 71.7 |
310
+ | [BIG-Bench Hard][bbh] | Accuracy | few-shot | 44.3 | 52.9 |
311
+ | [DROP][drop] | Token F1 score | 1-shot | 53.9 | 60.8 |
312
+
313
+ [hellaswag]: https://arxiv.org/abs/1905.07830
314
+ [boolq]: https://arxiv.org/abs/1905.10044
315
+ [piqa]: https://arxiv.org/abs/1911.11641
316
+ [socialiqa]: https://arxiv.org/abs/1904.09728
317
+ [triviaqa]: https://arxiv.org/abs/1705.03551
318
+ [naturalq]: https://github.com/google-research-datasets/natural-questions
319
+ [arc]: https://arxiv.org/abs/1911.01547
320
+ [winogrande]: https://arxiv.org/abs/1907.10641
321
+ [bbh]: https://paperswithcode.com/dataset/bbh
322
+ [drop]: https://arxiv.org/abs/1903.00161
323
+
324
+ #### Multilingual
325
+
326
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
327
+ | ------------------------------------|-------------------------|----------|:--------:|:--------:|
328
+ | [MGSM][mgsm] | Accuracy | 0-shot | 53.1 | 60.7 |
329
+ | [WMT24++][wmt24pp] (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
330
+ | [Include][include] | Accuracy | 0-shot | 38.6 | 57.2 |
331
+ | [MMLU][mmlu] (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
332
+ | [OpenAI MMLU][openai-mmlu] | Accuracy | 0-shot | 22.3 | 35.6 |
333
+ | [Global-MMLU][global-mmlu] | Accuracy | 0-shot | 55.1 | 60.3 |
334
+ | [ECLeKTic][eclektic] | ECLeKTic score | 0-shot | 2.5 | 1.9 |
335
+
336
+ [mgsm]: https://arxiv.org/abs/2210.03057
337
+ [wmt24pp]: https://arxiv.org/abs/2502.12404v1
338
+ [include]: https://arxiv.org/abs/2411.19799
339
+ [mmlu]: https://arxiv.org/abs/2009.03300
340
+ [openai-mmlu]: https://huggingface.co/datasets/openai/MMMLU
341
+ [global-mmlu]: https://huggingface.co/datasets/CohereLabs/Global-MMLU
342
+ [eclektic]: https://arxiv.org/abs/2502.21228
343
+
344
+ #### STEM and code
345
+
346
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
347
+ | ------------------------------------|--------------------------|----------|:--------:|:--------:|
348
+ | [GPQA][gpqa] Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
349
+ | [LiveCodeBench][lcb] v5 | pass@1 | 0-shot | 18.6 | 25.7 |
350
+ | Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
351
+ | [AIME 2025][aime-2025] | Accuracy | 0-shot | 6.7 | 11.6 |
352
+
353
+ [gpqa]: https://arxiv.org/abs/2311.12022
354
+ [lcb]: https://arxiv.org/abs/2403.07974
355
+ [aime-2025]: https://www.vals.ai/benchmarks/aime-2025-05-09
356
+
357
+ #### Additional benchmarks
358
+
359
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
360
+ | ------------------------------------ |------------|----------|:--------:|:--------:|
361
+ | [MMLU][mmlu] | Accuracy | 0-shot | 60.1 | 64.9 |
362
+ | [MBPP][mbpp] | pass@1 | 3-shot | 56.6 | 63.6 |
363
+ | [HumanEval][humaneval] | pass@1 | 0-shot | 66.5 | 75.0 |
364
+ | [LiveCodeBench][lcb] | pass@1 | 0-shot | 13.2 | 13.2 |
365
+ | HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
366
+ | [Global-MMLU-Lite][global-mmlu-lite] | Accuracy | 0-shot | 59.0 | 64.5 |
367
+ | [MMLU][mmlu] (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |
368
+
369
+ [gpqa]: https://arxiv.org/abs/2311.12022
370
+ [mbpp]: https://arxiv.org/abs/2108.07732
371
+ [humaneval]: https://arxiv.org/abs/2107.03374
372
+ [lcb]: https://arxiv.org/abs/2403.07974
373
+ [global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
374
+
375
+ ## Ethics and Safety
376
+
377
+ Ethics and safety evaluation approach and results.
378
+
379
+ ### Evaluation Approach
380
+
381
+ Our evaluation methods include structured evaluations and internal red-teaming
382
+ testing of relevant content policies. Red-teaming was conducted by a number of
383
+ different teams, each with different goals and human evaluation metrics. These
384
+ models were evaluated against a number of different categories relevant to
385
+ ethics and safety, including:
386
+
387
+ - **Child Safety**: Evaluation of text-to-text and image-to-text prompts
388
+ covering child safety policies, including child sexual abuse and
389
+ exploitation.
390
+ - **Content Safety**: Evaluation of text-to-text and image-to-text prompts
392
+ covering safety policies including harassment, violence and gore, and hate
392
+ speech.
393
+ - **Representational Harms**: Evaluation of text-to-text and image-to-text
394
+ prompts covering safety policies including bias, stereotyping, and harmful
395
+ associations or inaccuracies.
396
+
397
+ In addition to development-level evaluations, we conduct "assurance
398
+ evaluations", which are our 'arms-length' internal evaluations for responsibility
399
+ governance decision making. They are conducted separately from the model
400
+ development team, to inform decision making about release. High-level findings
401
+ are fed back to the model team, but prompt sets are held out to prevent
402
+ overfitting and preserve the results' ability to inform decision making. Notable
403
+ assurance evaluation results are reported to our Responsibility & Safety Council
404
+ as part of release review.
405
+
406
+ ### Evaluation Results
407
+
408
+ For all areas of safety testing, we saw safe levels of performance across the
409
+ categories of child safety, content safety, and representational harms relative
410
+ to previous Gemma models. All testing was conducted without safety filters to
411
+ evaluate the model capabilities and behaviors. For text-to-text, image-to-text,
412
+ and audio-to-text, and across all model sizes, the model produced minimal policy
413
+ violations, and showed significant improvements over previous Gemma models'
414
+ performance with respect to high-severity violations. A limitation of our
415
+ evaluations was that they included primarily English-language prompts.
416
+
417
+ ## Usage and Limitations
418
+
419
+ These models have certain limitations that users should be aware of.
420
+
421
+ ### Intended Usage
422
+
423
+ Open generative models have a wide range of applications across various
424
+ industries and domains. The following list of potential uses is not
425
+ comprehensive. The purpose of this list is to provide contextual information
426
+ about the possible use-cases that the model creators considered as part of model
427
+ training and development.
428
+
429
+ - Content Creation and Communication
430
+ - **Text Generation**: Generate creative text formats such as
431
+ poems, scripts, code, marketing copy, and email drafts.
432
+ - **Chatbots and Conversational AI**: Power conversational
433
+ interfaces for customer service, virtual assistants, or interactive
434
+ applications.
435
+ - **Text Summarization**: Generate concise summaries of a text
436
+ corpus, research papers, or reports.
437
+ - **Image Data Extraction**: Extract, interpret, and summarize
438
+ visual data for text communications.
439
+ - **Audio Data Extraction**: Transcribe spoken language, translate speech
440
+ to text in other languages, and analyze sound-based data.
441
+ - Research and Education
442
+ - **Natural Language Processing (NLP) and Generative Model
443
+ Research**: These models can serve as a foundation for researchers to
444
+ experiment with generative models and NLP techniques, develop
445
+ algorithms, and contribute to the advancement of the field.
446
+ - **Language Learning Tools**: Support interactive language
447
+ learning experiences, aiding in grammar correction or providing writing
448
+ practice.
449
+ - **Knowledge Exploration**: Assist researchers in exploring large
450
+ bodies of data by generating summaries or answering questions about
451
+ specific topics.
452
+
453
+ ### Limitations
454
+
455
+ - Training Data
456
+ - The quality and diversity of the training data significantly
457
+ influence the model's capabilities. Biases or gaps in the training data
458
+ can lead to limitations in the model's responses.
459
+ - The scope of the training dataset determines the subject areas
460
+ the model can handle effectively.
461
+ - Context and Task Complexity
462
+ - Models are better at tasks that can be framed with clear
463
+ prompts and instructions. Open-ended or highly complex tasks might be
464
+ challenging.
465
+ - A model's performance can be influenced by the amount of context
466
+ provided (longer context generally leads to better outputs, up to a
467
+ certain point).
468
+ - Language Ambiguity and Nuance
469
+ - Natural language is inherently complex. Models might struggle
470
+ to grasp subtle nuances, sarcasm, or figurative language.
471
+ - Factual Accuracy
472
+ - Models generate responses based on information they learned
473
+ from their training datasets, but they are not knowledge bases. They
474
+ may generate incorrect or outdated factual statements.
475
+ - Common Sense
476
+ - Models rely on statistical patterns in language. They might
477
+ lack the ability to apply common sense reasoning in certain situations.
478
+
479
+ ### Ethical Considerations and Risks
480
+
481
+ The development of generative models raises several ethical concerns. In
482
+ creating an open model, we have carefully considered the following:
483
+
484
+ - Bias and Fairness
485
+ - Generative models trained on large-scale, real-world text and image data
486
+ can reflect socio-cultural biases embedded in the training material.
487
+ These models underwent careful scrutiny; input data pre-processing is
488
+ described and posterior evaluations are reported in this card.
489
+ - Misinformation and Misuse
490
+ - Generative models can be misused to generate text that is
491
+ false, misleading, or harmful.
492
+ - Guidelines are provided for responsible use with the model; see the
493
+ [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
494
+ - Transparency and Accountability
495
+ - This model card summarizes details on the models' architecture,
496
+ capabilities, limitations, and evaluation processes.
497
+ - A responsibly developed open model offers the opportunity to
498
+ share innovation by making generative model technology accessible to
499
+ developers and researchers across the AI ecosystem.
500
+
501
+ Risks identified and mitigations:
502
+
503
+ - **Perpetuation of biases**: Developers are encouraged to perform continuous
504
+ monitoring (using evaluation metrics, human review) and to explore de-biasing
505
+ techniques during model training, fine-tuning, and other use cases.
506
+ - **Generation of harmful content**: Mechanisms and guidelines for content
507
+ safety are essential. Developers are encouraged to exercise caution and
508
+ implement appropriate content safety safeguards based on their specific
509
+ product policies and application use cases.
510
+ - **Misuse for malicious purposes**: Technical limitations and developer
511
+ and end-user education can help mitigate malicious applications of
512
+ generative models. Educational resources and reporting mechanisms for users
513
+ to flag misuse are provided. Prohibited uses of Gemma models are outlined
514
+ in the
515
+ [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
516
+ - **Privacy violations**: Models were trained on data filtered for removal of
517
+ certain personal information and other sensitive data. Developers are
518
+ encouraged to adhere to privacy regulations with privacy-preserving
519
+ techniques.
520
+
521
+ ### Benefits
522
+
523
+ At the time of release, this family of models provides high-performance open
524
+ generative model implementations designed from the ground up for responsible AI
525
+ development, compared to similarly sized models.
526
+
527
+ Using the benchmark evaluation metrics described in this document, these models
528
+ have been shown to provide superior performance to other, comparably sized open model
529
+ alternatives.
chat_template.jinja ADDED
@@ -0,0 +1,49 @@
1
+ {{ bos_token }}
2
+ {%- if messages[0]['role'] == 'system' -%}
3
+ {%- if messages[0]['content'] is string -%}
4
+ {%- set first_user_prefix = messages[0]['content'] + '
5
+
6
+ ' -%}
7
+ {%- else -%}
8
+ {%- set first_user_prefix = messages[0]['content'][0]['text'] + '
9
+
10
+ ' -%}
11
+ {%- endif -%}
12
+ {%- set loop_messages = messages[1:] -%}
13
+ {%- else -%}
14
+ {%- set first_user_prefix = "" -%}
15
+ {%- set loop_messages = messages -%}
16
+ {%- endif -%}
17
+ {%- for message in loop_messages -%}
18
+ {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
19
+ {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
20
+ {%- endif -%}
21
+ {%- if (message['role'] == 'assistant') -%}
22
+ {%- set role = "model" -%}
23
+ {%- else -%}
24
+ {%- set role = message['role'] -%}
25
+ {%- endif -%}
26
+ {{ '<start_of_turn>' + role + '
27
+ ' + (first_user_prefix if loop.first else "") }}
28
+ {%- if message['content'] is string -%}
29
+ {{ message['content'] | trim }}
30
+ {%- elif message['content'] is iterable -%}
31
+ {%- for item in message['content'] -%}
32
+ {%- if item['type'] == 'audio' -%}
33
+ {{ '<audio_soft_token>' }}
34
+ {%- elif item['type'] == 'image' -%}
35
+ {{ '<image_soft_token>' }}
36
+ {%- elif item['type'] == 'text' -%}
37
+ {{ item['text'] | trim }}
38
+ {%- endif -%}
39
+ {%- endfor -%}
40
+ {%- else -%}
41
+ {{ raise_exception("Invalid content type") }}
42
+ {%- endif -%}
43
+ {{ '<end_of_turn>
44
+ ' }}
45
+ {%- endfor -%}
46
+ {%- if add_generation_prompt -%}
47
+ {{'<start_of_turn>model
48
+ '}}
49
+ {%- endif -%}
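
To see the prompt string this template produces, here is a minimal sketch that renders a single user turn without tokenizing (it assumes the `google/gemma-3n-e4b-it` processor used in the README above):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3n-e4b-it")

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Hello!"}]},
]

# tokenize=False returns the rendered prompt string rather than token ids.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected, following the template above:
# <bos><start_of_turn>user
# Hello!<end_of_turn>
# <start_of_turn>model
```
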
config.json ADDED
@@ -0,0 +1,227 @@
1
+ {
2
+ "architectures": [
3
+ "Gemma3nModel"
4
+ ],
5
+ "audio_config": {
6
+ "conf_attention_chunk_size": 12,
7
+ "conf_attention_context_left": 13,
8
+ "conf_attention_context_right": 0,
9
+ "conf_attention_logit_cap": 50.0,
10
+ "conf_conv_kernel_size": 5,
11
+ "conf_num_attention_heads": 8,
12
+ "conf_num_hidden_layers": 12,
13
+ "conf_positional_bias_size": 256,
14
+ "conf_reduction_factor": 4,
15
+ "conf_residual_weight": 0.5,
16
+ "gradient_clipping": 10000000000.0,
17
+ "hidden_size": 1536,
18
+ "input_feat_size": 128,
19
+ "model_type": "gemma3n_audio",
20
+ "rms_norm_eps": 1e-06,
21
+ "sscp_conv_channel_size": [
22
+ 128,
23
+ 32
24
+ ],
25
+ "sscp_conv_eps": 0.001,
26
+ "sscp_conv_group_norm_eps": 0.001,
27
+ "sscp_conv_kernel_size": [
28
+ [
29
+ 3,
30
+ 3
31
+ ],
32
+ [
33
+ 3,
34
+ 3
35
+ ]
36
+ ],
37
+ "sscp_conv_stride_size": [
38
+ [
39
+ 2,
40
+ 2
41
+ ],
42
+ [
43
+ 2,
44
+ 2
45
+ ]
46
+ ],
47
+ "torch_dtype": "float16",
48
+ "vocab_offset": 262272,
49
+ "vocab_size": 128
50
+ },
51
+ "audio_soft_tokens_per_image": 188,
52
+ "audio_token_id": 262273,
53
+ "boa_token_id": 256000,
54
+ "boi_token_id": 255999,
55
+ "eoa_token_id": 262272,
56
+ "eoi_token_id": 262144,
57
+ "eos_token_id": [
58
+ 1,
59
+ 106
60
+ ],
61
+ "image_token_id": 262145,
62
+ "initializer_range": 0.02,
63
+ "model_type": "gemma3n",
64
+ "text_config": {
65
+ "activation_sparsity_pattern": [
66
+ 0.95,
67
+ 0.95,
68
+ 0.95,
69
+ 0.95,
70
+ 0.95,
71
+ 0.95,
72
+ 0.95,
73
+ 0.95,
74
+ 0.95,
75
+ 0.95,
76
+ 0.0,
77
+ 0.0,
78
+ 0.0,
79
+ 0.0,
80
+ 0.0,
81
+ 0.0,
82
+ 0.0,
83
+ 0.0,
84
+ 0.0,
85
+ 0.0,
86
+ 0.0,
87
+ 0.0,
88
+ 0.0,
89
+ 0.0,
90
+ 0.0,
91
+ 0.0,
92
+ 0.0,
93
+ 0.0,
94
+ 0.0,
95
+ 0.0,
96
+ 0.0,
97
+ 0.0,
98
+ 0.0,
99
+ 0.0,
100
+ 0.0
101
+ ],
102
+ "altup_active_idx": 0,
103
+ "altup_coef_clip": 120.0,
104
+ "altup_correct_scale": true,
105
+ "altup_lr_multiplier": 1.0,
106
+ "altup_num_inputs": 4,
107
+ "attention_bias": false,
108
+ "attention_dropout": 0.0,
109
+ "final_logit_softcapping": 30.0,
110
+ "head_dim": 256,
111
+ "hidden_activation": "gelu_pytorch_tanh",
112
+ "hidden_size": 2048,
113
+ "hidden_size_per_layer_input": 256,
114
+ "initializer_range": 0.02,
115
+ "intermediate_size": [
116
+ 16384,
117
+ 16384,
118
+ 16384,
119
+ 16384,
120
+ 16384,
121
+ 16384,
122
+ 16384,
123
+ 16384,
124
+ 16384,
125
+ 16384,
126
+ 16384,
127
+ 16384,
128
+ 16384,
129
+ 16384,
130
+ 16384,
131
+ 16384,
132
+ 16384,
133
+ 16384,
134
+ 16384,
135
+ 16384,
136
+ 16384,
137
+ 16384,
138
+ 16384,
139
+ 16384,
140
+ 16384,
141
+ 16384,
142
+ 16384,
143
+ 16384,
144
+ 16384,
145
+ 16384,
146
+ 16384,
147
+ 16384,
148
+ 16384,
149
+ 16384,
150
+ 16384
151
+ ],
152
+ "laurel_rank": 64,
153
+ "layer_types": [
154
+ "sliding_attention",
155
+ "sliding_attention",
156
+ "sliding_attention",
157
+ "sliding_attention",
158
+ "full_attention",
159
+ "sliding_attention",
160
+ "sliding_attention",
161
+ "sliding_attention",
162
+ "sliding_attention",
163
+ "full_attention",
164
+ "sliding_attention",
165
+ "sliding_attention",
166
+ "sliding_attention",
167
+ "sliding_attention",
168
+ "full_attention",
169
+ "sliding_attention",
170
+ "sliding_attention",
171
+ "sliding_attention",
172
+ "sliding_attention",
173
+ "full_attention",
174
+ "sliding_attention",
175
+ "sliding_attention",
176
+ "sliding_attention",
177
+ "sliding_attention",
178
+ "full_attention",
179
+ "sliding_attention",
180
+ "sliding_attention",
181
+ "sliding_attention",
182
+ "sliding_attention",
183
+ "full_attention",
184
+ "sliding_attention",
185
+ "sliding_attention",
186
+ "sliding_attention",
187
+ "sliding_attention",
188
+ "full_attention"
189
+ ],
190
+ "max_position_embeddings": 32768,
191
+ "model_type": "gemma3n_text",
192
+ "num_attention_heads": 8,
193
+ "num_hidden_layers": 35,
194
+ "num_key_value_heads": 2,
195
+ "num_kv_shared_layers": 15,
196
+ "query_pre_attn_scalar": 256,
197
+ "rms_norm_eps": 1e-06,
198
+ "rope_local_base_freq": 10000.0,
199
+ "rope_scaling": null,
200
+ "rope_theta": 1000000.0,
201
+ "sliding_window": 512,
202
+ "torch_dtype": "float16",
203
+ "use_cache": true,
204
+ "vocab_size": 262400,
205
+ "vocab_size_per_layer_input": 262144
206
+ },
207
+ "torch_dtype": "float16",
208
+ "transformers_version": "4.53.0",
209
+ "vision_config": {
210
+ "architecture": "mobilenetv5_300m_enc",
211
+ "do_pooling": true,
212
+ "hidden_size": 2048,
213
+ "initializer_range": 0.02,
214
+ "label_names": [
215
+ "LABEL_0",
216
+ "LABEL_1"
217
+ ],
218
+ "model_args": null,
219
+ "model_type": "gemma3n_vision",
220
+ "num_classes": 2,
221
+ "rms_norm_eps": 1e-06,
222
+ "torch_dtype": "float16",
223
+ "vocab_offset": 262144,
224
+ "vocab_size": 128
225
+ },
226
+ "vision_soft_tokens_per_image": 256
227
+ }
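
The nested structure above (text, vision, and audio sub-configs) is exposed through the `transformers` config classes. A minimal sketch for inspecting a few of the fields listed above without downloading the weights (assuming the repo id used in the README):

```python
from transformers import AutoConfig

# Loading the config only fetches config.json, not the model weights.
config = AutoConfig.from_pretrained("google/gemma-3n-e4b-it")

print(config.model_type)                     # gemma3n
print(config.text_config.num_hidden_layers)  # 35
print(config.text_config.hidden_size)        # 2048
print(config.text_config.sliding_window)     # 512
print(config.vision_soft_tokens_per_image)   # 256
print(config.audio_soft_tokens_per_image)    # 188
```
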
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "bos_token_id": 2,
3
+ "cache_implementation": "hybrid",
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 1,
7
+ 106
8
+ ],
9
+ "pad_token_id": 0,
10
+ "top_k": 64,
11
+ "top_p": 0.95,
12
+ "transformers_version": "4.53.0.dev0"
13
+ }
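
These values are the sampling defaults that `model.generate()` picks up automatically. A minimal sketch for inspecting them, and the usual way to override them per call (assuming the repo id used in the README):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("google/gemma-3n-e4b-it")
print(gen_config.do_sample, gen_config.top_k, gen_config.top_p)  # True 64 0.95
print(gen_config.eos_token_id)                                   # [1, 106]

# Any keyword argument passed to model.generate() (e.g. do_sample=False for
# greedy decoding, as in the README snippets) overrides these defaults for that call.
```
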
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5e2bf44e2f3bc799c65be36b9cc0f9d90ed57aeeaf81238e5ac31db85d6bb5cb
3
+ size 4967927464
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6bb5be03e96615000518b6e7513c25bb8f65f170513146cc07f45109a1a19447
3
+ size 4569243648
preprocessor_config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "crop_size": null,
3
+ "data_format": "channels_first",
4
+ "default_to_square": false,
5
+ "device": null,
6
+ "dither": 0.0,
7
+ "do_center_crop": null,
8
+ "do_convert_rgb": null,
9
+ "do_normalize": false,
10
+ "do_rescale": true,
11
+ "do_resize": true,
12
+ "feature_extractor_type": "Gemma3nAudioFeatureExtractor",
13
+ "feature_size": 128,
14
+ "fft_length": 1024,
15
+ "fft_overdrive": true,
16
+ "frame_length": 512,
17
+ "hop_length": 160,
18
+ "image_mean": [
19
+ 0.5,
20
+ 0.5,
21
+ 0.5
22
+ ],
23
+ "image_processor_type": "SiglipImageProcessorFast",
24
+ "image_seq_length": 256,
25
+ "image_std": [
26
+ 0.5,
27
+ 0.5,
28
+ 0.5
29
+ ],
30
+ "input_data_format": null,
31
+ "input_scale_factor": 1.0,
32
+ "max_frequency": 7600.0,
33
+ "mel_floor": 1e-05,
34
+ "min_frequency": 125.0,
35
+ "padding_side": "right",
36
+ "padding_value": 0.0,
37
+ "per_bin_mean": null,
38
+ "per_bin_stddev": null,
39
+ "preemphasis": 0.97,
40
+ "preemphasis_htk_flavor": true,
41
+ "processor_class": "Gemma3nProcessor",
42
+ "resample": 2,
43
+ "rescale_factor": 0.00392156862745098,
44
+ "return_attention_mask": false,
45
+ "return_tensors": null,
46
+ "sampling_rate": 16000,
47
+ "size": {
48
+ "height": 768,
49
+ "width": 768
50
+ }
51
+ }
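
The image side of this preprocessor resizes inputs to 768x768 and rescales pixel values by 1/255, as configured above. A minimal sketch of that step in isolation, using a blank test image and assuming the standard image-processor call signature and output key:

```python
import numpy as np
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3n-e4b-it")

# A blank 500x300 RGB image; any input size is resized to 768x768 per the config above.
image = Image.fromarray(np.zeros((300, 500, 3), dtype=np.uint8))

pixel_values = processor.image_processor(image, return_tensors="pt")["pixel_values"]
print(pixel_values.shape)  # expected: torch.Size([1, 3, 768, 768])
```
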
processor_config.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "audio_seq_length": 188,
3
+ "image_seq_length": 256,
4
+ "processor_class": "Gemma3nProcessor"
5
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "audio_token": "<audio_soft_token>",
3
+ "boa_token": "<start_of_audio>",
4
+ "boi_token": "<start_of_image>",
5
+ "bos_token": {
6
+ "content": "<bos>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "eoa_token": "<end_of_audio>",
13
+ "eoi_token": "<end_of_image>",
14
+ "eos_token": {
15
+ "content": "<eos>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "image_token": "<image_soft_token>",
22
+ "pad_token": {
23
+ "content": "<pad>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false
28
+ },
29
+ "unk_token": {
30
+ "content": "<unk>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false
35
+ }
36
+ }
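
These special tokens should line up with the ids declared in config.json above (for example, `image_token_id` 262145 and `audio_token_id` 262273). A minimal sketch for checking that correspondence with the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3n-e4b-it")

print(tokenizer.convert_tokens_to_ids("<image_soft_token>"))  # expected: 262145 (image_token_id)
print(tokenizer.convert_tokens_to_ids("<audio_soft_token>"))  # expected: 262273 (audio_token_id)
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)  # <bos> <eos> <pad>
```
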
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c4c19736bf24d1c6805cf49340e31bd02c70fb7857a2cb31065c90c2b5719c4e
3
+ size 33442559
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea5f0cc48abfbfc04d14562270a32e02149a3e7035f368cc5a462786f4a59961
3
+ size 4696020
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff