Mungert committed on
Commit 85f90e4 · verified · 1 Parent(s): ccbaeb5

Update README.md

Files changed (1): README.md (+169, −3)
README.md CHANGED

---
license: gemma
---

# <span style="color: #7FFF7F;">Gemma-3 4B Instruct GGUF Models</span>

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs, or see the check below).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
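If you are not sure what your device supports, a quick check like the one below can help. It is only a rough sketch and assumes a CUDA GPU with PyTorch installed; Apple Silicon, CPU-only, and other setups will need a different check.

```python
# Rough capability check (assumes PyTorch with CUDA; adjust for other setups).
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), f"(compute capability {major}.{minor})")
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    # FP16 works on most modern CUDA GPUs; Tensor Core acceleration
    # generally starts around compute capability 7.0.
    print("FP16 likely fast:", major >= 7)
else:
    print("No CUDA GPU detected - a quantized file (Q4_K / Q6_K / Q8) is usually the better choice for CPU.")
```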
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision**, but a smaller range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have tight memory limitations (consider a quantized format instead).

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, but may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce the **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
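How much memory each format needs can be estimated with simple arithmetic. The sketch below uses rough nominal bits-per-weight figures of my own choosing; real GGUF files differ somewhat because K-quants store block scales and some tensors may be kept at higher precision.

```python
# Back-of-the-envelope weight-size estimate for a ~4B-parameter model.
# Bits-per-weight values are rough nominal figures, not exact GGUF sizes.
PARAMS = 4e9

nominal_bits = {"BF16/F16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K": 4.5}

for fmt, bits in nominal_bits.items():
    gib = PARAMS * bits / 8 / (1024 ** 3)
    print(f"{fmt:>8}: ~{gib:.1f} GiB of weights")
```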

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Low | Very Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium-Low | Low | CPU with more memory | Better accuracy while still being quantized |
| **Q8** | Medium | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
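Read as code, the table boils down to a simple decision rule. The helper below is purely illustrative: the function name and the memory thresholds are my own assumptions, not something shipped with this repo.

```python
# Illustrative decision rule mirroring the summary table above.
# The thresholds (in GiB of free RAM/VRAM) are rough assumptions.
def choose_format(supports_bf16: bool, supports_fp16: bool, free_mem_gib: float) -> str:
    if free_mem_gib >= 9:      # room for full 16-bit weights (~8 GiB) plus overhead
        if supports_bf16:
            return "bf16"
        if supports_fp16:
            return "f16"
    if free_mem_gib >= 5:
        return "q8"            # best accuracy among the quantized files
    if free_mem_gib >= 4:
        return "q6_k"
    return "q4_k"              # smallest footprint, lowest precision


print(choose_format(supports_bf16=False, supports_fp16=True, free_mem_gib=6.0))  # -> q8
```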

## **Included Files & Details**

### `google_gemma-3-4b-it-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

### `google_gemma-3-4b-it-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `google_gemma-3-4b-it-bf16-q8.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `google_gemma-3-4b-it-f16-q8.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `google_gemma-3-4b-it-q4_k_l.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `google_gemma-3-4b-it-q4_k_m.gguf`
- Similar to Q4_K.
- Another option for **low-memory CPU inference**.

### `google_gemma-3-4b-it-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `google_gemma-3-4b-it-q6_k_l.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `google_gemma-3-4b-it-q6_k_m.gguf`
- A mid-range **Q6_K** quantized model for balanced performance.
- Suitable for **CPU-based inference** with **moderate memory**.

### `google_gemma-3-4b-it-q8.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.
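To try any of these files from Python, one option is `llama-cpp-python`; this is just one possible runtime, and the sketch below assumes you have installed it, downloaded a file locally, and are using a llama.cpp build recent enough to include Gemma 3 support. It runs a single text-only chat turn.

```python
# Minimal text-only sketch using llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="google_gemma-3-4b-it-q4_k_m.gguf",  # any file from the list above
    n_ctx=8192,       # context to allocate; the model itself supports up to 128K
    n_gpu_layers=0,   # raise to offload layers to a GPU, if you have one
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain BF16 vs Q4_K in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```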

# Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Resources and Technical Documentation**:

* [Gemma 3 Technical Report][g3-tech-report]
* [Responsible Generative AI Toolkit][rai-toolkit]
* [Gemma on Kaggle][kaggle-gemma]
* [Gemma on Vertex Model Garden][vertex-mg-gemma3]

**Terms of Use**: [Terms][terms]

**Authors**: Google DeepMind

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained variants and instruction-tuned
variants. Gemma 3 has a large, 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well-suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops or your own cloud infrastructure,
democratizing access to state of the art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
    32K tokens for the 1B size

- **Output:**
  - Generated text in response to the input, such as an answer to a question,
    analysis of image content, or a summary of a document
  - Total output context of 8192 tokens
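As a rough illustration of those limits for the 4B model: each image costs 256 input tokens, text costs whatever it tokenizes to, and everything must fit in the 128K-token input window, with generation capped at 8192 tokens. A tiny budget check (treating 128K as 128 × 1024, which is an assumption about how the limit is counted):

```python
# Rough token-budget check using the figures listed above.
INPUT_CONTEXT = 128 * 1024   # 128K-token input window (4B/12B/27B sizes)
TOKENS_PER_IMAGE = 256       # each image is encoded to 256 tokens
OUTPUT_CAP = 8192            # maximum generated tokens

num_images, text_tokens = 4, 2_000
used = num_images * TOKENS_PER_IMAGE + text_tokens
print(f"Prompt uses {used} of {INPUT_CONTEXT} input tokens; "
      f"{INPUT_CONTEXT - used} remain, and output is capped at {OUTPUT_CAP}.")
```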

## Credits

Thanks to [Bartowski](https://huggingface.co/bartowski) for the imatrix upload, and for your guidance on quantization, which enabled me to produce these GGUF files.