---
license: gemma
---

# <span style="color: #7FFF7F;">Gemma-3 4B Instruct GGUF Models</span>

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs; a quick capability check is sketched below).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

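Not sure whether your GPU exposes native BF16? A minimal sketch, assuming PyTorch with CUDA is installed, that reports whether the device advertises BF16 and which of the files below is likely the better starting point:

```python
# Quick, optional capability check; not part of the GGUF workflow itself.
# Assumes PyTorch is installed; CPU-only machines fall through to the last branch.
import torch

if torch.cuda.is_available():
    device = torch.cuda.get_device_name(0)
    if torch.cuda.is_bf16_supported():
        print(f"{device}: native BF16 available -> the bf16 GGUF is a good fit")
    else:
        print(f"{device}: no native BF16 -> prefer the f16 or quantized GGUF files")
else:
    print("No CUDA GPU detected -> the quantized (Q4_K / Q6_K / Q8) files are the usual CPU choice")
```
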
---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision**, but a smaller range of representable values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16, but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations (a loading sketch follows below).

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have tight memory limitations.

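As a rough illustration of running the F16 file on a GPU, here is a minimal llama-cpp-python sketch; the local file path and generation settings are placeholders, and your llama-cpp-python build must include GPU offload support for `n_gpu_layers` to have any effect:

```python
# Minimal sketch: GPU inference with the F16 GGUF via llama-cpp-python.
# Adjust the path and settings for your own setup.
from llama_cpp import Llama

llm = Llama(
    model_path="google_gemma-3-4b-it-f16.gguf",  # local copy of the F16 file from this repo
    n_gpu_layers=-1,  # offload as many layers as possible to the GPU
    n_ctx=8192,       # context length; raise it only if you have the memory
)

out = llm("Explain the difference between BF16 and F16 in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```
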
---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, but require more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model (see the example after this list).
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

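On a CPU-only machine, a quantized file such as the Q4_K_M variant is usually the practical choice. The sketch below is one way to load it with llama-cpp-python; the path and thread count are placeholders to adjust for your hardware:

```python
# Minimal sketch: CPU inference with a Q4_K quantized GGUF via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="google_gemma-3-4b-it-q4_k_m.gguf",  # quantized file from this repo
    n_ctx=4096,    # a smaller context keeps the KV cache affordable on CPU
    n_threads=8,   # set to your physical core count
)

out = llm("In one sentence, what does GGUF quantization trade off?", max_tokens=100)
print(out["choices"][0]["text"])
```
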
---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Low | Very Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium-Low | Low | CPU with more memory | Better accuracy while still being quantized |
| **Q8** | Medium | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |

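To turn the table above into a concrete number for your own machine, the sketch below estimates how much memory a given GGUF file will want. The rule of thumb it encodes (file size plus a modest overhead for the KV cache and runtime buffers at moderate context lengths) is an assumption, not an exact accounting, and long contexts will need noticeably more:

```python
# Rough rule-of-thumb sketch: estimate RAM/VRAM needed to run a GGUF file.
# Assumption: resident weights roughly match the file size, plus ~15% overhead
# for the KV cache and runtime buffers at moderate context lengths.
import os

def estimate_memory_gb(gguf_path: str, overhead: float = 0.15) -> float:
    file_gb = os.path.getsize(gguf_path) / 1024**3
    return file_gb * (1.0 + overhead)

# Example with a placeholder local path:
# print(f"~{estimate_memory_gb('google_gemma-3-4b-it-q4_k_m.gguf'):.1f} GB")
```
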
## **Included Files & Details**

### `google_gemma-3-4b-it-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

### `google_gemma-3-4b-it-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `google_gemma-3-4b-it-bf16-q8.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `google_gemma-3-4b-it-f16-q8.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `google_gemma-3-4b-it-q4_k_l.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `google_gemma-3-4b-it-q4_k_m.gguf`
- The standard medium **Q4_K** variant: smaller than Q4_K_L, more accurate than Q4_K_S.
- Another option for **low-memory CPU inference**.

### `google_gemma-3-4b-it-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `google_gemma-3-4b-it-q6_k_l.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `google_gemma-3-4b-it-q6_k_m.gguf`
- A mid-range **Q6_K** quantized model for balanced performance.
- Suitable for **CPU-based inference** with **moderate memory**.

### `google_gemma-3-4b-it-q8.gguf`
- Fully **Q8_0** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

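To fetch a single file from the list above rather than cloning the whole repository, `huggingface_hub` can download it directly; the `repo_id` below is a placeholder, so substitute this repository's actual id:

```python
# Minimal sketch: download one GGUF file with huggingface_hub.
# NOTE: repo_id is a placeholder; replace it with this model repository's id.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="<user>/gemma-3-4b-it-gguf",          # placeholder repo id
    filename="google_gemma-3-4b-it-q4_k_m.gguf",  # any file listed above
)
print("Downloaded to:", path)
```
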
# Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Resources and Technical Documentation**:

* [Gemma 3 Technical Report][g3-tech-report]
* [Responsible Generative AI Toolkit][rai-toolkit]
* [Gemma on Kaggle][kaggle-gemma]
* [Gemma on Vertex Model Garden][vertex-mg-gemma3]

**Terms of Use**: [Terms][terms]

**Authors**: Google DeepMind

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained variants and instruction-tuned
variants. Gemma 3 has a large, 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well-suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops or your own cloud infrastructure,
democratizing access to state of the art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
    - Text string, such as a question, a prompt, or a document to be summarized
    - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
    - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
      32K tokens for the 1B size

- **Output:**
    - Generated text in response to the input, such as an answer to a question,
      analysis of image content, or a summary of a document
    - Total output context of 8192 tokens

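To see these inputs and outputs in practice with the instruction-tuned GGUF files from this repo, a minimal text-only chat sketch with llama-cpp-python might look like the following; image input and the full 128K context are not exercised here, and the file path is a placeholder:

```python
# Minimal text-only chat sketch with an instruction-tuned GGUF (llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="google_gemma-3-4b-it-q4_k_m.gguf", n_ctx=8192)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Summarize the key features of Gemma 3 in two sentences."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```
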
## Credits

Thanks to [Bartowski](https://huggingface.co/bartowski) for the imatrix upload, and for the guidance on quantization that enabled me to produce these GGUF files.