Update README.md
README.md
---
license: mit
---
###### bubba

**Quantized and fine-tuned GGUF based on OpenAI’s `gpt-oss-20b`**
Format: **GGUF** (for `llama.cpp` and compatible runtimes) • Quantization: **Q4_K_XL (4-bit, K-grouped, extra-low loss)**
File: `bubba-20b-Q4_K_XL.gguf`

## 🧠 Overview

- This repo provides a **4-bit K-quantized** `.gguf` for fast local inference of a 20B-parameter model derived from **OpenAI’s `gpt-oss-20b`** (as reported by the uploader).
- **Use cases:** general chat/instruction following, coding help, knowledge Q&A (see Intended Use & Limitations).
- **Works with:** `llama.cpp`, `llama-cpp-python`, KoboldCPP, Text Generation WebUI, LM Studio, and other GGUF-compatible backends.
- **Hardware guidance (rule of thumb):** ~12–16 GB VRAM/RAM for comfortable batch-1 inference with Q4_K_XL; CPU-only works too (expect lower tokens/s).

> ⚠️ **Provenance & license:** This quant is produced from a base model claimed to be OpenAI’s
> `gpt-oss-20b`. Please **review and comply with the original model’s license/terms**. The GGUF
> quantization **inherits** those terms. See the **License** section.

## ❓ What’s Inside

- **Architecture:** decoder-only transformer (20B params)
- **Context window:** inherits the base model’s max context (not changed by quantization)
- **Vocabulary & tokenizer:** inherited from the base model (packed into the GGUF)
- **Quantization:** `Q4_K_XL` (grouped 4-bit with outlier handling; designed to preserve quality versus classic Q4)

#### Why Q4_K_XL?

- **Quality vs. size:** better perplexity retention than older 4-bit schemes while remaining small.
- **Speed:** efficient on modern CPUs (AVX2/AVX-512) and GPUs via `llama.cpp` backends.
- **When to pick it:** a balanced default for laptops/desktops when 5- or 6-bit variants are too large. For intuition about how grouped 4-bit quantization works, see the toy sketch below.
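
For intuition, here is a toy sketch of grouped 4-bit quantization with one scale per group, written with NumPy. It illustrates the general idea only, not llama.cpp’s actual Q4_K_XL block layout; the group size of 32, the symmetric scaling, and the lack of outlier handling are simplifying assumptions.

```python
# Toy illustration of grouped 4-bit quantization (NOT llama.cpp's Q4_K_XL codec).
# Assumptions: group size 32, symmetric scaling, no outlier handling.
import numpy as np

GROUP_SIZE = 32  # illustrative only; real K-quants use their own block layout


def quantize_q4_grouped(weights: np.ndarray):
    """Quantize a 1-D float array to 4-bit integer codes with one scale per group."""
    w = weights.reshape(-1, GROUP_SIZE)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each group into [-7, 7]
    scales[scales == 0] = 1.0                            # avoid division by zero
    codes = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return codes, scales


def dequantize_q4_grouped(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from 4-bit codes and per-group scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)


w = np.random.randn(4096).astype(np.float32)
codes, scales = quantize_q4_grouped(w)
w_hat = dequantize_q4_grouped(codes, scales)
print("mean absolute reconstruction error:", float(np.abs(w - w_hat).mean()))
```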

## 📝 Intended Use & Limitations

### Intended Use

- Instruction following, general dialogue
- Code assistance (reasoning, boilerplate, refactoring)
- Knowledge/Q&A within the model’s training cutoff

### Out-of-Scope / Known Limitations

- **Factuality:** may produce inaccurate or outdated info
- **Safety:** can emit biased or unsafe text; **apply your own filters/guardrails**
- **High-stakes decisions:** not for medical, legal, financial, or safety-critical use

## 🎯 Quick Start

### Run with `llama.cpp` (CLI)

```bash
# Download llama.cpp, build it, then run:
# -m  : path to this GGUF file
# -ngl: how many layers to offload to GPU (0 = CPU-only). Adjust to your VRAM.
# -c  : context tokens (do not exceed the base model's max)
# -t  : CPU threads
./main \
  -m ./bubba-20b-Q4_K_XL.gguf \
  -p "Write a short haiku about quantization." \
  -n 128 -t 8 -c 4096 -ngl 35
```

### Run as a server (OpenAI-style) with `llama.cpp`

```bash
# Starts a local HTTP server at http://localhost:8080/v1
./server -m ./bubba-20b-Q4_K_XL.gguf -c 4096 -ngl 35
```

Then call it:

```bash
# Minimal curl example against the llama.cpp server
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bubba-20b-Q4_K_XL.gguf",
    "prompt": "Explain Q4_K_XL quantization in one paragraph.",
    "max_tokens": 128
  }'
```
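
The same call from Python, using `requests`. This assumes the `llama.cpp` server from the previous block is running at `http://localhost:8080`; adjust the URL if you changed the port.

```python
# Same request as the curl example above, sent from Python.
# Assumes the llama.cpp server is listening at http://localhost:8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "bubba-20b-Q4_K_XL.gguf",
        "prompt": "Explain Q4_K_XL quantization in one paragraph.",
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```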

### Python (`llama-cpp-python`)

```python
"""
Example: minimal Python inference with llama-cpp-python.

Prereqs:
    pip install llama-cpp-python  # optionally with CUDA/Metal extras for GPU

Notes:
- Set n_ctx <= the base model's context length.
- Set n_gpu_layers > 0 to offload layers to GPU (if available).
"""
from llama_cpp import Llama

llm = Llama(
    model_path="bubba-20b-Q4_K_XL.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35,  # set to 0 for CPU-only
)

prompt = "You are a helpful assistant. In 3 bullet points, define K-quantization."
out = llm(prompt, max_tokens=160, temperature=0.7, top_p=0.9)
print(out["choices"][0]["text"])
```

## 📘 Text Generation WebUI

- Place the GGUF in `text-generation-webui/models/bubba-20b-Q4_K_XL/`
- Launch with the `llama.cpp` loader (or the `llama-cpp-python` backend)
- Select the model in the UI, then adjust **context length**, **GPU layers**, and **sampling**

## 🧩 KoboldCPP

```bash
./koboldcpp \
  -m bubba-20b-Q4_K_XL.gguf \
  --contextsize 4096 \
  --gpulayers 35 \
  --usecublas
```

## ⚡ LM Studio

1. Open **LM Studio** → **Models** → **Local models** → **Add local model** and select the `.gguf`.
2. In **Chat**, pick the model, set **Context length** (≤ the base model’s max), and adjust **GPU Layers**.
3. For API use, enable **Local Server** and target the exposed endpoint with OpenAI-compatible clients, as in the sketch below.
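
A minimal sketch of step 3 using the official `openai` Python package against a local OpenAI-compatible endpoint. The base URL below uses LM Studio’s usual default port (`1234`), and the model name is assumed to match the GGUF file name; check what your local server actually reports and adjust both.

```python
# Sketch: call a local OpenAI-compatible server (e.g., LM Studio's Local Server).
# Assumptions: server at http://localhost:1234/v1 (LM Studio's usual default) and
# the model exposed under the GGUF file name; adjust both to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="bubba-20b-Q4_K_XL.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user", "content": "Summarize Q4_K_XL quantization in two sentences."},
    ],
    max_tokens=128,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```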

## ❓ Prompting

This build is instruction-tuned (downstream behavior depends on the base model). Common prompt patterns work:

**Simple instruction**
```
Write a concise summary of the benefits of grouped 4-bit quantization.
```

**ChatML-like**
```
<|system|>
You are a helpful, concise assistant.
<|user|>
Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM.
<|assistant|>
```

**Code task**
```
Task: Write a Python function that computes perplexity given log-likelihoods.
Constraints: Include docstrings and type hints.
```

> **Tip:** Keep prompts **explicit and structured** (roles, constraints, examples).
> Suggested starting points: temperature 0.2–0.8, top_p 0.8–0.95, repeat_penalty 1.05–1.15.
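
If you would rather not hand-write role markers, `llama-cpp-python` can apply the chat template packed into the GGUF via `create_chat_completion`. A minimal sketch follows; whether that template matches the ChatML-like markers shown above depends on the metadata in this particular GGUF, so treat the exact formatting as an assumption.

```python
# Sketch: let llama-cpp-python apply the GGUF's built-in chat template.
# Whether that template matches the ChatML-like markers above depends on the
# metadata packed into this particular GGUF file.
from llama_cpp import Llama

llm = Llama(model_path="bubba-20b-Q4_K_XL.gguf", n_ctx=4096, n_gpu_layers=35)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user", "content": "Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM."},
    ],
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
)
print(resp["choices"][0]["message"]["content"])
```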

## ⚙️ Performance & Memory Guidance (Rules of Thumb)

- **RAM/VRAM for Q4_K_XL (20B):** ~12–16 GB for batch-1 inference (varies by backend and offloading); see the back-of-the-envelope sketch below.
- **Throughput:** Highly dependent on CPU/GPU, backend, context length, and GPU offload. Start with **`-ngl`** as high as your VRAM allows, then tune threads/batch sizes.
- **Context window:** Do not exceed the base model’s maximum (quantization does not increase it).
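
A back-of-the-envelope sketch of where the ~12–16 GB figure comes from. The ~4.5 effective bits per weight and the fp16 KV-cache layer/hidden-size numbers below are illustrative assumptions, not the real configuration of this model, so treat the output as a rough order-of-magnitude estimate only.

```python
# Back-of-the-envelope memory estimate for a 4-bit K-quantized 20B model.
# Assumptions (not the real model config): ~4.5 effective bits per weight
# (per-group scales add overhead beyond 4.0 bits) and an fp16 KV cache with
# placeholder layer count and hidden size. Real usage varies by backend.
PARAMS = 20e9
BITS_PER_WEIGHT = 4.5  # assumed effective rate, including per-group scales
weights_gib = PARAMS * BITS_PER_WEIGHT / 8 / 1024**3

# KV cache: 2 tensors (K and V) * layers * context * hidden size * 2 bytes (fp16).
LAYERS, HIDDEN, CTX = 44, 6144, 4096  # placeholders, not the actual architecture
kv_gib = 2 * LAYERS * CTX * HIDDEN * 2 / 1024**3

print(f"weights ~= {weights_gib:.1f} GiB, KV cache ~= {kv_gib:.1f} GiB at {CTX} ctx")
```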

## 💻 Files

- `bubba-20b-Q4_K_XL.gguf` — 4-bit K-quantized weights (XL variant)
- `tokenizer.*` — packed inside the GGUF (no separate files needed)

> **Integrity:** Verify your download (e.g., SHA-256) if a checksum is provided by the host/mirror.

## ⚙️ GGUF Format

1. Start from the base `gpt-oss-20b` weights (FP16/BF16).
2. Convert to GGUF with `llama.cpp`’s `convert` tooling (or the equivalent for the base architecture).
3. Quantize with `llama.cpp`’s `quantize` tool to **Q4_K_XL**.
4. Sanity-check perplexity/behavior (see the sketch below), then package with metadata.

> Exact scripts/commits may vary by environment; please share your pipeline for full reproducibility
> if you fork this card.
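
As a sketch of the sanity check in step 4: perplexity over a held-out text is the exponential of the mean negative log-likelihood per token. `llama.cpp` ships its own perplexity tool for this; the helper below is a hypothetical utility that only illustrates the formula, not part of the release.

```python
# Perplexity from per-token log-likelihoods:
#   PPL = exp(-(1/N) * sum_i log p(token_i | context_i))
# Hypothetical helper for illustration; llama.cpp has a dedicated perplexity tool.
import math
from typing import Sequence


def perplexity(token_log_likelihoods: Sequence[float]) -> float:
    """Compute perplexity from per-token natural-log likelihoods."""
    if not token_log_likelihoods:
        raise ValueError("need at least one token log-likelihood")
    mean_nll = -sum(token_log_likelihoods) / len(token_log_likelihoods)
    return math.exp(mean_nll)


print(perplexity([-1.2, -0.7, -2.3, -0.4]))  # ~3.16
```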

## 🏁 Safety, Bias & Responsible Use

Large language models can generate **plausible but incorrect or harmful** content and may reflect **societal biases**. If you deploy this model:

- Add **moderation/guardrails** and domain-specific filters.
- Provide **user disclaimers** and feedback channels.
- Keep a **human in the loop** for consequential outputs.

## 📝 License

- **Base model:** OpenAI `gpt-oss-20b` (as stated by the uploader). **You must review and follow the base model’s license/terms and any OpenAI restrictions that apply.**
- **This quantized GGUF:** distributed under the **same license/terms as the base model**. No additional rights are granted.

If you are the rights holder and see an issue with distribution, please open an issue on this repo/model card.

## 🧩 Attribution

If this quant helped you, consider citing it like:

```
bubba-20b-Q4_K_XL.gguf (2025).
Quantized GGUF build derived from OpenAI’s gpt-oss-20b.
Retrieved from the Hugging Face Hub.
```

## ❓ FAQ

**Does quantization change the context window or tokenizer?**
No. Those are inherited from the base model; quantization only changes the weight representation.

**Why am I hitting out-of-memory?**
Lower `-ngl` (fewer GPU layers), reduce context (`-c`), or switch to a smaller quant (e.g., Q3_K). Ensure no other large models occupy VRAM.

**Best sampler settings?**
Start with temperature 0.7, top_p 0.9, repeat_penalty 1.1. Lower the temperature for coding/planning; raise it for creative writing.

## 📝 Changelog

- **v1.0** — Initial release of `bubba-20b-Q4_K_XL.gguf`.

*Made with ❤️ by Bro — because the code (and the prompts) should just work.*