leeroy-jankins committed
Commit 3b1067d · verified · 1 Parent(s): 32a17d3

Update README.md
Files changed (1): README.md (+263 -3)

---
license: mit
---
###### bubba

**Quantized and fine-tuned GGUF based on OpenAI’s `gpt-oss-20b`**
Format: **GGUF** (for `llama.cpp` and compatible runtimes) • Quantization: **Q4_K_XL (4-bit, K-grouped, extra-low loss)**
File: `bubba-20b-Q4_K_XL.gguf`

## 🧠 Overview

- This repo provides a **4-bit K-quantized** `.gguf` for fast local inference of a 20B-parameter model derived from **OpenAI’s `gpt-oss-20b`** (as reported by the uploader).
- **Use cases:** general chat/instruction following, coding help, knowledge Q&A (see Intended Use & Limitations).
- **Works with:** `llama.cpp`, `llama-cpp-python`, KoboldCPP, Text Generation WebUI, LM Studio, and other GGUF-compatible backends.
- **Hardware guidance (rule of thumb):** ~12–16 GB VRAM/RAM for comfortable batch-1 inference with Q4_K_XL; CPU-only works too (expect lower tokens/s).

> ⚠️ **Provenance & license**: This quant is produced from a base model claimed to be OpenAI’s `gpt-oss-20b`. Please **review and comply with the original model’s license/terms**. The GGUF quantization **inherits** those terms. See the **License** section.

## ❓ What’s Inside

- **Architecture:** decoder-only transformer (20B params)
- **Context window:** inherits base model’s max context (not changed by quantization)
- **Vocabulary & tokenizer:** inherited from base model (packed into GGUF)
- **Quantization:** `Q4_K_XL` (grouped 4-bit with outlier handling; designed to preserve quality versus classic Q4)

#### Why Q4_K_XL?

- **Quality vs size:** better perplexity retention than older 4-bit schemes while remaining small.
- **Speed:** efficient on modern CPUs (AVX2/AVX-512) and GPUs via `llama.cpp` backends.
- **When to pick it:** balanced default for laptops/desktops when 5- or 6-bit variants are too large.

## 📝 Intended Use & Limitations

### Intended Use

- Instruction following, general dialogue
- Code assistance (reasoning, boilerplate, refactoring)
- Knowledge/Q&A within the model’s training cutoff

### Out-of-Scope / Known Limitations

- **Factuality:** may produce inaccurate or outdated info
- **Safety:** can emit biased or unsafe text; **apply your own filters/guardrails**
- **High-stakes decisions:** not for medical, legal, financial, or safety-critical use

## 🎯 Quick Start

### Run with `llama.cpp` (CLI)

```bash
# Download llama.cpp, build it, then run.
# Note: newer llama.cpp builds name this binary `llama-cli` instead of `main`.
# -m  : path to this GGUF file
# -ngl: how many layers to offload to GPU (0 = CPU-only). Adjust to your VRAM.
# -c  : context tokens (do not exceed the base model’s max)
# -t  : CPU threads
# -n  : number of tokens to generate
./main \
  -m ./bubba-20b-Q4_K_XL.gguf \
  -p "Write a short haiku about quantization." \
  -n 128 -t 8 -c 4096 -ngl 35
```

### Run as a server (OpenAI-style) with `llama.cpp`

```bash
# Starts a local HTTP server at http://localhost:8080 with OpenAI-style routes under /v1.
# Note: newer llama.cpp builds name this binary `llama-server` instead of `server`.
./server -m ./bubba-20b-Q4_K_XL.gguf -c 4096 -ngl 35
```

Then call it:

```bash
# Minimal curl example against the llama.cpp server
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bubba-20b-Q4_K_XL.gguf",
    "prompt": "Explain Q4_K_XL quantization in one paragraph.",
    "max_tokens": 128
  }'
```

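The same local server can also be driven from Python with any OpenAI-compatible client. Below is a minimal sketch using the `openai` package; the port and model name mirror the commands above, and the API key is a dummy value since the local server does not require one by default (all of this assumes the default server settings).

```python
# Minimal sketch: call the llama.cpp server above via its OpenAI-compatible API.
# Assumes the server command shown earlier (port 8080); the API key is a dummy value.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="bubba-20b-Q4_K_XL.gguf",   # model name as exposed by the server
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain Q4_K_XL quantization in one paragraph."},
    ],
    max_tokens=128,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```
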
### Python (`llama-cpp-python`)

```python
"""
Example: Minimal Python inference with llama-cpp-python.

Prereqs:
    pip install llama-cpp-python  # Optionally with CUDA/Metal extras for GPU

Notes:
    - Set n_ctx <= base model's context length.
    - n_gpu_layers > 0 to offload layers to GPU (if available).
"""
from llama_cpp import Llama

llm = Llama(
    model_path="bubba-20b-Q4_K_XL.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35,  # set to 0 for CPU-only
)

prompt = "You are a helpful assistant. In 3 bullet points, define K-quantization."
out = llm(prompt, max_tokens=160, temperature=0.7, top_p=0.9)
print(out["choices"][0]["text"])
```

## 📘 Text Generation WebUI

- Place the GGUF in `text-generation-webui/models/bubba-20b-Q4_K_XL/`
- Launch with the `llama.cpp` loader (or `llama-cpp-python` backend)
- Select the model in the UI, adjust **context length**, **GPU layers**, and **sampling**

## 🧩 KoboldCPP

```bash
./koboldcpp \
  -m bubba-20b-Q4_K_XL.gguf \
  --contextsize 4096 \
  --gpulayers 35 \
  --usecublas
```

## ⚡ LM Studio

1. Open **LM Studio** → **Models** → **Local models** → **Add local model** and select the `.gguf`.
2. In **Chat**, pick the model, set **Context length** (≤ base model max), and adjust **GPU Layers**.
3. For API use, enable **Local Server** and target the exposed endpoint with OpenAI-compatible clients.

## ❓ Prompting

This build is instruction-tuned (downstream behavior depends on the base model and fine-tune). Common prompt patterns work; a chat-API sketch follows the examples below.

**Simple instruction**
```
Write a concise summary of the benefits of grouped 4-bit quantization.
```

**ChatML-like**
```
<|system|>
You are a helpful, concise assistant.
<|user|>
Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM.
<|assistant|>
```

**Code task**
```
Task: Write a Python function that computes perplexity given log-likelihoods.
Constraints: Include docstrings and type hints.
```

> **Tip:** Keep prompts **explicit and structured** (roles, constraints, examples).
> Suggested starting points: temperature 0.2–0.8, top_p 0.8–0.95, repeat_penalty 1.05–1.15.

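Instead of hand-rolling role tags, you can let the runtime apply the chat template packed into the GGUF (when one is present). Here is a minimal sketch using `llama-cpp-python`’s `create_chat_completion`, reusing the model path and loader settings from the Quick Start; treat the sampling values as starting points, not tuned defaults.

```python
# Minimal sketch: chat-style prompting via llama-cpp-python, which applies the
# chat template embedded in the GGUF metadata (if present) instead of manual tags.
# Model path and loader settings mirror the Quick Start example above.
from llama_cpp import Llama

llm = Llama(model_path="bubba-20b-Q4_K_XL.gguf", n_ctx=4096, n_gpu_layers=35)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user", "content": "Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM."},
    ],
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
)
print(resp["choices"][0]["message"]["content"])
```
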
## ⚙️ Performance & Memory Guidance (Rules of Thumb)

- **RAM/VRAM for Q4_K_XL (20B):** ~12–16 GB for batch-1 inference (varies by backend and offloading); see the back-of-the-envelope estimate below.
- **Throughput:** highly dependent on CPU/GPU, backend, context length, and GPU offload. Start with **`-ngl`** as high as your VRAM allows, then tune threads/batch sizes.
- **Context window:** do not exceed the base model’s maximum (quantization does not increase it).

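Where the 12–16 GB figure roughly comes from: weight memory is parameter count times effective bits per weight, plus KV cache and runtime overhead. The sketch below is illustrative arithmetic only; the effective bits/weight for Q4_K_XL and the per-token KV-cache cost are assumed round numbers, not measured values for this file.

```python
# Back-of-the-envelope memory estimate (illustrative only; constants are assumptions).
params = 20e9                 # ~20B parameters
bits_per_weight = 4.8         # assumed effective rate for a 4-bit K-quant incl. scales/outliers
weight_gib = params * bits_per_weight / 8 / 2**30
print(f"Weights: ~{weight_gib:.1f} GiB")           # ≈ 11.2 GiB

ctx = 4096                    # context length used in the examples above
kv_bytes_per_token = 0.5e6    # assumed KV-cache cost per token (depends on layers/heads/precision)
kv_gib = ctx * kv_bytes_per_token / 2**30
print(f"KV cache @ {ctx} ctx: ~{kv_gib:.1f} GiB")  # ≈ 1.9 GiB

overhead_gib = 1.0            # scratch buffers, runtime, tokenizer, etc. (rough)
print(f"Total: ~{weight_gib + kv_gib + overhead_gib:.0f} GiB")  # lands in the ~12–16 GiB band
```
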
## 💻 Files

- `bubba-20b-Q4_K_XL.gguf` — 4-bit K-quantized weights (XL variant)
- `tokenizer.*` — packed inside the GGUF (no separate files needed)

> **Integrity:** Verify your download (e.g., SHA256) if a checksum is provided by the host/mirror; a minimal check is sketched below.

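A minimal checksum check in Python, using only the standard library; the expected digest is a placeholder, so substitute the value published on the model page or mirror if one is available.

```python
# Minimal sketch: verify the downloaded GGUF against a published SHA256 digest.
# EXPECTED is a placeholder; replace it with the checksum from the host/mirror.
import hashlib

EXPECTED = "<sha256-from-the-model-page>"  # placeholder, not a real digest

h = hashlib.sha256()
with open("bubba-20b-Q4_K_XL.gguf", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)

digest = h.hexdigest()
print(digest)
print("OK" if digest == EXPECTED else "MISMATCH: re-download or check the source")
```
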
## ⚙️ GGUF Format

1. Start from the base `gpt-oss-20b` weights (FP16/BF16).
2. Convert to GGUF with `llama.cpp`’s `convert` tooling (or equivalent for the base arch).
3. Quantize with `llama.cpp` `quantize` to **Q4_K_XL**.
4. Sanity-check perplexity/behavior, package with metadata.

> Exact scripts/commits may vary by environment; please share your pipeline for full reproducibility
> if you fork this card. An illustrative sketch of steps 2–3 follows.

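For orientation, one way to drive steps 2–3 from Python is sketched below. The script name (`convert_hf_to_gguf.py`), the `llama-quantize` binary, the flags, and the quant-type string are assumptions based on current `llama.cpp` tooling rather than the exact pipeline used for this file; stock builds expose types such as `Q4_K_S`/`Q4_K_M`, so check which string your tooling uses for the XL variant.

```python
# Illustrative sketch only: not the exact pipeline used to produce this file.
# Script/binary names and flags are assumptions based on current llama.cpp tooling;
# adjust paths and the quant-type string to your checkout.
import subprocess

BASE_DIR = "gpt-oss-20b"              # directory holding the FP16/BF16 base weights (assumed)
F16_GGUF = "bubba-20b-f16.gguf"       # intermediate full-precision GGUF
OUT_GGUF = "bubba-20b-Q4_K_XL.gguf"   # final quantized file
QUANT    = "Q4_K_M"                   # placeholder type; swap in the XL variant your tooling offers

# Step 2: convert the base weights to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", BASE_DIR, "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 3: quantize the full-precision GGUF down to a 4-bit K-quant.
subprocess.run(["./llama-quantize", F16_GGUF, OUT_GGUF, QUANT], check=True)
```
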
## 🏁 Safety, Bias & Responsible Use

Large language models can generate **plausible but incorrect or harmful** content and may reflect **societal biases**. If you deploy this model:

- Add **moderation/guardrails** and domain-specific filters.
- Provide **user disclaimers** and feedback channels.
- Keep a **human in the loop** for consequential outputs.

## 📝 License

- **Base model:** OpenAI `gpt-oss-20b` (as stated by the uploader). **You must review and follow the base model’s license/terms and any OpenAI restrictions that apply.**
- **This quantized GGUF:** distributed under the **same license/terms as the base model**. No additional rights are granted.

If you are the rights holder and see an issue with distribution, please open an issue on this repo/model card.

## 🧩 Attribution

If this quant helped you, consider citing it like this:

```
bubba-20b-Q4_K_XL.gguf (2025).
Quantized GGUF build derived from OpenAI’s gpt-oss-20b.
Retrieved from the Hugging Face Hub.
```

## ❓ FAQ

**Does quantization change the context window or tokenizer?**
No. Those are inherited from the base model; quantization only changes the weight representation (see the sketch after this FAQ).

**Why am I hitting out-of-memory?**
Lower `-ngl` (fewer GPU layers), reduce context (`-c`), or switch to a smaller quant (e.g., Q3_K).
Ensure no other large models occupy VRAM.

**Best sampler settings?**
Start with temp 0.7, top_p 0.9, repeat_penalty 1.1.
Lower temperature for coding/planning; raise it for creative writing.

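To check the first answer against the file itself, you can read the training context length and exercise the packed tokenizer directly. The sketch below uses `llama-cpp-python`; `n_ctx_train()` and the tokenize/detokenize round-trip are the relevant calls (method names assume a recent llama-cpp-python release).

```python
# Quick sanity check that context length and tokenizer come from the GGUF itself.
# n_ctx_train() reports the base model's training context length stored in the file;
# tokenize/detokenize exercise the vocabulary packed into the GGUF.
from llama_cpp import Llama

llm = Llama(model_path="bubba-20b-Q4_K_XL.gguf", n_ctx=512, n_gpu_layers=0, verbose=False)

print("Training context length:", llm.n_ctx_train())

tokens = llm.tokenize(b"Quantization keeps the tokenizer intact.")
print("Token count:", len(tokens))
print("Round-trip:", llm.detokenize(tokens).decode("utf-8", errors="replace"))
```
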
## 📝 Changelog

- **v1.0** — Initial release of `bubba-20b-Q4_K_XL.gguf`.

*Made with ❤️ by Bro — because the code (and the prompts) should just work.*