leeroy-jankins committed
Commit 3b1067d · verified · 1 Parent(s): 32a17d3

Update README.md
Files changed (1): README.md (+263 -3)

---
license: mit
---
###### bubba

**Quantized and fine-tuned GGUF based on OpenAI’s `gpt-oss-20b`**
Format: **GGUF** (for `llama.cpp` and compatible runtimes) • Quantization: **Q4_K_XL (4-bit, K-grouped, extra-low loss)**
File: `bubba-20b-Q4_K_XL.gguf`

## 🧠 Overview

- This repo provides a **4-bit K-quantized** `.gguf` for fast local inference of a 20B-parameter model derived from **OpenAI’s `gpt-oss-20b`** (as reported by the uploader).
- **Use cases:** general chat/instruction following, coding help, knowledge Q&A (see Intended Use & Limitations).
- **Works with:** `llama.cpp`, `llama-cpp-python`, KoboldCPP, Text Generation WebUI, LM Studio, and other GGUF-compatible backends.
- **Hardware guidance (rule of thumb):** ~12–16 GB VRAM/RAM for comfortable batch-1 inference with Q4_K_XL; CPU-only works too (expect lower tokens/s).

> ⚠️ **Provenance & license**: This quant is produced from a base model claimed to be OpenAI’s `gpt-oss-20b`. Please **review and comply with the original model’s license/terms**. The GGUF quantization **inherits** those terms. See the **License** section.

## ❓ What’s Inside

- **Architecture:** decoder-only transformer (20B params)
- **Context window:** inherits base model’s max context (not changed by quantization)
- **Vocabulary & tokenizer:** inherited from base model (packed into GGUF)
- **Quantization:** `Q4_K_XL` (grouped 4-bit with outlier handling; designed to preserve quality versus classic Q4)

#### Why Q4_K_XL?

- **Quality vs size:** better perplexity retention than older 4-bit schemes while remaining small.
- **Speed:** efficient on modern CPUs (AVX2/AVX-512) and GPUs via `llama.cpp` backends.
- **When to pick it:** balanced default for laptops/desktops when 5- or 6-bit variants are too large.

## 📝 Intended Use & Limitations

### Intended Use

- Instruction following, general dialogue
- Code assistance (reasoning, boilerplate, refactoring)
- Knowledge/Q&A within the model’s training cutoff

### Out-of-Scope / Known Limitations

- **Factuality:** may produce inaccurate or outdated info
- **Safety:** can emit biased or unsafe text; **apply your own filters/guardrails**
- **High-stakes decisions:** not for medical, legal, financial, or safety-critical use

## 🎯 Quick Start

### Run with `llama.cpp` (CLI)

```bash
# Download llama.cpp, build it, then run.
# Note: newer llama.cpp builds name this binary `llama-cli` instead of `main`.
# -m  : path to this GGUF file
# -ngl: how many layers to offload to GPU (0 = CPU-only). Adjust to your VRAM.
# -c  : context tokens (do not exceed the base model’s max)
# -t  : CPU threads
# -n  : number of tokens to generate
./main \
  -m ./bubba-20b-Q4_K_XL.gguf \
  -p "Write a short haiku about quantization." \
  -n 128 -t 8 -c 4096 -ngl 35
```

### Run as a server (OpenAI-style) with `llama.cpp`

```bash
# Starts a local HTTP server at http://localhost:8080 with OpenAI-style routes under /v1.
# Note: newer llama.cpp builds name this binary `llama-server` instead of `server`.
./server -m ./bubba-20b-Q4_K_XL.gguf -c 4096 -ngl 35
```

Then call it:

```bash
# Minimal curl example against the llama.cpp server
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bubba-20b-Q4_K_XL.gguf",
    "prompt": "Explain Q4_K_XL quantization in one paragraph.",
    "max_tokens": 128
  }'
```

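The same local server can also be driven from Python with any OpenAI-compatible client. Below is a minimal sketch using the `openai` package; the port and model name mirror the commands above, and the API key is a dummy value since the local server does not require one by default (all of this assumes the default server settings).

```python
# Minimal sketch: call the llama.cpp server above via its OpenAI-compatible API.
# Assumes the server command shown earlier (port 8080); the API key is a dummy value.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="bubba-20b-Q4_K_XL.gguf",   # model name as exposed by the server
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain Q4_K_XL quantization in one paragraph."},
    ],
    max_tokens=128,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```
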
### Python (`llama-cpp-python`)

```python
"""
Example: Minimal Python inference with llama-cpp-python.

Prereqs:
    pip install llama-cpp-python  # Optionally with CUDA/Metal extras for GPU

Notes:
    - Set n_ctx <= base model's context length.
    - n_gpu_layers > 0 to offload layers to GPU (if available).
"""
from llama_cpp import Llama

llm = Llama(
    model_path="bubba-20b-Q4_K_XL.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=35,  # set to 0 for CPU-only
)

prompt = "You are a helpful assistant. In 3 bullet points, define K-quantization."
out = llm(prompt, max_tokens=160, temperature=0.7, top_p=0.9)
print(out["choices"][0]["text"])
```

## 📘 Text Generation WebUI

- Place the GGUF in `text-generation-webui/models/bubba-20b-Q4_K_XL/`
- Launch with the `llama.cpp` loader (or `llama-cpp-python` backend)
- Select the model in the UI, adjust **context length**, **GPU layers**, and **sampling**

## 🧩 KoboldCPP

```bash
./koboldcpp \
  -m bubba-20b-Q4_K_XL.gguf \
  --contextsize 4096 \
  --gpulayers 35 \
  --usecublas
```

## ⚡ LM Studio

1. Open **LM Studio** → **Models** → **Local models** → **Add local model** and select the `.gguf`.
2. In **Chat**, pick the model, set **Context length** (≤ base model max), and adjust **GPU Layers**.
3. For API use, enable **Local Server** and target the exposed endpoint with OpenAI-compatible clients.

## ❓ Prompting

This build is instruction-tuned (downstream behavior depends on the base model and fine-tune). Common prompt patterns work; a chat-API sketch follows the examples below.

**Simple instruction**
```
Write a concise summary of the benefits of grouped 4-bit quantization.
```

**ChatML-like**
```
<|system|>
You are a helpful, concise assistant.
<|user|>
Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM.
<|assistant|>
```

**Code task**
```
Task: Write a Python function that computes perplexity given log-likelihoods.
Constraints: Include docstrings and type hints.
```

> **Tip:** Keep prompts **explicit and structured** (roles, constraints, examples).
> Suggested starting points: temperature 0.2–0.8, top_p 0.8–0.95, repeat_penalty 1.05–1.15.

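Instead of hand-rolling role tags, you can let the runtime apply the chat template packed into the GGUF (when one is present). Here is a minimal sketch using `llama-cpp-python`’s `create_chat_completion`, reusing the model path and loader settings from the Quick Start; treat the sampling values as starting points, not tuned defaults.

```python
# Minimal sketch: chat-style prompting via llama-cpp-python, which applies the
# chat template embedded in the GGUF metadata (if present) instead of manual tags.
# Model path and loader settings mirror the Quick Start example above.
from llama_cpp import Llama

llm = Llama(model_path="bubba-20b-Q4_K_XL.gguf", n_ctx=4096, n_gpu_layers=35)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user", "content": "Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM."},
    ],
    max_tokens=200,
    temperature=0.7,
    top_p=0.9,
)
print(resp["choices"][0]["message"]["content"])
```
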
## ⚙️ Performance & Memory Guidance (Rules of Thumb)

- **RAM/VRAM for Q4_K_XL (20B):** ~12–16 GB for batch-1 inference (varies by backend and offloading); see the back-of-the-envelope estimate below.
- **Throughput:** highly dependent on CPU/GPU, backend, context length, and GPU offload. Start with **`-ngl`** as high as your VRAM allows, then tune threads/batch sizes.
- **Context window:** do not exceed the base model’s maximum (quantization does not increase it).

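Where the 12–16 GB figure roughly comes from: weight memory is parameter count times effective bits per weight, plus KV cache and runtime overhead. The sketch below is illustrative arithmetic only; the effective bits/weight for Q4_K_XL and the per-token KV-cache cost are assumed round numbers, not measured values for this file.

```python
# Back-of-the-envelope memory estimate (illustrative only; constants are assumptions).
params = 20e9                 # ~20B parameters
bits_per_weight = 4.8         # assumed effective rate for a 4-bit K-quant incl. scales/outliers
weight_gib = params * bits_per_weight / 8 / 2**30
print(f"Weights: ~{weight_gib:.1f} GiB")           # ≈ 11.2 GiB

ctx = 4096                    # context length used in the examples above
kv_bytes_per_token = 0.5e6    # assumed KV-cache cost per token (depends on layers/heads/precision)
kv_gib = ctx * kv_bytes_per_token / 2**30
print(f"KV cache @ {ctx} ctx: ~{kv_gib:.1f} GiB")  # ≈ 1.9 GiB

overhead_gib = 1.0            # scratch buffers, runtime, tokenizer, etc. (rough)
print(f"Total: ~{weight_gib + kv_gib + overhead_gib:.0f} GiB")  # lands in the ~12–16 GiB band
```
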
## 💻 Files

- `bubba-20b-Q4_K_XL.gguf` — 4-bit K-quantized weights (XL variant)
- `tokenizer.*` — packed inside the GGUF (no separate files needed)

> **Integrity:** Verify your download (e.g., SHA256) if a checksum is provided by the host/mirror; a minimal check is sketched below.

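A minimal checksum check in Python, using only the standard library; the expected digest is a placeholder, so substitute the value published on the model page or mirror if one is available.

```python
# Minimal sketch: verify the downloaded GGUF against a published SHA256 digest.
# EXPECTED is a placeholder; replace it with the checksum from the host/mirror.
import hashlib

EXPECTED = "<sha256-from-the-model-page>"  # placeholder, not a real digest

h = hashlib.sha256()
with open("bubba-20b-Q4_K_XL.gguf", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)

digest = h.hexdigest()
print(digest)
print("OK" if digest == EXPECTED else "MISMATCH: re-download or check the source")
```
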
## ⚙️ GGUF Format

1. Start from the base `gpt-oss-20b` weights (FP16/BF16).
2. Convert to GGUF with `llama.cpp`’s `convert` tooling (or equivalent for the base arch).
3. Quantize with `llama.cpp` `quantize` to **Q4_K_XL**.
4. Sanity-check perplexity/behavior, package with metadata.

> Exact scripts/commits may vary by environment; please share your pipeline for full reproducibility
> if you fork this card. An illustrative sketch of steps 2–3 follows.

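For orientation, one way to drive steps 2–3 from Python is sketched below. The script name (`convert_hf_to_gguf.py`), the `llama-quantize` binary, the flags, and the quant-type string are assumptions based on current `llama.cpp` tooling rather than the exact pipeline used for this file; stock builds expose types such as `Q4_K_S`/`Q4_K_M`, so check which string your tooling uses for the XL variant.

```python
# Illustrative sketch only: not the exact pipeline used to produce this file.
# Script/binary names and flags are assumptions based on current llama.cpp tooling;
# adjust paths and the quant-type string to your checkout.
import subprocess

BASE_DIR = "gpt-oss-20b"              # directory holding the FP16/BF16 base weights (assumed)
F16_GGUF = "bubba-20b-f16.gguf"       # intermediate full-precision GGUF
OUT_GGUF = "bubba-20b-Q4_K_XL.gguf"   # final quantized file
QUANT    = "Q4_K_M"                   # placeholder type; swap in the XL variant your tooling offers

# Step 2: convert the base weights to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", BASE_DIR, "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 3: quantize the full-precision GGUF down to a 4-bit K-quant.
subprocess.run(["./llama-quantize", F16_GGUF, OUT_GGUF, QUANT], check=True)
```
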
## 🏁 Safety, Bias & Responsible Use

Large language models can generate **plausible but incorrect or harmful** content and may reflect **societal biases**. If you deploy this model:

- Add **moderation/guardrails** and domain-specific filters.
- Provide **user disclaimers** and feedback channels.
- Keep a **human in the loop** for consequential outputs.

## 📝 License

- **Base model:** OpenAI `gpt-oss-20b` (as stated by the uploader). **You must review and follow the base model’s license/terms and any OpenAI restrictions that apply.**
- **This quantized GGUF:** distributed under the **same license/terms as the base model**. No additional rights are granted.

If you are the rights holder and see an issue with distribution, please open an issue on this repo/model card.

## 🧩 Attribution

If this quant helped you, consider citing it like this:

```
bubba-20b-Q4_K_XL.gguf (2025).
Quantized GGUF build derived from OpenAI’s gpt-oss-20b.
Retrieved from the Hugging Face Hub.
```

## ❓ FAQ

**Does quantization change the context window or tokenizer?**
No. Those are inherited from the base model; quantization only changes the weight representation (see the sketch after this FAQ).

**Why am I hitting out-of-memory?**
Lower `-ngl` (fewer GPU layers), reduce context (`-c`), or switch to a smaller quant (e.g., Q3_K).
Ensure no other large models occupy VRAM.

**Best sampler settings?**
Start with temp 0.7, top_p 0.9, repeat_penalty 1.1.
Lower temperature for coding/planning; raise it for creative writing.

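To check the first answer against the file itself, you can read the training context length and exercise the packed tokenizer directly. The sketch below uses `llama-cpp-python`; `n_ctx_train()` and the tokenize/detokenize round-trip are the relevant calls (method names assume a recent llama-cpp-python release).

```python
# Quick sanity check that context length and tokenizer come from the GGUF itself.
# n_ctx_train() reports the base model's training context length stored in the file;
# tokenize/detokenize exercise the vocabulary packed into the GGUF.
from llama_cpp import Llama

llm = Llama(model_path="bubba-20b-Q4_K_XL.gguf", n_ctx=512, n_gpu_layers=0, verbose=False)

print("Training context length:", llm.n_ctx_train())

tokens = llm.tokenize(b"Quantization keeps the tokenizer intact.")
print("Token count:", len(tokens))
print("Round-trip:", llm.detokenize(tokens).decode("utf-8", errors="replace"))
```
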
## 📝 Changelog

- **v1.0** — Initial release of `bubba-20b-Q4_K_XL.gguf`.

*Made with ❤️ by Bro — because the code (and the prompts) should just work.*