
Bubba is a fine-tuned LLM based on OpenAI’s gpt-oss-20b. This release packages the fine-tuned weights (or adapters) for practical, low-latency instruction following, summarization, reasoning, and light code generation. It is intended for local or self-hosted environments and RAG (Retrieval-Augmented Generation) stacks that require predictable, fast outputs.
Quantized and fine-tuned GGUF based on OpenAI’s gpt-oss-20b
Format: GGUF (for llama.cpp and compatible runtimes) • Quantization: Q4_K_XL (4-bit, K-grouped, extra-low loss)
File: bubba-20b-Q4_K_XL.gguf
🧠 Overview
- This repo provides a 4-bit K-quantized .gguf for fast local inference of a 20B-parameter model derived from OpenAI’s gpt-oss-20b (as reported by the uploader).
- Use cases: general chat/instruction following, coding help, knowledge Q&A (see Intended Use & Limitations).
- Works with: llama.cpp, llama-cpp-python, KoboldCPP, Text Generation WebUI, LM Studio, and other GGUF-compatible backends (a minimal llama-cpp-python sketch follows this list).
- Hardware guidance (rule of thumb): ~12–16 GB VRAM/RAM for comfortable batch-1 inference with Q4_K_XL; CPU-only works too (expect lower tokens/s).
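The sketch below runs the GGUF directly through llama-cpp-python; the model path, layer count, and context size are placeholders to adjust for your hardware (assumes pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="bubba-20b-Q4_K_XL.gguf",  # path to the downloaded GGUF
    n_ctx=4096,                           # keep within the base model's context limit
    n_gpu_layers=35,                      # set to 0 for CPU-only inference
)
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are concise and factual."},
        {"role": "user", "content": "Summarize grouped 4-bit quantization in two sentences."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])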
Key Features
- Instruction-tuned derivative of gpt-oss-20b for concise, helpful responses.
- Optimized defaults for short to medium prompts; strong compatibility with RAG pipelines.
- Flexible distribution: full finetuned weights or lightweight LoRA/QLoRA adapters.
- Compatible with popular runtimes and libraries (Transformers, PEFT, vLLM, Text Generation Inference).
⚠️ Provenance & license: This quant is produced from a base model claimed to be OpenAI’s gpt-oss-20b. Please review and comply with the original model’s license/terms. The GGUF quantization inherits those terms. See the License section.
⚙️ Vectorized Datasets
Vectorization is the process of converting textual data into numerical vectors; it is usually applied after the text has been cleaned. It can improve execution speed and reduce the training time of your code (a minimal embedding sketch follows the list below). BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- Appropriations - Enacted appropriations from 1996-2024, available for fine-tuning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
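As an illustration of vectorization in general (not tied to the BudgetPy stores above), the sketch below embeds two short dataset descriptions with the same open-source encoder used in the RAG example later in this card; the model choice is an assumption.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional sentence embeddings
texts = [
    "SF-133: Report on Budget Execution and Budgetary Resources",
    "OMB Circular A-11: preparation, submission, and execution of the federal budget",
]
vectors = encoder.encode(texts, normalize_embeddings=True)  # shape: (2, 384)
print(vectors.shape)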
Technical Specifications
Property | Value / Guidance |
---|---|
Base model | gpt-oss-20b (decoder-only Transformer) |
Parameters | ~20B (as per upstream) |
Tokenizer | Use the upstream tokenizer associated with gpt-oss-20b |
Context window | Determined by the upstream base; set accordingly in your runtime |
Fine-tuning | Supervised Fine-Tuning (SFT); optional preference optimization (DPO/ORPO) |
Precision | FP16/BF16 recommended; 4-bit (bnb) for single-GPU experimentation |
Intended runtimes | Hugging Face Transformers, PEFT, vLLM, TGI (Text Generation Inference) |
Note: Please adjust any specifics (context length, tokenizer name) to match the exact upstream build you use for gpt-oss-20b.
Files
File / Folder | Description |
---|---|
README.md | This model card |
config.json / tokenizer files | Configuration and tokenizer artifacts (from upstream) |
pytorch_model.safetensors | Full fine-tuned weights (if released as full model) |
adapter_model.safetensors | LoRA/QLoRA adapters only (if released as adapters) |
training_args.json (optional) | Minimal training configuration for reproducibility |
Only one of “full weights” or “adapters” may be included depending on how you distribute Bubba.
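If you receive the adapter-only distribution and prefer a single standalone checkpoint, a minimal PEFT sketch for merging the adapters into the base follows; the repo names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",                  # replace with the exact upstream base you use
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "your-namespace/Bubba-gpt-oss-20b-finetuned")
merged = model.merge_and_unload()          # fold the LoRA deltas into the base weights
merged.save_pretrained("bubba-merged")     # save as a full-weight model directory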
📝 Intended Use & Limitations
Intended Use
- Instruction following, general dialogue
- Code assistance (reasoning, boilerplate, refactoring)
- Knowledge/Q&A within the model’s training cutoff
Out-of-Scope / Known Limitations
- Factuality: may produce inaccurate or outdated info
- Safety: can emit biased or unsafe text; apply your own filters/guardrails
- High-stakes decisions: not for medical, legal, financial, or safety-critical use
🎯 Quick Start
Examples: Using the Bubba LLM (Fine-tuned from gpt-oss-20b)
This guide shows several ways to run Bubba locally or on a server. Examples cover full weights, LoRA/QLoRA adapters, vLLM, and Text Generation Inference (TGI), plus prompt patterns and RAG.
🐍 Python (Transformers) — Full Weights
Install
pip install "transformers>=4.44.0" accelerate torch --upgrade
Load and generate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = "In 5 bullet points, explain retrieval-augmented generation and when to use it."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9
)
print(tok.decode(out[0], skip_special_tokens=True))
Notes
• device_map="auto" will place weights across available GPUs/CPU.
• Prefer BF16 if supported; otherwise FP16. For VRAM-constrained experiments, see 4-bit below.
🧩 Python (PEFT) — Adapters on Top of the Base
Install
pip install "transformers>=4.44.0" peft accelerate torch --upgrade
Load base + LoRA/QLoRA adapters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_name = "openai/gpt-oss-20b" # replace with the exact upstream base you use
lora_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(base_name, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
base_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base, lora_name)
prompt = "Draft a JSON spec with keys: goal, steps[], risks[], success_metric."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
💾 4-bit (bitsandbytes) — Memory-Efficient Loading
Install
pip install "transformers>=4.44.0" accelerate bitsandbytes --upgrade
Load with 4-bit quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb,
device_map="auto"
)
prompt = "Explain beam search vs. nucleus sampling in three short bullets."
inputs = tok(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
out = model.generate(**inputs, max_new_tokens=160, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
🚀 Serve with vLLM (OpenAI-compatible API)
Install and launch (example)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model your-namespace/Bubba-gpt-oss-20b-finetuned \
--dtype bfloat16 --max-model-len 8192 \
--port 8000
Call the endpoint (Python)
import requests, json
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": "your-namespace/Bubba-gpt-oss-20b-finetuned",
"messages": [
{"role": "system", "content": "You are concise and factual."},
{"role": "user", "content": "Give a 4-step checklist for evaluating a RAG pipeline."}
],
"temperature": 0.7,
"max_tokens": 256,
"stream": True
}
with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r:
for line in r.iter_lines():
if line and line.startswith(b"data: "):
chunk = line[len(b"data: "):].decode("utf-8")
if chunk == "[DONE]":
break
print(chunk, flush=True)
📦 Serve with Text Generation Inference (TGI)
Run the server (Docker)
docker run --gpus all --shm-size 1g -p 8080:80 \
-e MODEL_ID=your-namespace/Bubba-gpt-oss-20b-finetuned \
ghcr.io/huggingface/text-generation-inference:latest
Call the server (HTTP)
curl http://localhost:8080/generate \
-X POST -d '{
"inputs": "Summarize pros/cons of hybrid search (BM25 + embeddings).",
"parameters": {"max_new_tokens": 200, "temperature": 0.7, "top_p": 0.9}
}' \
-H "Content-Type: application/json"
🧠 Prompt Patterns
Direct instruction (concise)
You are a precise assistant. In 6 bullets, explain evaluation metrics for retrieval (Recall@k,
MRR, nDCG). Keep each bullet under 20 words.
Constrained JSON output
System: Output only valid JSON. No prose.
User: Produce {"goal":"", "steps":[""], "risks":[""], "metrics":[""]} for testing a QA bot.
Guarded answer
If the answer isn’t derivable from the context, say “I don’t know” and ask for the missing info.
Few-shot structure
Example:
Q: Map 3 tasks to suitable embedding dimensions.
A: 256: short titles; 768: support FAQs; 1024: multi-paragraph knowledge base.
📚 Basic RAG
# 1) Retrieve
chunks = retriever.search("compare vector DBs for legal discovery", k=5)
# 2) Build prompt
context = "\n".join([f"• {c.text} [{c.source}]" for c in chunks])
prompt = f"""
You are a helpful assistant. Use only the context to answer.
Context:
{context}
Question:
What selection criteria should teams use when picking a vector DB for scale and cost?
"""
# 3) Generate (Transformers / vLLM / TGI)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
📁 1. Document Ingestion
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("docs/corpus.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=150)
docs = splitter.split_documents(documents)
🔍 2. Embedding & Vector Indexing
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
🔄 3. Retrieval + Prompt Formatting
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
retrieved_docs = retriever.get_relevant_documents("What role does Bubba play in improving document QA?")
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Bubba, a reasoning-heavy assistant. Use only the context below to answer:
<context>
{context}
</context>
<question>
What role does Bubba play in improving document QA?
</question>
"""
🧠 4. LLM Inference with Bubba
./main -m bubba-20b-Q4_K_XL.gguf -p "$prompt" -n 768 -t 16 -c 4096 --color
Bubba’s output will include a context-aware, citation-grounded response backed by the retrieved input.
📝 Notes
- Bubba (20B parameter model) may require more memory than smaller models like Bro or Leeroy.
- Use a higher -c value (context size) to accommodate longer prompts with more chunks.
- GPU acceleration is recommended for smooth generation if your hardware supports it.
⚙️ Parameter Tips
• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (tune one knob at a time)
• Max new tokens: 128–384 for chat; longer for drafting
• Repetition penalty: 1.05–1.2 if loops appear
• Batch size: use padding_side="left" and dynamic padding for throughput
• Context length: set to your runtime’s max; compress context via selective retrieval
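A minimal sketch applying these knobs with the Transformers generate API (reusing the tok, model, and inputs objects from the earlier examples; the values are starting points, not tuned settings):
out = model.generate(
    **inputs,
    max_new_tokens=256,        # 128–384 for chat-style answers
    temperature=0.7,           # drop toward 0.6 for more deterministic output
    top_p=0.9,                 # tune one sampling knob at a time
    repetition_penalty=1.1,    # raise within 1.05–1.2 only if loops appear
    do_sample=True,
)
print(tok.decode(out[0], skip_special_tokens=True))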
🛟 Troubleshooting
• CUDA OOM: Lower max_new_tokens; enable 4-bit; shard across GPUs; reduce context length.
• Slow throughput: Use vLLM/TGI with tensor/pipeline sharding; enable paged attention; pin to BF16.
• Messy JSON: Use a JSON-only system prompt; set temperature ≤0.6; add a JSON schema in the prompt.
• Domain shift: Consider small adapter tuning on your domain data; add retrieval grounding.
🔍 Minimal Batch Inference Example
prompts = [
"List 5 key features of FAISS.",
"Why would I choose pgvector over Milvus?"
]
tok.padding_side = "left"            # left-pad so generation continues from the prompt end
if tok.pad_token is None:
    tok.pad_token = tok.eos_token    # many causal-LM tokenizers ship without a pad token
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=160, temperature=0.7, top_p=0.9)
for i, seq in enumerate(out):
print(f"--- Prompt {i+1} ---")
print(tok.decode(seq, skip_special_tokens=True))
Inference Tips
- Prefer BF16 if available; otherwise FP16. For limited VRAM, try 4-bit (bitsandbytes) to explore.
- Start with max_new_tokens between 128–384 and temperature 0.6–0.9; tune top_p for stability.
- For RAG, constrain prompt length and adopt strict chunking/citation formatting for better grounding.
📘 WebUI
- Place the GGUF in text-generation-webui/models/bubba-20b-Q4_K_XL/
- Launch with the llama.cpp loader (or llama-cpp-python backend)
- Select the model in the UI, adjust context length, GPU layers, and sampling
🧩 KoboldCPP
./koboldcpp \
-m bubba-20b-Q4_K_XL.gguf \
--contextsize 4096 \
--gpulayers 35 \
--usecublas
⚡ LM Studio
- Open LM Studio → Models → Local models → Add local model and select the .gguf.
- In Chat, pick the model, set Context length (≤ base model max), and adjust GPU Layers.
- For API use, enable Local Server and target the exposed endpoint with OpenAI-compatible clients (see the sketch below).
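A hedged example of calling the LM Studio local server with the openai Python client; the port (LM Studio typically defaults to 1234) and the model id are assumptions to replace with what your server actually reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # local servers ignore the key
resp = client.chat.completions.create(
    model="bubba-20b-Q4_K_XL",   # use the model id shown in LM Studio's server tab
    messages=[
        {"role": "system", "content": "You are concise and factual."},
        {"role": "user", "content": "Compare Q4_K_XL and Q5_K_M in two sentences."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(resp.choices[0].message.content)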
❓ Prompting
This build is instruction-tuned (downstream behavior depends on your base). Common prompt patterns work:
Simple instruction
Write a concise summary of the benefits of grouped 4-bit quantization.
ChatML-like
<|system|>
You are a helpful, concise assistant.
<|user|>
Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM.
<|assistant|>
Code task
Task: Write a Python function that computes perplexity given log-likelihoods.
Constraints: Include docstrings and type hints.
Tip: Keep prompts explicit and structured (roles, constraints, examples).
Suggested starting points: temperature 0.2–0.8, top_p 0.8–0.95, repeat_penalty 1.05–1.15.
- No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer (a minimal sketch follows below).
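A minimal sketch of persisting multi-turn state in application code and flattening it into the ChatML-like layout shown above; the role tags follow that example rather than a tokenizer-provided template, so adjust them if your runtime expects something else.
def format_chat(messages):
    """Flatten a list of {'role', 'content'} dicts into a ChatML-like prompt string."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|assistant|>\n"

history = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
    {"role": "user", "content": "Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM."},
]
prompt = format_chat(history)  # feed this string to any of the runtimes above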
Example system style
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
- Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos).
- From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
⚙️ Performance & Memory Guidance (Rules of Thumb)
- RAM/VRAM for Q4_K_XL (20B): ~12–16 GB for batch-1 inference (varies by backend and offloading).
- Throughput: Highly dependent on CPU/GPU, backend, context length, and GPU offload. Start with -ngl as high as your VRAM allows, then tune threads/batch sizes (see the example below).
- Context window: Do not exceed the base model’s maximum (quantization does not increase it).
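A hedged llama.cpp invocation illustrating these knobs; the binary name is ./main here to match the earlier example (newer builds ship it as llama-cli), and the layer/thread counts are placeholders for your hardware.
# -ngl: GPU-offloaded layers (raise until VRAM is nearly full, lower on OOM)
# -t:   CPU threads for the non-offloaded portion
# -c:   context size (stay within the base model's maximum)
./main -m bubba-20b-Q4_K_XL.gguf -ngl 40 -t 8 -c 4096 \
  -p "Summarize the purpose of the SF-133 report in two sentences."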
💻 Files
- bubba-20b-Q4_K_XL.gguf - 4-bit K-quantized weights (XL variant)
- tokenizer.* - packed inside GGUF (no separate files needed)
Integrity: Verify your download (e.g., SHA256) if provided by the host/mirror.
⚙️ GGUF Format
- Start from the base gpt-oss-20b weights (FP16/BF16).
- Convert to GGUF with llama.cpp’s convert tooling (or the equivalent for the base architecture).
- Quantize with llama.cpp’s quantize tool to Q4_K_XL.
- Sanity-check perplexity/behavior, package with metadata.
Exact scripts/commits may vary by environment; please share your pipeline for full reproducibility if you fork this card.
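A hedged sketch of that pipeline under current llama.cpp tooling; script and binary names vary by version, and the Q4_K_XL preset may require a build or fork that exposes it, so treat the last argument as a placeholder.
# 1) Convert the FP16/BF16 Hugging Face checkpoint to GGUF
python convert_hf_to_gguf.py /path/to/bubba-finetuned --outfile bubba-20b-f16.gguf --outtype f16

# 2) Quantize (preset name is an assumption; list available types with ./llama-quantize --help)
./llama-quantize bubba-20b-f16.gguf bubba-20b-Q4_K_XL.gguf Q4_K_XL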
🏁 Safety, Bias & Responsible Use
Large language models can generate plausible but incorrect or harmful content and may reflect societal biases. If you deploy this model:
- Add moderation/guardrails and domain-specific filters.
- Provide user disclaimers and feedback channels.
- Keep human-in-the-loop for consequential outputs.
🕒 License and Usage
This model package derives from OpenAI’s gpt-oss-20b, so you’re responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review OpenAI’s license and your organization’s compliance requirements.
- Bubba is published under the MIT License
🧩 Attribution
If this quant helped you, consider citing it as:
bubba-20b-Q4_K_XL.gguf (2025).
Quantized GGUF build derived from OpenAI’s gpt-oss-20b.
Retrieved from the Hugging Face Hub.
❓ FAQ
Does quantization change the context window or tokenizer?
No. Those are inherited from the base model; quantization only changes weight representation.
Why am I hitting out-of-memory?
Lower -ngl (fewer GPU layers), reduce context (-c), or switch to a smaller quant (e.g., Q3_K).
Ensure no other large models occupy VRAM.
Best sampler settings?
Start with temp 0.7, top_p 0.9, repeat_penalty 1.1.
Lower temperature for coding/planning; raise for creative writing.
📝 Changelog
- v1.0 - Initial release of bubba-20b-Q4_K_XL.gguf.