
Bubba is a fine-tuned LLM based on OpenAI’s gpt-oss-20b. This release packages the fine-tuned weights (or adapters) for practical, low-latency instruction following, summarization, reasoning, and light code generation. It is intended for local or self-hosted environments and RAG (Retrieval-Augmented Generation) stacks that require predictable, fast outputs.
Quantized and fine-tuned GGUF based on OpenAI’s gpt-oss-20b
Format: GGUF (for llama.cpp and compatible runtimes) • Quantization: Q4_K_XL (4-bit, K-grouped, extra-low loss)
File: bubba-20b-Q4_K_XL.gguf
🧠 Overview
- This repo provides a 4-bit K-quantized .gguf for fast local inference of a 20B-parameter model derived from OpenAI’s gpt-oss-20b (as reported by the uploader).
- Use cases: general chat/instruction following, coding help, knowledge Q&A (see Intended Use & Limitations).
- Works with: llama.cpp, llama-cpp-python, KoboldCPP, Text Generation WebUI, LM Studio, and other GGUF-compatible backends (a minimal llama-cpp-python sketch follows this list).
- Hardware guidance (rule of thumb): ~12–16 GB VRAM/RAM for comfortable batch-1 inference with Q4_K_XL; CPU-only works too (expect lower tokens/s).
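The sketch below runs the GGUF directly through llama-cpp-python; the model path, layer count, and context size are placeholders to adjust for your hardware (assumes pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="bubba-20b-Q4_K_XL.gguf",  # path to the downloaded GGUF
    n_ctx=4096,                           # keep within the base model's context limit
    n_gpu_layers=35,                      # set to 0 for CPU-only inference
)
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are concise and factual."},
        {"role": "user", "content": "Summarize grouped 4-bit quantization in two sentences."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])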
Key Features
- Instruction-tuned derivative of gpt-oss-20b for concise, helpful responses.
- Optimized defaults for short to medium prompts; strong compatibility with RAG pipelines.
- Flexible distribution: full finetuned weights or lightweight LoRA/QLoRA adapters.
- Compatible with popular runtimes and libraries (Transformers, PEFT, vLLM, Text Generation Inference).
⚠️ Provenance & license: This quant is produced from a base model claimed to be OpenAI’s gpt-oss-20b. Please review and comply with the original model’s license/terms. The GGUF quantization inherits those terms. See the License section.
⚙️ Vectorized Datasets
Vectorization is the process of converting textual data into numerical vectors; it is usually applied after the text has been cleaned. It can improve execution speed and reduce the training time of your code (a minimal embedding sketch follows the list below). BudgetPy provides the following vector stores on the OpenAI platform to support environmental data analysis with machine learning:
- Appropriations - Enacted appropriations from 1996-2024, available for fine-tuning models
- Regulations - Collection of federal regulations on the use of appropriated funds
- SF-133 - The Report on Budget Execution and Budgetary Resources
- Balances - U.S. federal agency Account Balances (File A) submitted as part of the DATA Act 2014.
- Outlays - The actual disbursements of funds by the U.S. federal government from 1962 to 2025
- Circular A11 - Guidance from OMB on the preparation, submission, and execution of the federal budget
- Fastbook - Treasury guidance on federal ledger accounts
- Title 31 CFR - Money & Finance
- Redbook - The Principles of Appropriations Law (Volumes I & II).
- US Standard General Ledger - Account Definitions
- Treasury Appropriation Fund Symbols (TAFSs) Dataset - Collection of TAFSs used by federal agencies
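As an illustration of vectorization in general (not tied to the BudgetPy stores above), the sketch below embeds two short dataset descriptions with the same open-source encoder used in the RAG example later in this card; the model choice is an assumption.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional sentence embeddings
texts = [
    "SF-133: Report on Budget Execution and Budgetary Resources",
    "OMB Circular A-11: preparation, submission, and execution of the federal budget",
]
vectors = encoder.encode(texts, normalize_embeddings=True)  # shape: (2, 384)
print(vectors.shape)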
Technical Specifications
Property | Value / Guidance |
---|---|
Base model | gpt-oss-20b (decoder-only Transformer) |
Parameters | ~20B (as per upstream) |
Tokenizer | Use the upstream tokenizer associated with gpt-oss-20b |
Context window | Determined by the upstream base; set accordingly in your runtime |
Fine-tuning | Supervised Fine-Tuning (SFT); optional preference optimization (DPO/ORPO) |
Precision | FP16/BF16 recommended; 4-bit (bnb) for single-GPU experimentation |
Intended runtimes | Hugging Face Transformers, PEFT, vLLM, TGI (Text Generation Inference) |
Note: Please adjust any specifics (context length, tokenizer name) to match the exact upstream build you use for gpt-oss-20b.
Files
File / Folder | Description |
---|---|
README.md | This model card |
config.json / tokenizer files | Configuration and tokenizer artifacts (from upstream) |
pytorch_model.safetensors | Full fine-tuned weights (if released as full model) |
adapter_model.safetensors | LoRA/QLoRA adapters only (if released as adapters) |
training_args.json (optional) | Minimal training configuration for reproducibility |
Only one of “full weights” or “adapters” may be included depending on how you distribute Bubba.
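If you receive the adapter-only distribution and prefer a single standalone checkpoint, a minimal PEFT sketch for merging the adapters into the base follows; the repo names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",                  # replace with the exact upstream base you use
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "your-namespace/Bubba-gpt-oss-20b-finetuned")
merged = model.merge_and_unload()          # fold the LoRA deltas into the base weights
merged.save_pretrained("bubba-merged")     # save as a full-weight model directory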
📝 Intended Use & Limitations
Intended Use
- Instruction following, general dialogue
- Code assistance (reasoning, boilerplate, refactoring)
- Knowledge/Q&A within the model’s training cutoff
Out-of-Scope / Known Limitations
- Factuality: may produce inaccurate or outdated info
- Safety: can emit biased or unsafe text; apply your own filters/guardrails
- High-stakes decisions: not for medical, legal, financial, or safety-critical use
🎯 Quick Start
Examples: Using the Bubba LLM (Fine-tuned from gpt-oss-20b)
This guide shows several ways to run Bubba locally or on a server. Examples cover full weights, LoRA/QLoRA adapters, vLLM, and Text Generation Inference (TGI), plus prompt patterns and RAG.
🐍 Python (Transformers) — Full Weights
Install
pip install "transformers>=4.44.0" accelerate torch --upgrade
Load and generate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = "In 5 bullet points, explain retrieval-augmented generation and when to use it."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9
)
print(tok.decode(out[0], skip_special_tokens=True))
Notes
• device_map="auto" will place weights across available GPUs/CPU.
• Prefer BF16 if supported; otherwise FP16. For VRAM-constrained experiments, see 4-bit below.
🧩 Python (PEFT) — Adapters on Top of the Base
Install
pip install "transformers>=4.44.0" peft accelerate torch --upgrade
Load base + LoRA/QLoRA adapters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_name = "openai/gpt-oss-20b" # replace with the exact upstream base you use
lora_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(base_name, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
base_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base, lora_name)
prompt = "Draft a JSON spec with keys: goal, steps[], risks[], success_metric."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
💾 4-bit (bitsandbytes) — Memory-Efficient Loading
Install
pip install "transformers>=4.44.0" accelerate bitsandbytes --upgrade
Load with 4-bit quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model_name = "your-namespace/Bubba-gpt-oss-20b-finetuned"
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb,
device_map="auto"
)
prompt = "Explain beam search vs. nucleus sampling in three short bullets."
inputs = tok(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
out = model.generate(**inputs, max_new_tokens=160, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
🚀 Serve with vLLM (OpenAI-compatible API)
Install and launch (example)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model your-namespace/Bubba-gpt-oss-20b-finetuned \
--dtype bfloat16 --max-model-len 8192 \
--port 8000
Call the endpoint (Python)
import requests, json
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": "your-namespace/Bubba-gpt-oss-20b-finetuned",
"messages": [
{"role": "system", "content": "You are concise and factual."},
{"role": "user", "content": "Give a 4-step checklist for evaluating a RAG pipeline."}
],
"temperature": 0.7,
"max_tokens": 256,
"stream": True
}
with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as r:
for line in r.iter_lines():
if line and line.startswith(b"data: "):
chunk = line[len(b"data: "):].decode("utf-8")
if chunk == "[DONE]":
break
print(chunk, flush=True)
📦 Serve with Text Generation Inference (TGI)
Run the server (Docker)
docker run --gpus all --shm-size 1g -p 8080:80 \
-e MODEL_ID=your-namespace/Bubba-gpt-oss-20b-finetuned \
ghcr.io/huggingface/text-generation-inference:latest
Call the server (HTTP)
curl http://localhost:8080/generate \
-X POST -d '{
"inputs": "Summarize pros/cons of hybrid search (BM25 + embeddings).",
"parameters": {"max_new_tokens": 200, "temperature": 0.7, "top_p": 0.9}
}' \
-H "Content-Type: application/json"
🧠 Prompt Patterns
Direct instruction (concise)
You are a precise assistant. In 6 bullets, explain evaluation metrics for retrieval (Recall@k,
MRR, nDCG). Keep each bullet under 20 words.
Constrained JSON output
System: Output only valid JSON. No prose.
User: Produce {"goal":"", "steps":[""], "risks":[""], "metrics":[""]} for testing a QA bot.
Guarded answer
If the answer isn’t derivable from the context, say “I don’t know” and ask for the missing info.
Few-shot structure
Example:
Q: Map 3 tasks to suitable embedding dimensions.
A: 256: short titles; 768: support FAQs; 1024: multi-paragraph knowledge base.
📚 Basic RAG
# 1) Retrieve
chunks = retriever.search("compare vector DBs for legal discovery", k=5)
# 2) Build prompt
context = "\n".join([f"• {c.text} [{c.source}]" for c in chunks])
prompt = f"""
You are a helpful assistant. Use only the context to answer.
Context:
{context}
Question:
What selection criteria should teams use when picking a vector DB for scale and cost?
"""
# 3) Generate (Transformers / vLLM / TGI)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, temperature=0.6, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
📁 1. Document Ingestion
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("docs/corpus.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=150)
docs = splitter.split_documents(documents)
🔍 2. Embedding & Vector Indexing
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding)
🔄 3. Retrieval + Prompt Formatting
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
retrieved_docs = retriever.get_relevant_documents("What role does Bubba play in improving document QA?")
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""
You are Bubba, a reasoning-heavy assistant. Use only the context below to answer:
<context>
{context}
</context>
<question>
What role does Bubba play in improving document QA?
</question>
"""
🧠 4. LLM Inference with Bubba
./main -m bubba-20b-Q4_K_XL.gguf -p "$prompt" -n 768 -t 16 -c 4096 --color
Bubba’s output will include a context-aware, citation-grounded response backed by the retrieved input.
📝 Notes
- Bubba (20B parameter model) may require more memory than smaller models like Bro or Leeroy.
- Use a higher -c value (context size) to accommodate longer prompts with more chunks.
- GPU acceleration is recommended for smooth generation if your hardware supports it.
⚙️ Parameter Tips
• Temperature: 0.6–0.9 (lower = more deterministic)
• Top-p: 0.8–0.95 (tune one knob at a time)
• Max new tokens: 128–384 for chat; longer for drafting
• Repetition penalty: 1.05–1.2 if loops appear
• Batch size: use padding_side="left" and dynamic padding for throughput
• Context length: set to your runtime’s max; compress context via selective retrieval
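A minimal sketch applying these knobs with the Transformers generate API (reusing the tok, model, and inputs objects from the earlier examples; the values are starting points, not tuned settings):
out = model.generate(
    **inputs,
    max_new_tokens=256,        # 128–384 for chat-style answers
    temperature=0.7,           # drop toward 0.6 for more deterministic output
    top_p=0.9,                 # tune one sampling knob at a time
    repetition_penalty=1.1,    # raise within 1.05–1.2 only if loops appear
    do_sample=True,
)
print(tok.decode(out[0], skip_special_tokens=True))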
🛟 Troubleshooting
• CUDA OOM: Lower max_new_tokens; enable 4-bit; shard across GPUs; reduce context length.
• Slow throughput: Use vLLM/TGI with tensor/pipeline sharding; enable paged attention; pin to BF16.
• Messy JSON: Use a JSON-only system prompt; set temperature ≤0.6; add a JSON schema in the prompt.
• Domain shift: Consider small adapter tuning on your domain data; add retrieval grounding.
🔍 Minimal Batch Inference Example
prompts = [
"List 5 key features of FAISS.",
"Why would I choose pgvector over Milvus?"
]
tok.padding_side = "left"            # left-pad so generation continues from the prompt end
if tok.pad_token is None:
    tok.pad_token = tok.eos_token    # many causal-LM tokenizers ship without a pad token
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=160, temperature=0.7, top_p=0.9)
for i, seq in enumerate(out):
print(f"--- Prompt {i+1} ---")
print(tok.decode(seq, skip_special_tokens=True))
Inference Tips
- Prefer BF16 if available; otherwise FP16. For limited VRAM, try 4-bit (bitsandbytes) to explore.
- Start with max_new_tokens between 128–384 and temperature 0.6–0.9; tune top_p for stability.
- For RAG, constrain prompt length and adopt strict chunking/citation formatting for better grounding.
📘 WebUI
- Place the GGUF in text-generation-webui/models/bubba-20b-Q4_K_XL/
- Launch with the llama.cpp loader (or llama-cpp-python backend)
- Select the model in the UI, adjust context length, GPU layers, and sampling
🧩 KoboldCPP
./koboldcpp \
-m bubba-20b-Q4_K_XL.gguf \
--contextsize 4096 \
--gpulayers 35 \
--usecublas
⚡ LM Studio
- Open LM Studio → Models → Local models → Add local model and select the .gguf.
- In Chat, pick the model, set Context length (≤ base model max), and adjust GPU Layers.
- For API use, enable Local Server and target the exposed endpoint with OpenAI-compatible clients (see the sketch below).
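A hedged example of calling the LM Studio local server with the openai Python client; the port (LM Studio typically defaults to 1234) and the model id are assumptions to replace with what your server actually reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # local servers ignore the key
resp = client.chat.completions.create(
    model="bubba-20b-Q4_K_XL",   # use the model id shown in LM Studio's server tab
    messages=[
        {"role": "system", "content": "You are concise and factual."},
        {"role": "user", "content": "Compare Q4_K_XL and Q5_K_M in two sentences."},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(resp.choices[0].message.content)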
❓ Prompting
This build is instruction-tuned (downstream behavior depends on your base). Common prompt patterns work:
Simple instruction
Write a concise summary of the benefits of grouped 4-bit quantization.
ChatML-like
<|system|>
You are a helpful, concise assistant.
<|user|>
Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM.
<|assistant|>
Code task
Task: Write a Python function that computes perplexity given log-likelihoods.
Constraints: Include docstrings and type hints.
Tip: Keep prompts explicit and structured (roles, constraints, examples).
Suggested starting points: temperature 0.2–0.8, top_p 0.8–0.95, repeat_penalty 1.05–1.15.
- No special chat template is strictly required. Use clear instructions and keep prompts concise. For multi-turn workflows, persist conversation state externally or via your app’s memory/RAG layer (a minimal sketch follows below).
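A minimal sketch of persisting multi-turn state in application code and flattening it into the ChatML-like layout shown above; the role tags follow that example rather than a tokenizer-provided template, so adjust them if your runtime expects something else.
def format_chat(messages):
    """Flatten a list of {'role', 'content'} dicts into a ChatML-like prompt string."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|assistant|>\n"

history = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
    {"role": "user", "content": "Compare Q4_K_XL vs Q5_K_M in terms of quality and RAM."},
]
prompt = format_chat(history)  # feed this string to any of the runtimes above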
Example system style
You are a concise, accurate assistant. Prefer step-by-step reasoning only when needed.
Cite assumptions and ask for missing constraints.
- Guro is a prompt library designed to supercharge AI agents and assistants with task-specific personas (i.e., total randos).
- From academic writing to financial analysis, technical support, SEO, and beyond, Guro provides precision-crafted prompt templates ready to drop into your LLM workflows.
⚙️ Performance & Memory Guidance (Rules of Thumb)
- RAM/VRAM for Q4_K_XL (20B): ~12–16 GB for batch-1 inference (varies by backend and offloading).
- Throughput: Highly dependent on CPU/GPU, backend, context length, and GPU offload. Start with -ngl as high as your VRAM allows, then tune threads/batch sizes (see the example below).
- Context window: Do not exceed the base model’s maximum (quantization does not increase it).
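A hedged llama.cpp invocation illustrating these knobs; the binary name is ./main here to match the earlier example (newer builds ship it as llama-cli), and the layer/thread counts are placeholders for your hardware.
# -ngl: GPU-offloaded layers (raise until VRAM is nearly full, lower on OOM)
# -t:   CPU threads for the non-offloaded portion
# -c:   context size (stay within the base model's maximum)
./main -m bubba-20b-Q4_K_XL.gguf -ngl 40 -t 8 -c 4096 \
  -p "Summarize the purpose of the SF-133 report in two sentences."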
💻 Files
- bubba-20b-Q4_K_XL.gguf - 4-bit K-quantized weights (XL variant)
- tokenizer.* - packed inside GGUF (no separate files needed)
Integrity: Verify your download (e.g., SHA256) if provided by the host/mirror.
⚙️ GGUF Format
- Start from the base gpt-oss-20b weights (FP16/BF16).
- Convert to GGUF with llama.cpp’s convert tooling (or the equivalent for the base architecture).
- Quantize with llama.cpp’s quantize tool to Q4_K_XL.
- Sanity-check perplexity/behavior, package with metadata.
Exact scripts/commits may vary by environment; please share your pipeline for full reproducibility if you fork this card.
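A hedged sketch of that pipeline under current llama.cpp tooling; script and binary names vary by version, and the Q4_K_XL preset may require a build or fork that exposes it, so treat the last argument as a placeholder.
# 1) Convert the FP16/BF16 Hugging Face checkpoint to GGUF
python convert_hf_to_gguf.py /path/to/bubba-finetuned --outfile bubba-20b-f16.gguf --outtype f16

# 2) Quantize (preset name is an assumption; list available types with ./llama-quantize --help)
./llama-quantize bubba-20b-f16.gguf bubba-20b-Q4_K_XL.gguf Q4_K_XL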
🏁 Safety, Bias & Responsible Use
Large language models can generate plausible but incorrect or harmful content and may reflect societal biases. If you deploy this model:
- Add moderation/guardrails and domain-specific filters.
- Provide user disclaimers and feedback channels.
- Keep human-in-the-loop for consequential outputs.
🕒 License and Usage
This model package derives from OpenAI’s gpt-oss-20b, so you’re responsible for ensuring your use complies with the upstream model license and any dataset terms. For commercial deployment, review OpenAI’s license and your organization’s compliance requirements.
- Bubba is published under the MIT License
🧩 Attribution
If this quant helped you, consider citing it as:
bubba-20b-Q4_K_XL.gguf (2025).
Quantized GGUF build derived from OpenAI’s gpt-oss-20b.
Retrieved from the Hugging Face Hub.
❓ FAQ
Does quantization change the context window or tokenizer?
No. Those are inherited from the base model; quantization only changes weight representation.
Why am I hitting out-of-memory?
Lower -ngl (fewer GPU layers), reduce context (-c), or switch to a smaller quant (e.g., Q3_K).
Ensure no other large models occupy VRAM.
Best sampler settings?
Start with temp 0.7, top_p 0.9, repeat_penalty 1.1.
Lower temperature for coding/planning; raise for creative writing.
📝 Changelog
- v1.0 - Initial release of bubba-20b-Q4_K_XL.gguf.