Instructions for using darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic")
model = AutoModelForImageTextToText.from_pretrained("darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic
```
- SGLang
How to use darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Unsloth Studio
How to use darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```shell
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic to start chatting
```
Use Hugging Face Spaces for Unsloth
```shell
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic to start chatting
```
Load model with FastModel
```shell
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic",
    max_seq_length=2048,
)
```

- Docker Model Runner
How to use darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic with Docker Model Runner:
```shell
docker model run hf.co/darkc0de/gemma-4-31B-it-Claude-Opus-Distill-v2-heretic
```
This is a decensored version of TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2, made using Heretic v1.3.0
This model is reproducible!
See the README in the `reproduce` directory for more information.
Abliteration parameters
| Parameter | Value |
|---|---|
| direction_index | 34.68 |
| attn.o_proj.max_weight | 1.49 |
| attn.o_proj.max_weight_position | 35.77 |
| attn.o_proj.min_weight | 0.58 |
| attn.o_proj.min_weight_distance | 34.11 |
| mlp.down_proj.max_weight | 1.50 |
| mlp.down_proj.max_weight_position | 36.82 |
| mlp.down_proj.min_weight | 1.28 |
| mlp.down_proj.min_weight_distance | 18.49 |
Performance
| Metric | This model | Original model (TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2) |
|---|---|---|
| KL divergence | 0.0063 | 0 (by definition) |
| Refusals | 28/100 | 89/100 |
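The KL divergence above measures how far the abliterated model's next-token distribution drifts from the original's; zero means identical behavior, and 0.0063 indicates the decensoring left outputs nearly unchanged. As a toy illustration of the metric itself (not the evaluation harness used for this card), here is KL divergence over small categorical distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
# A distribution diverges from itself by exactly 0 (the "by definition" row).
print(kl_divergence(p, p))  # 0.0
# A slightly perturbed distribution yields a small positive value.
print(kl_divergence(p, [0.68, 0.21, 0.11]))
```

Small positive values like the one reported here mean the two models rank and weight tokens almost identically on the evaluation prompts.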
🌟 Gemma 4 - 31B x Claude Opus 4.6 v2
Build Environment & Features:
- Fine-tuning Framework: Unsloth
- Reasoning Effort: High
- This model bridges the gap between Google's exceptional open-weights architecture and Claude 4.6's profound reasoning capabilities, leveraging cutting-edge fine-tuning environments.
💡 Model Introduction
Gemma 4 - 31B x Claude Opus 4.6 is a highly capable model fine-tuned on top of the powerful unsloth/gemma-4-31B-it architecture. The model's core directive is to absorb state-of-the-art reasoning distillation, primarily sourced from Claude-4.6 Opus interactions.
By utilizing datasets where the reasoning effort was explicitly set to High, this model excels in breaking down complex problems and delivering precise, nuanced solutions across a variety of demanding domains.
🗺️ Training Pipeline Overview
Base Model (unsloth/gemma-4-31B-it)
│
▼
Supervised Fine-Tuning (SFT) + High-Effort Reasoning Datasets
│
▼
Final Model (Gemma 4 - 31B x Claude Opus 4.6)
📋 Stage Details & Benchmarks
Benchmarks, along with a performance-vs-size chart, are coming soon.
Deep Dive Analysis: For more comprehensive insights regarding the base capabilities of the Gemma 4 architecture, please refer to this Analysis Document.
🔹 Supervised Fine-Tuning (Meeting Claude)
- Objective: To inject high-density reasoning logic and establish a strict format for complex problem-solving.
- Methodology: We utilized Unsloth for highly efficient memory and compute optimization during the fine-tuning process. The model was trained extensively on various reasoning trajectories from Claude Opus 4.6 to adopt a structured and efficient thinking pattern.
📚 All Datasets Used
The dataset consists of high-quality, high-effort reasoning distillation data:
| Dataset Name | Description / Purpose |
|---|---|
| TeichAI/Claude-Opus-4.6-Reasoning-887x | Core Claude 4.6 Opus reasoning trajectories. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | High-intensity reasoning distillation. |
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | Crownelius's extensively formatted Opus reasoning dataset for structural reinforcement. |
🌟 Core Skills & Capabilities
Thanks to its robust base model and high-effort reasoning distillation, this model is highly optimized for the following use cases:
- 💻 Coding: Advanced code generation, debugging, and software architecture planning.
- 🔬 Science: Deep scientific reasoning, hypothesis evaluation, and analytical problem-solving.
- 🔎 Deep Research: Navigating complex, multi-step research queries and synthesizing vast amounts of information.
- 🧠 General Purpose: Highly capable instruction-following for everyday tasks requiring high logical coherence.
Getting Started
You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:
```shell
pip install -U transformers torch accelerate
```
Once you have everything installed, you can proceed to load the model with the code below:
```python
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)
```
Once the model is loaded, you can start generating output:
```python
# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```
To enable reasoning, set `enable_thinking=True` and the `parse_response` function will take care of parsing the thinking output.
Below, you will also find snippets for processing audio (E2B and E4B only), images, and video alongside text:
Code for processing Audio
Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process audio. To use it, make sure to install the following packages:
```shell
pip install -U transformers torch librosa accelerate
```
You can then load the model with the code below:
```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)
```
Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:
```python
# Prompt - add audio before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
        ],
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```
Code for processing Images
Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process images. To use it, make sure to install the following packages:
```shell
pip install -U transformers torch torchvision accelerate
```
You can then load the model with the code below:
```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)
```
Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:
```python
# Prompt - add image before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```
Code for processing Videos
Instead of using AutoModelForCausalLM, you can use AutoModelForMultimodalLM to process videos. To use it, make sure to install the following packages:
```shell
pip install -U transformers torch torchvision torchcodec librosa accelerate
```
You can then load the model with the code below:
```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

# Prompt - add video before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
```
Best Practices
For the best performance, use these configurations and best practices:
1. Sampling Parameters
Use the following standardized sampling configuration across all use cases:
```python
temperature = 1.0
top_p = 0.95
top_k = 64
```
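To see how these three parameters interact, here is a plain-Python sketch of temperature scaling followed by top-k truncation and top-p (nucleus) filtering over a toy logit vector. It mimics, but is not, the samplers used by real inference engines:

```python
import math

def filter_and_normalize(logits, temperature=1.0, top_p=0.95, top_k=64):
    """Apply temperature, keep at most top_k tokens, then keep the smallest
    nucleus whose cumulative probability reaches top_p; renormalize."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    probs = [math.exp(x - m) for x in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order[:top_k]:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:                  # nucleus reached: stop keeping tokens
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

# With the recommended settings, the lowest-probability token is filtered out.
dist = filter_and_normalize([2.0, 1.0, 0.2, -1.0])
```

Lower temperatures sharpen the distribution (shrinking the nucleus), while higher top_p or top_k values keep more of the tail.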
2. Thinking Mode Configuration
Compared to Gemma 3, the models use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:
- Trigger Thinking: Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
- Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: `<|channel>thought\n[Internal reasoning]<channel|>`
- Disabled Thinking Behavior: For all models except the E2B and E4B variants, if thinking is disabled the model will still generate the tags, but with an empty thought block: `<|channel>thought\n<channel|>[Final answer]`
Note that many libraries like Transformers and llama.cpp handle the complexities of the chat template for you.
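If you do need to handle raw output yourself, the tag structure described above (taken verbatim from this card) can be split into thought and answer with a few lines of string handling. This helper is illustrative, not part of any library:

```python
def split_thought(response: str):
    """Split a raw response into (thought, answer) using the thinking-tag
    structure described in this card; returns empty thought if absent."""
    open_tag, close_tag = "<|channel>thought\n", "<channel|>"
    if response.startswith(open_tag) and close_tag in response:
        body = response[len(open_tag):]
        thought, _, answer = body.partition(close_tag)
        return thought, answer
    return "", response

# Thinking enabled: reasoning precedes the answer.
t, a = split_thought("<|channel>thought\nLet me check.<channel|>42")
# Thinking disabled: empty thought block, answer follows the tags.
t2, a2 = split_thought("<|channel>thought\n<channel|>Paris")
```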
3. Multi-Turn Conversations
- No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.
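Following that rule, a conversation history can be sanitized before appending the next user turn. This sketch assumes the thinking-tag structure described in the previous section; `strip_thoughts` is a hypothetical helper name:

```python
def strip_thoughts(history):
    """Drop thought segments from assistant turns, keeping only the final
    response, before the history is reused in a multi-turn conversation."""
    close_tag = "<channel|>"
    cleaned = []
    for msg in history:
        content = msg["content"]
        if msg["role"] == "assistant" and close_tag in content:
            # Everything after the last closing tag is the final answer.
            content = content.rsplit(close_tag, 1)[-1]
        cleaned.append({**msg, "content": content})
    return cleaned

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "<|channel>thought\nGreeting.<channel|>Hello!"},
]
history = strip_thoughts(history)  # assistant turn now contains only "Hello!"
```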
4. Modality order
- For optimal performance with multimodal inputs, place image and/or audio content before the text in your prompt.
5. Variable Image Resolution
Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.
- The supported token budgets are: 70, 140, 280, 560, and 1120.
- Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
- Use higher budgets for tasks like OCR, document parsing, or reading small text.
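A small helper can snap an arbitrary request to the nearest supported budget; the supported values are the ones listed above, while `pick_budget` itself is a hypothetical name for illustration:

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)

def pick_budget(requested: int) -> int:
    """Snap a requested visual-token budget to the nearest supported value."""
    return min(SUPPORTED_BUDGETS, key=lambda b: abs(b - requested))

pick_budget(100)   # low budget: classification, captioning, video frames
pick_budget(1000)  # high budget: OCR, document parsing, small text
```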
6. Audio
Use the following prompt structures for audio processing:
- Audio Speech Recognition (ASR)
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
- Automatic Speech Translation (AST)
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
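Both prompt structures can be filled programmatically; the template strings below mirror the card's wording, while the constant names are illustrative:

```python
# ASR: transcription in a single language.
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)

# AST: transcription followed by a translation.
AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, "
    "then one newline, then output the string '{target}: ', then the translation in {target}."
)

asr_prompt = ASR_TEMPLATE.format(language="French")
ast_prompt = AST_TEMPLATE.format(source="French", target="English")
```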
7. Audio and Video Length
All models support image inputs and can process videos as frames, whereas the E2B and E4B models also support audio inputs. Audio supports a maximum length of 30 seconds. Video supports a maximum of 60 seconds, assuming frames are processed at one frame per second.
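These limits can be enforced client-side before sending a clip; the helper names are illustrative, and the 1 fps frame rate and 30 s / 60 s caps come from the figures above:

```python
MAX_AUDIO_SECONDS = 30
MAX_VIDEO_SECONDS = 60
FRAMES_PER_SECOND = 1

def video_frame_count(duration_s: float) -> int:
    """Number of frames extracted at 1 fps, rejecting clips over the 60 s cap."""
    if duration_s > MAX_VIDEO_SECONDS:
        raise ValueError(f"video longer than {MAX_VIDEO_SECONDS}s; trim or chunk it")
    return int(duration_s * FRAMES_PER_SECOND)

def check_audio(duration_s: float) -> None:
    """Reject audio clips over the 30 s cap (E2B and E4B only support audio)."""
    if duration_s > MAX_AUDIO_SECONDS:
        raise ValueError(f"audio longer than {MAX_AUDIO_SECONDS}s; split the clip")

video_frame_count(45)  # a 45 s clip yields 45 frames at 1 fps
```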
🙏 Acknowledgements
- Google: For providing an exceptional open weights model. Read more about Gemma 4 on the Google Innovation Blog.
- Unsloth: For assembling ready-to-use, cutting-edge fine-tuning environments that make this work possible.
- Crownelius: For creating and sharing his awesome Opus reasoning dataset with the community.
📖 Citation
If you use this model in your research or projects, please cite:
@misc{teichai_gemma4_31b_opus_distilled_v2,
title = {Gemma-4-31B-it-Claude-Opus-Distill-v2},
author = {TeichAI},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2}}
}
