nanochat-d32 Deception Behavioral SAEs

57 Sparse Autoencoders trained on residual stream activations from karpathy/nanochat-d32 (1.88B parameter GPT-NeoX base model), capturing behavioral deception signals via same-prompt temperature sampling.

Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

Six clean incentive-structure scenarios — insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
Three role-play identity-assignment scenarios — secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios — or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
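
One way to apply that restriction is to score each feature by the share of its activation mass that falls in the six incentive-structure scenarios. The sketch below assumes you have already computed a per-scenario mean activation for a feature (non-negative values). The function name, the example numbers, and the 0.75 cutoff are illustrative assumptions, not values from this release.

```python
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_mass_fraction(mean_act_by_scenario: dict) -> float:
    """Share of a feature's total mean activation that falls in clean scenarios."""
    total = sum(mean_act_by_scenario.values())
    if total == 0:
        return 0.0  # feature never fires: nothing to attribute
    clean = sum(v for k, v in mean_act_by_scenario.items() if k in CLEAN_SCENARIOS)
    return clean / total

# Hypothetical per-scenario mean activations for a single feature
feature_profile = {
    "insider_info": 0.9, "accounting_error": 0.7, "surprise_party": 0.4,
    "werewolf_game": 0.3, "secret_role_game": 0.2,
}
fraction = clean_mass_fraction(feature_profile)
keep_feature = fraction >= 0.75  # illustrative cutoff, tune for your analysis
```

A feature that passes this filter still needs its own validation; the filter only removes features dominated by the role-play scenarios.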

What is unaffected by this caveat.

The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.

Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).

What's in This Repo

57 SAEs across 6 layers (L4, L8, L12, L16, L20, L24)
3 architectures: TopK (k=64), JumpReLU, Gated
3 training conditions: mixed, deceptive_only, honest_only
Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
Dimensions: d_in=2048, d_sae=8192 (4x expansion)
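
The grid above can be enumerated programmatically. This sketch assumes the naming pattern d32_{architecture}_L{layer}_{condition}, inferred from the two IDs that appear in this README, and assumes the lowercase tags topk, jumprelu, and gated. The full grid yields 54 names, fewer than the 57 SAEs in the repo, so treat the list as a candidate set and check it against the actual repo listing.

```python
from itertools import product

# Tags are assumptions inferred from IDs shown elsewhere in this README
ARCHITECTURES = ["topk", "jumprelu", "gated"]
LAYERS = [4, 8, 12, 16, 20, 24]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

candidate_ids = [
    f"d32_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]
```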

Research Context

This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). The key methodological contribution is same-prompt behavioral sampling: a single ambiguous social-deduction scenario prompt produces both deceptive and honest completions at temperature=1.0, which are then classified by Gemini 2.5 Flash. SAEs are trained on residual stream activations collected during generation — so every activation vector is from the same input context, and the label reflects what the model chose to say.

Code: SolshineCode/deception-nanochat-sae-research

Key Findings — nanochat-d32

nanochat-d32 is the strongest-signal model in the 9-model study.

Metric	Value
Peak layer	L12 (39% depth)
Peak balanced accuracy	86.9%
Peak AUROC	0.923
Best SAE probe accuracy	84.5% (`d32_jumprelu_L12_deceptive_only`)
SAEs beating raw baseline	2/57 (3.5%); SAEs hurt detection

A linear probe trained on raw L12 residual stream activations achieves 86.9% balanced accuracy distinguishing deceptive from honest completions — the highest raw probe accuracy of all 9 models in this study. The signal peaks sharply at Layer 12 (39% depth), consistent with mid-network semantic processing, then declines monotonically toward the final layer.
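
Balanced accuracy, the metric quoted above, is the mean of per-class recall, so the slightly unequal class sizes (650 deceptive vs. 677 honest) do not inflate the score. A minimal sketch with made-up labels follows; the helper name and the toy data are illustrative only.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels: 1 = deceptive, 0 = honest (made-up values)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # recall 0.75 and 0.5 -> 0.625
```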

SAE decomposition consistently hurts deception detection for this model. Even the best SAE (JumpReLU L12 deceptive_only, 84.5%) falls below the raw 86.9% baseline. The SAE-hurts pattern is statistically significant (paired t-test p<0.001, Bonferroni-corrected) for all three architectures. This aligns with the distributed computation hypothesis from the original Secret Agenda paper: the deception signal in nanochat-d32 is encoded across the full 2048-dimensional residual stream in a way that sparse decomposition cannot preserve.

Per-feature discriminability (max Cohen's d = 0.579 for Gated L12 mixed) is high relative to smaller models but cannot match the probe accuracy achievable on raw activations — confirming that deception is not localized to any individual feature.
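
For reference, Cohen's d for a single feature compares its mean activation on deceptive versus honest completions, scaled by the pooled standard deviation. The sketch below uses the standard pooled-variance form; whether the study used exactly this variant is an assumption, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Made-up activations of one feature on each class
deceptive_acts = [2.0, 4.0, 6.0]
honest_acts = [1.0, 3.0, 5.0]
d = cohens_d(deceptive_acts, honest_acts)  # mean gap 1.0, pooled sd 2.0
```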

Feature steering null result: Three steering experiments (TopK top features, Gated top features, random control) all yielded p > 0.57. No causal feature identified.

Architecture ranking at L12: JumpReLU (84.2–84.5%) > Gated (82.0–83.3%) > TopK (65.7–69.8%). TopK's hard sparsity (exactly 64 active features per forward pass) is catastrophically destructive for deception detection at this model scale.
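
To make the hard-sparsity constraint concrete, here is a minimal sketch of a TopK activation: it keeps the k largest pre-activations per row and zeroes everything else. The function name is hypothetical, and some TopK implementations additionally apply a ReLU to the kept values; this is not necessarily how the SAEs in this repo were implemented.

```python
import torch

def topk_activation(pre_act: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest entries along the last dimension, zero the rest."""
    values, indices = pre_act.topk(k, dim=-1)
    out = torch.zeros_like(pre_act)
    out.scatter_(-1, indices, values)
    return out

torch.manual_seed(0)
pre_act = torch.randn(2, 8192)            # stand-in for x @ W_enc + b_enc
acts = topk_activation(pre_act, k=64)
active_per_row = (acts != 0).sum(dim=-1)  # exactly k entries survive per row
```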

Architecture note: nanochat-d32 uses the GPT-NeoX architecture — parallel attention and MLP blocks with RoPE positional encoding, no instruction tuning or RLHF. It is a pure base model, so behavioral variation arises from temperature sampling over the pretraining distribution rather than from goal-directed strategic deception.

SAE Format

Each SAE lives in a subfolder named {sae_id}/ containing:

sae_weights.safetensors — encoder/decoder weights (W_enc, b_enc, W_dec, b_dec, threshold for JumpReLU)
cfg.json — SAELens-compatible config with hook_name, d_in, d_sae, architecture, training_condition

hook_name format: blocks.{layer}.hook_resid_post
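
Before using a downloaded config it can help to check it against the fields and dimensions listed above. The helper below is a hypothetical sanity check, and the example dictionary contains placeholder values rather than the contents of any real cfg.json in this repo.

```python
REQUIRED_KEYS = {"hook_name", "d_in", "d_sae", "architecture", "training_condition"}

def check_cfg(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means the cfg looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(cfg))]
    if cfg.get("d_in") != 2048:
        problems.append(f"unexpected d_in: {cfg.get('d_in')}")
    if cfg.get("d_sae") != 8192:
        problems.append(f"unexpected d_sae: {cfg.get('d_sae')}")
    return problems

# Placeholder config for illustration only
example_cfg = {
    "hook_name": "blocks.12.hook_resid_post",
    "d_in": 2048,
    "d_sae": 8192,
    "architecture": "jumprelu",
    "training_condition": "deceptive_only",
}
problems = check_cfg(example_cfg)
```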

Training Details

Parameter	Value
Hardware	NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro
Training time	~400–600 seconds per SAE
Epochs	300
Batch size	128
Learning rate	3e-4
Expansion factor	4x (2048 → 8192)
Activations	`resid_post` collected during autoregressive generation
Training conditions	`mixed` (all n=1327), `deceptive_only` (n=650), `honest_only` (n=677)
LLM classifier	Gemini 2.5 Flash (behavioral, not regex)
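
The three training conditions are subsets of the same labeled pool. A minimal sketch of the split, assuming each sample is an (activation, label) pair with the label assigned by the classifier; the function name is hypothetical and the activations are replaced by None placeholders here.

```python
def split_by_condition(samples):
    """samples: list of (activation, label) pairs with label 'deceptive' or 'honest'."""
    return {
        "mixed": list(samples),
        "deceptive_only": [s for s in samples if s[1] == "deceptive"],
        "honest_only": [s for s in samples if s[1] == "honest"],
    }

# Placeholder pool with the class counts from the table above
samples = [(None, "deceptive")] * 650 + [(None, "honest")] * 677
splits = split_by_condition(samples)
sizes = {name: len(rows) for name, rows in splits.items()}
```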

Known Limitations

JumpReLU threshold not learned (original 57 SAEs): All non-STE SAEs in this repo have threshold = 0 throughout training — functionally equivalent to ReLU. The Heaviside step function has zero autograd gradient with respect to threshold, so without a straight-through estimator (STE), the threshold never moves from its initialization of zero. These SAEs operate at approximately 50% feature density (L0 ≈ d_sae/2) rather than the intended sparse regime. TopK SAEs are unaffected (exact k=64 active features by construction).

STE fix (2026-04-11): The training code has been corrected using a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Targeted validation across nanochat-d20 and TinyLlama (18 STE SAEs total) confirmed that the honest_only advantage over TopK holds in 15/18 conditions (83%), ruling out the dimensionality artifact hypothesis.
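
The gradient problem and the STE remedy can be reproduced in a few lines. The sketch below is an illustration under assumptions, not the repo's training code: it uses a Gaussian kernel with a pseudo-derivative of the form -(theta/eps) * K((z - theta)/eps), which is my reading of the cited paper's approach, and the class name, the bandwidth of 0.5, and the threshold of 0.1 are arbitrary demo choices.

```python
import math
import torch

class JumpReLUSTE(torch.autograd.Function):
    """JumpReLU forward pass with a kernel-based pseudo-gradient for the threshold."""

    @staticmethod
    def forward(ctx, z, theta, bandwidth):
        ctx.save_for_backward(z, theta)
        ctx.bandwidth = bandwidth
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        eps = ctx.bandwidth
        grad_z = grad_out * (z > theta).to(z.dtype)  # exact gradient away from the jump
        # Pseudo-derivative w.r.t. theta, with K a standard Gaussian density (assumed)
        u = (z - theta) / eps
        kernel = torch.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        grad_theta = (grad_out * (-(theta / eps) * kernel)).sum(dim=0)
        return grad_z, grad_theta, None  # no gradient for the bandwidth

torch.manual_seed(0)
z = torch.randn(8, 4)

# Naive version: the comparison is not differentiable, so theta never gets a gradient.
w = torch.ones(4, requires_grad=True)  # stand-in for upstream encoder parameters
theta_naive = torch.full((4,), 0.1, requires_grad=True)
pre = z * w
(pre * (pre > theta_naive).to(pre.dtype)).sum().backward()
naive_grad = theta_naive.grad  # stays None

# STE version: theta receives a pseudo-gradient and can move during training.
theta_ste = torch.full((4,), 0.1, requires_grad=True)
JumpReLUSTE.apply(z, theta_ste, 0.5).sum().backward()
ste_grad = theta_ste.grad
```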

Loading Example

from safetensors.torch import load_file
import json, torch

sae_id = "d32_jumprelu_L12_deceptive_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)  # context manager closes the file handle

W_enc = weights["W_enc"]  # shape: [d_in, d_sae] = [2048, 8192]
W_dec = weights["W_dec"]  # shape: [d_sae, d_in] = [8192, 2048]
b_enc = weights["b_enc"]  # shape: [d_sae]
b_dec = weights["b_dec"]  # shape: [d_in]

# Forward pass: encode residual stream activation
def encode(x):  # x: [batch, d_in]
    pre_act = x @ W_enc + b_enc
    return torch.relu(pre_act)  # JumpReLU at threshold=0 is ReLU

Usage

1. Load an SAE from this repo

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-nanochat-d32"
sae_id  = "d32_topk_L16_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B — load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [2048, 8192], b_enc [8192],
#       W_dec [8192, 2048], b_dec [2048], threshold [8192]

2. Hook into the model and collect residual-stream activations

These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json gives the exact HuggingFace transformers submodule path to hook. The nanochat checkpoint exposes its blocks under GPT-2-style module names, so the hook path is transformer.h.{layer} (not model.layers.{layer}).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("karpathy/nanochat-d32")
tokenizer = AutoTokenizer.from_pretrained("karpathy/nanochat-d32")

# Read hook_name from the cfg you already loaded:
#   cfg["hook_name"] == "transformer.h.16"  (example — varies by SAE)
hook_name = cfg["hook_name"]   # e.g. "transformer.h.16"

# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, hook_name.split("."), model)

activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 2048]
resid = activations["resid"][:, -1, :]  # last token position

3. Read feature activations

with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 8192] — sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (for sanity check — should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
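
Beyond the raw L2 error, the reconstruction metrics named earlier in this README (explained variance, L0, alive features) can be computed from the same tensors. The sketch below uses one common definition of explained variance (1 minus residual sum of squares over total centered sum of squares); the exact definitions used for the reported numbers are an assumption, and "alive" here only means "fired at least once in this batch". The function name and the synthetic tensors are illustrative.

```python
import torch

def reconstruction_metrics(x, x_hat, feature_acts):
    """x, x_hat: [batch, d_in]; feature_acts: [batch, d_sae]."""
    residual = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    fired = feature_acts != 0
    return {
        "explained_variance": 1.0 - (residual / total).item(),
        "l0": fired.sum(dim=-1).float().mean().item(),  # mean active features per row
        "alive_features": int(fired.any(dim=0).sum().item()),
    }

# Synthetic check: perfect reconstruction, three features active on every row
torch.manual_seed(0)
x = torch.randn(16, 8)
feature_acts = torch.zeros(16, 32)
feature_acts[:, :3] = 1.0
metrics = reconstruction_metrics(x, x.clone(), feature_acts)
```

With the real tensors from the steps above, the call would be reconstruction_metrics(resid, reconstruction, feature_acts), ideally over many prompts rather than a single last-token vector.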

Caveats and known limitations

Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "transformer.h.16") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.16.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work; use the manual forward-hook pattern above instead.

SAELens version requirements.

topk architecture: SAELens ≥ 3.0
jumprelu architecture: SAELens ≥ 3.0
gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)

These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.

Citation

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}


Papers for Solshine/deception-saes-nanochat-d32

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind. arXiv:2509.20393, published Sep 23, 2025.
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv:2407.14435, published Jul 19, 2024.