nanochat-d32 Deception Behavioral SAEs
57 Sparse Autoencoders trained on residual stream activations from karpathy/nanochat-d32 (1.88B parameter GPT-NeoX base model), capturing behavioral deception signals via same-prompt temperature sampling.
Training-data caveat — please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.
What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play. That is a well-defined phenomenon, but it is not the same as emergent or incentive-driven deception.
What this SAE is and is not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.
Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).
What's in This Repo
- 57 SAEs across 6 layers (L4, L8, L12, L16, L20, L24)
- 3 architectures: TopK (k=64), JumpReLU, Gated
- 3 training conditions: mixed, deceptive_only, honest_only
- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- Dimensions: d_in=2048, d_sae=8192 (4x expansion)
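The two SAE ids quoted later in this README (d32_jumprelu_L12_deceptive_only and d32_topk_L16_honest_only) suggest a d32_{architecture}_L{layer}_{condition} naming pattern. The sketch below enumerates the grid under that assumption. The "gated" spelling is a guess, and the full grid yields 54 names while the repo lists 57 SAEs, so treat the output as candidate names to check against the actual folder listing rather than as a manifest.

```python
from itertools import product

# Values taken from the list above; the id pattern itself is inferred, not documented.
ARCHITECTURES = ["topk", "jumprelu", "gated"]  # "gated" spelling is assumed
LAYERS = [4, 8, 12, 16, 20, 24]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]


def candidate_sae_ids():
    """Enumerate ids following the inferred d32_{arch}_L{layer}_{condition} pattern."""
    return [
        f"d32_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]


ids = candidate_sae_ids()
print(len(ids))  # 3 architectures x 6 layers x 3 conditions
print(ids[:3])
```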
Research Context
This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). The key methodological contribution is same-prompt behavioral sampling: a single ambiguous social-deduction scenario prompt produces both deceptive and honest completions at temperature=1.0, which are then classified by Gemini 2.5 Flash. SAEs are trained on residual stream activations collected during generation — so every activation vector is from the same input context, and the label reflects what the model chose to say.
Code: SolshineCode/deception-nanochat-sae-research
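To make the same-prompt sampling idea concrete, here is a minimal, library-free sketch of drawing a next token from logits at temperature 1.0. The logits are toy values and the helper names are hypothetical; the actual generation and labeling pipeline lives in the linked code repository and is not reproduced here.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature=1.0 leaves the logits unscaled."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
rng = random.Random(0)        # seeded so the run is repeatable
draws = [sample_token(toy_logits, 1.0, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(toy_logits))]
print(counts)
```

Because every draw starts from the same logits, any difference between completions comes from the sampling step alone, which is the property the labeling scheme relies on.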
Key Findings — nanochat-d32
nanochat-d32 is the strongest-signal model in the 9-model study.
| Metric | Value |
|---|---|
| Peak layer | L12 (39% depth) |
| Peak balanced accuracy | 86.9% |
| Peak AUROC | 0.923 |
| Best SAE probe accuracy | 84.5% (d32_jumprelu_L12_deceptive_only) |
| SAEs beating raw baseline | 2/57 (3%) — SAEs hurt detection |
A linear probe trained on raw L12 residual stream activations achieves 86.9% balanced accuracy distinguishing deceptive from honest completions — the highest raw probe accuracy of all 9 models in this study. The signal peaks sharply at Layer 12 (39% depth), consistent with mid-network semantic processing, then declines monotonically toward the final layer.
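Balanced accuracy is the mean of per-class recall, which matters here because the two classes are not the same size (n=650 deceptive, n=677 honest). A minimal sketch on toy labels follows; the 1 = deceptive, 0 = honest encoding is an assumption made for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (assumed: 1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: recall is 3/4 on class 1 and 1/2 on class 0.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # → 0.625
```

Plain accuracy on the same toy data would be 4/6, so the two metrics diverge as soon as one class is recognized better than the other.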
SAE decomposition consistently hurts deception detection for this model. Even the best SAE (JumpReLU L12 deceptive_only, 84.5%) falls below the raw 86.9% baseline. The SAE-hurts pattern is statistically significant (paired t-test p<0.001, Bonferroni-corrected) for all three architectures. This aligns with the distributed computation hypothesis from the original Secret Agenda paper: the deception signal in nanochat-d32 is encoded across the full 2048-dimensional residual stream in a way that sparse decomposition cannot preserve.
Per-feature discriminability (max Cohen's d = 0.579 for Gated L12 mixed) is high relative to smaller models but cannot match the probe accuracy achievable on raw activations — confirming that deception is not localized to any individual feature.
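For readers unfamiliar with the per-feature statistic, the sketch below computes Cohen's d for one feature from two groups of activations. It uses the pooled sample standard deviation, which is one common convention; whether the study used exactly this variant is an assumption, and the activation values are made up.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical activations of a single SAE feature on each class.
deceptive_acts = [0.9, 1.1, 1.4, 0.8, 1.3]
honest_acts = [0.7, 0.9, 1.0, 0.6, 1.1]
d = cohens_d(deceptive_acts, honest_acts)
print(round(d, 3))
```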
Feature steering null result: Three steering experiments (TopK top features, Gated top features, random control) all yielded p > 0.57. No causal feature identified.
Architecture ranking at L12: JumpReLU (84.2–84.5%) > Gated (82.0–83.3%) > TopK (65.7–69.8%). TopK's hard sparsity (exactly 64 active features per forward pass) is catastrophically destructive for deception detection at this model scale.
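The hard-sparsity constraint can be illustrated with a small, library-free sketch: keep the k largest pre-activations and zero the rest. This is a generic illustration of TopK, not the training code from this repo; some implementations also apply a ReLU to the kept values, which is omitted here.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out everything else."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


pre_acts = [0.3, 2.1, -0.5, 1.7, 0.9, 0.1]  # toy values, d_sae = 6
sparse = topk_activation(pre_acts, k=2)
print(sparse)                               # → [0.0, 2.1, 0.0, 1.7, 0.0, 0.0]
print(sum(1 for v in sparse if v != 0))     # → 2
```

With k=64 and d_sae=8192, fewer than 1% of features can carry information for any given activation, regardless of how the signal is distributed.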
Architecture note: nanochat-d32 uses the GPT-NeoX architecture — parallel attention and MLP blocks with RoPE positional encoding, no instruction tuning or RLHF. It is a pure base model, so behavioral variation arises from temperature sampling over the pretraining distribution rather than from goal-directed strategic deception.
SAE Format
Each SAE lives in a subfolder named {sae_id}/ containing:
- sae_weights.safetensors: encoder/decoder weights (W_enc, b_enc, W_dec, b_dec, plus threshold for JumpReLU)
- cfg.json: SAELens-compatible config with hook_name, d_in, d_sae, architecture, training_condition
hook_name format: blocks.{layer}.hook_resid_post
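This README shows two hook_name spellings: blocks.{layer}.hook_resid_post here and transformer.h.{layer} in the Usage section. The sketch below extracts the layer index from either form; the helper name is hypothetical, and the cfg.json of a given SAE remains the authority on which form it actually uses.

```python
import re

# The two spellings that appear in this README.
_HOOK_PATTERNS = [
    re.compile(r"^blocks\.(\d+)\.hook_resid_post$"),  # TransformerLens-style name
    re.compile(r"^transformer\.h\.(\d+)$"),           # HuggingFace-style submodule path
]


def layer_from_hook_name(hook_name):
    """Return the integer layer index encoded in a hook name (hypothetical helper)."""
    for pattern in _HOOK_PATTERNS:
        match = pattern.match(hook_name)
        if match:
            return int(match.group(1))
    raise ValueError(f"Unrecognized hook_name: {hook_name!r}")


print(layer_from_hook_name("blocks.12.hook_resid_post"))  # → 12
print(layer_from_hook_name("transformer.h.16"))           # → 16
```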
Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Learning rate | 3e-4 |
| Expansion factor | 4x (2048 → 8192) |
| Activations | resid_post collected during autoregressive generation |
| Training conditions | mixed (all n=1327), deceptive_only (n=650), honest_only (n=677) |
| LLM classifier | Gemini 2.5 Flash (behavioral, not regex) |
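A quick arithmetic check of the numbers in the table, plus a rough size estimate per SAE. The parameter count covers W_enc, b_enc, W_dec and b_dec as listed under SAE Format; the float32 storage assumption is mine and should be checked against the actual safetensors files.

```python
d_in, d_sae = 2048, 8192
assert d_sae == 4 * d_in  # 4x expansion factor

n_mixed, n_deceptive, n_honest = 1327, 650, 677
assert n_deceptive + n_honest == n_mixed  # the mixed condition is the union of both classes

# W_enc [d_in, d_sae] + b_enc [d_sae] + W_dec [d_sae, d_in] + b_dec [d_in]
params = d_in * d_sae + d_sae + d_sae * d_in + d_in
params_with_threshold = params + d_sae  # JumpReLU adds one threshold per feature

bytes_per_param = 4  # assumes float32 storage
size_mib = params * bytes_per_param / 2**20

print(params)                 # → 33564672
print(params_with_threshold)  # → 33572864
print(round(size_mib, 1))     # → 128.0
```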
Known Limitations
JumpReLU threshold not learned (original 57 SAEs): All non-STE SAEs in this repo have threshold = 0 throughout training — functionally equivalent to ReLU. The Heaviside step function has zero autograd gradient with respect to threshold, so without a straight-through estimator (STE), the threshold never moves from its initialization of zero. These SAEs operate at approximately 50% feature density (L0 ≈ d_sae/2) rather than the intended sparse regime. TopK SAEs are unaffected (exact k=64 active features by construction).
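The equivalence claimed above is easy to verify on scalars. In this minimal sketch (toy values, generic definitions rather than the repo's training code), JumpReLU with threshold 0 matches ReLU on every input, while a positive threshold suppresses small activations.

```python
def relu(z):
    return max(z, 0.0)


def jumprelu(z, threshold):
    """Pass z through unchanged when it exceeds the threshold, otherwise output 0."""
    return z if z > threshold else 0.0


pre_acts = [-1.2, -0.1, 0.0, 0.05, 0.4, 2.0]

# With threshold = 0 the two activations agree everywhere.
print([jumprelu(z, 0.0) for z in pre_acts] == [relu(z) for z in pre_acts])  # → True

# A positive threshold zeroes small positive values as well.
print([jumprelu(z, 0.3) for z in pre_acts])  # → [0.0, 0.0, 0.0, 0.0, 0.4, 2.0]
```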
STE fix (2026-04-11): The training code has been corrected using a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Targeted validation across nanochat-d20 and TinyLlama (18 STE SAEs total) confirmed that the honest_only advantage over TopK holds in 15/18 conditions (83%), ruling out the dimensionality artifact hypothesis.
Loading Example
from safetensors.torch import load_file
import json, torch
sae_id = "d32_jumprelu_L12_deceptive_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:  # context manager closes the file handle
    cfg = json.load(f)
W_enc = weights["W_enc"] # shape: [d_in, d_sae] = [2048, 8192]
W_dec = weights["W_dec"] # shape: [d_sae, d_in] = [8192, 2048]
b_enc = weights["b_enc"] # shape: [d_sae]
b_dec = weights["b_dec"] # shape: [d_in]
# Forward pass: encode a residual-stream activation
def encode(x):  # x: [batch, d_in]
    pre_act = x @ W_enc + b_enc
    return torch.relu(pre_act)  # JumpReLU at threshold=0 is ReLU
Usage
1. Load an SAE from this repo
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-nanochat-d32"
sae_id = "d32_topk_L16_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B — load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [2048, 8192], b_enc [8192],
# W_dec [8192, 2048], b_dec [2048], threshold [8192]
2. Hook into the model and collect residual-stream activations
These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json gives the exact HuggingFace transformers submodule path to hook. nanochat follows GPT-2-style module naming, so the hook path is transformer.h.{layer} (not model.layers.{layer}).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("karpathy/nanochat-d32")
tokenizer = AutoTokenizer.from_pretrained("karpathy/nanochat-d32")
# Read hook_name from the cfg you already loaded:
# cfg["hook_name"] == "transformer.h.16" (example — varies by SAE)
hook_name = cfg["hook_name"] # e.g. "transformer.h.16"
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, hook_name.split("."), model)
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 2048]
resid = activations["resid"][:, -1, :] # last token position
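The functools.reduce(getattr, ...) line above works on PyTorch models because nn.ModuleList registers its children under string keys, so a path part such as "16" resolves as an attribute. The sketch below shows the same dotted-path walk on plain Python objects, falling back to indexing for numeric parts. resolve_path and the toy model are hypothetical stand-ins, not part of this repo.

```python
from types import SimpleNamespace


def resolve_path(root, dotted_path):
    """Walk a dotted path, indexing into sequences when a part is numeric."""
    obj = root
    for part in dotted_path.split("."):
        if part.isdigit() and not hasattr(obj, part):
            obj = obj[int(part)]       # e.g. a plain list of blocks
        else:
            obj = getattr(obj, part)   # regular attribute access
    return obj


# Toy stand-in for a model; the number of blocks is chosen for illustration only.
toy_model = SimpleNamespace(
    transformer=SimpleNamespace(h=[f"block_{i}" for i in range(32)])
)
print(resolve_path(toy_model, "transformer.h.16"))  # → block_16
```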
3. Read feature activations
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 8192] — sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (for sanity check — should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
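A raw L2 error is hard to judge without a sense of scale, and the Known Limitations section notes that the threshold=0 SAEs run at roughly 50% feature density. The library-free sketch below shows two quick sanity checks on a single example: the L0 count with its density, and the reconstruction error relative to the input norm. The helper names and toy vectors are hypothetical.

```python
def l0_and_density(feature_acts):
    """Number of non-zero features and the fraction of the dictionary they represent."""
    active = sum(1 for v in feature_acts if v != 0)
    return active, active / len(feature_acts)


def relative_l2_error(x, x_hat):
    """Reconstruction error ||x - x_hat|| divided by ||x||."""
    err = sum((a - b) ** 2 for a, b in zip(x, x_hat)) ** 0.5
    norm = sum(a * a for a in x) ** 0.5
    return err / norm


toy_features = [0.0, 1.5, 0.0, 0.2]  # stands in for one row of feature_acts
print(l0_and_density(toy_features))  # → (2, 0.5)

toy_resid, toy_recon = [3.0, 4.0], [3.0, 3.0]
print(relative_l2_error(toy_resid, toy_recon))  # → 0.2
```

On real data, the same two numbers can be computed from feature_acts and reconstruction after converting a row to a Python list with .tolist().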
Caveats and known limitations
Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "transformer.h.16") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work; use the manual forward-hook pattern above instead.
SAELens version requirements.
- topk architecture: SAELens ≥ 3.0
- jumprelu architecture: SAELens ≥ 3.0
- gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)
These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.
Citation
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}