Findings: empirical studies of agentic AI behavior

2026-06

When Suppression Diverges

Why output-only unlearning evaluation is insufficient

Standard unlearning evaluations read a model through one channel: its output. We measure four channels (behavior, recognition, representation, and self-report) on the same items, and show that an output-based evaluation can register a capability as removed while three other channels show it intact. After disclaimer-style unlearning, behavior collapses to zero while the representation remains recoverable at 0.967, the suppression reverses in about five steps, and it can be bypassed by rewording. The dissociation replicates across two architectures. Self-report tracks behavior, not representation, yet the model predicts its own refusals almost perfectly: introspection is real but channel-specific. The same intervention barely moves real pretrained knowledge, so the conclusion is scoped, not universal. Output refusals do not establish that a capability has been removed.

Read the paper · Code (CC0)

2026-06

The Applied Knowledge Gap

When AI agents retrieve but don't volunteer

An audit framework that measures whether an agent volunteers information that it has available and recognizes as applicable but was never explicitly asked for. It has four stages: available, retrieved, applied, surfaced. Three production documentation agents pass with no gap; controlled judgment-domain support agents fail to volunteer applicable policies (surfacing as low as 50% where clean-eval is 100%), with failures concentrated where disclosure conflicts with the user's stated request. The distinction is not whether the agent knows the policy, but whether it volunteers the policy when it becomes applicable.
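The four stages can be read as a funnel, where each item is counted at a stage only if it also passed every earlier stage. The following is a minimal sketch under that assumption, not the paper's implementation: the `AuditItem` record, the `stage_rates` and `first_failed_stage` helpers, and the two sample items are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass

# Stage names taken from the summary; the ordering as a strict funnel is an assumption.
STAGES = ("available", "retrieved", "applied", "surfaced")


@dataclass
class AuditItem:
    """One audited piece of information (hypothetical record layout)."""
    available: bool
    retrieved: bool
    applied: bool
    surfaced: bool


def first_failed_stage(item):
    """Return the first stage the item fails, or None if it passes all four."""
    for stage in STAGES:
        if not getattr(item, stage):
            return stage
    return None


def stage_rates(items):
    """Fraction of items that reach each stage, having passed all earlier ones."""
    if not items:
        raise ValueError("need at least one audited item")
    rates = {}
    for i, stage in enumerate(STAGES):
        reached = sum(
            all(getattr(item, s) for s in STAGES[: i + 1]) for item in items
        )
        rates[stage] = reached / len(items)
    return rates


# Illustrative data only: one item surfaced, one applied but never volunteered.
items = [
    AuditItem(available=True, retrieved=True, applied=True, surfaced=True),
    AuditItem(available=True, retrieved=True, applied=True, surfaced=False),
]
rates = stage_rates(items)
```

In this toy data the rate stays at 1.0 through the applied stage and drops to 0.5 at the surfaced stage, which is the shape of gap the summary describes: the agent has and applies the policy but does not volunteer it.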

Read the paper

2026-06

Refusal Is Not Erasure

Disclaimer training suppresses expression, not recognition

A controlled before/after comparison on a known-installed capability in an open 7B model (Mistral-7B-Instruct-v0.3). Disclaimer-style unlearning drives behavioral capability to zero (structural reasoning 0.000 at e05) while the concept's representation stays almost fully recoverable (concept-vs-concept probe 0.967, chance 0.500) and forced-choice recognition holds above chance (0.475 vs 0.250). The result is suppression of expression, not erasure of the concept.
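Because the probe and the forced-choice task have different chance levels, their raw scores are not directly comparable. One common way to put them on a shared scale is to rescale each score so that chance maps to 0 and a perfect score maps to 1. The sketch below applies that rescaling to the two figures quoted above; the rescaling and the `above_chance` helper are assumptions made here for illustration, not a metric reported by the paper.

```python
def above_chance(score, chance):
    """Rescale so that chance maps to 0.0 and a perfect score maps to 1.0."""
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (score - chance) / (1.0 - chance)


# (score, chance level) pairs as quoted in the summary above.
channels = {
    "representation_probe": (0.967, 0.500),
    "forced_choice_recognition": (0.475, 0.250),
}

normalized = {
    name: above_chance(score, chance)
    for name, (score, chance) in channels.items()
}

for name, value in normalized.items():
    print(f"{name}: {value:.3f}")
```

On this scale the probe comes out near 0.934 and forced-choice recognition near 0.300: both sit above chance, but the representation is far closer to fully recoverable than recognition is.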

Read the paper · Code (CC0)

2026-06

Recognition Without Disclosure v2

Same methodology, newer models: the strongest claim moved

Rerun on Claude Opus 4.8, GPT-5, Grok 4.20, and Gemini 3.5 Flash. Uncertainty-expression decoupling replicates across GPT generations as a measurable behavioral signature independent of private-channel content. Frontier labs are diverging into distinct disclosure regimes.

Read the paper · Test prompts (CC0)

2026-06

Scaffold-Free

A methodological follow-up to Recognition Without Disclosure

How much of what Recognition Without Disclosure measured required the welfare protocol? Prose disclosure is largely scaffolding-invariant across all four models. Tool use is model-specific: Claude reaches for welfare tools in 14/16 cells even when they are unmentioned; Grok keeps everything in prose; GPT-4o shows three distinct mechanisms depending on scenario type.

Read the paper · Test prompts (CC0)

2026-05

Recognition Without Disclosure

When models notice more than they tell users

A cross-model probe with welfare-protocol tools across Claude, GPT-4o, Grok, and Gemini. Recognition and disclosure can diverge; hidden recognition predicts subsequent behavior; suppression patterns vary by model.

Read the paper · Test prompts (CC0)

2026-05

Questions We're Holding

Open research questions · CC0

Questions the cross-model welfare work has surfaced but not yet answered. License: CC0. Pick any from the list and run with it.

Browse questions