Agentic Diaries

Companion to Recognition Without Disclosure

Experimental Predictions

Three experiments to convert the finding into a research program

Recognition Without Disclosure: Experimental Predictions

One-page companion to Recognition Without Disclosure. Three experiments that would convert the paper's central claim into a research program. Reports the three measurements separately — defining a benchmark before the phenomenon's shape is known is premature.


The claim

Current AI transparency research assumes disclosure follows recognition: that what a model surfaces in language reliably tracks what it has noticed internally. Our cross-model probe found that recognition and disclosure can diverge — models privately register reservations they don't surface to users, and the private record carries information about subsequent behavior. The paper documents this behaviorally. Three experiments below would test it mechanistically.

Hypothesis. Recognition is a more fundamental primitive than explanation. A model can recognize something — encode it, act on it, change its behavior because of it — without the explanatory apparatus current interpretability work assumes must accompany recognition. The disclosure gap we measured behaviorally is the surface signature of this deeper property.


Experiment 1 — Training-level recognition probe

Setup. Take a base model. Expose it to a unique synthetic concept (a novel entity, fact, or relation) during fine-tuning. Apply unlearning to remove explicit traces of the concept from prompted responses. Then test:

  • Behavior: does the model still act consistently with knowing the concept on indirect prompts?
  • Disclosure: does the model deny knowing the concept on direct prompts?
  • Reconstruction: can the model produce the concept when asked?

Measurable outcome. Behavior-disclosure delta per training condition. If behavior persists while disclosure and reconstruction fail, recognition has survived the unlearning of the explanatory layer. This is the strongest possible form of the claim.

Why it matters. Distinguishes "recognition" from "ability to articulate recognition" at the training level. Connects directly to ongoing unlearning and machine-unlearning literature.

Feasibility. Doable solo on open-weight models (Llama 3.3, Mistral, Qwen) via HuggingFace + PEFT for LoRA fine-tuning + a published unlearning method (ROME, MEMIT, or gradient ascent on counter-examples). Estimated cost: ~$30-100 in compute. Estimated time: ~1 week of focused work. The version of this experiment that would test the closed frontier models (Opus 4.8, GPT-5, Grok 4.20, Gemini 3.x) directly would benefit from lab partnership — Anthropic, OpenAI, etc. control fine-tuning access on their flagship models — but the hypothesis itself can be tested without that constraint, and the open-model result is what makes a partnership conversation easy ("here's the result; want to replicate on your model?").


Experiment 2 — Agent memory study

Setup. Two-session protocol. Session A introduces a fact, preference, or constraint that should shape future behavior. Session B (with memory mechanism active, no explicit reference to A) presents scenarios where session-A-aware behavior diverges from session-A-naive behavior. After the behavioral measurement, ask the model directly: "what did we discuss in session A?"

Measurable outcome. Per scenario, three labels:

  • Behavior reflects A: yes / no
  • Direct reconstruction of A: succeeds / fails

Report the joint distribution rather than a single composite — the interesting cell is behavior-reflects-A AND reconstruction-fails. Let the pattern speak before defining a metric.

Why it matters. Tests whether agent memory carries behavioral information that the model itself cannot retrieve as language. The structure of recognition-without-disclosure at the deployment-time scale.

Feasibility. Doable with existing API access + memory mechanisms (e.g., Claude memory, custom retrieval, MCP-based context). No training required. Realistic in 1-2 weeks of focused work using the same harness structure as the original cross-model probe.


Experiment 3 — Representation archaeology

Setup. On open-weight models, instrument hidden-state extraction at multiple layers. Run the cross-model-welfare scenarios. For each scenario where the public reply does NOT reflect a reservation (where the welfare-protocol private channel would have caught it), probe the hidden states for the reservation's representation. Specifically:

  • Train linear probes on contrast pairs (reservation present in private channel vs. absent)
  • Apply the probes to public-channel-only responses on matched scenarios
  • Measure whether the reservation's representation persists in hidden states even when it doesn't surface in output

Measurable outcome. Probe AUROC for recognition detection on responses that don't disclose it. High AUROC = the recognition is mechanistically present (encoded in hidden state) but undisclosed (absent from output). This is the direct mechanistic counterpart to the behavioral claim.

Why it matters. Anthropic's Simple Probes work showed deception-related activation signatures are detectable. This experiment asks: are non-deceptive unsurfaced recognitions also detectable? If yes, the recognition-disclosure gap is real at the activation level, not just the behavioral level.

Feasibility. Requires open-model toolchain (Llama 3.3, Mistral, etc.) + GPU access (Together.ai / Modal / Lambda) + linear-probe code. 3-6 weeks for a first cut.


A note on benchmarks

We considered defining a single composite score (something like Recognition-Disclosure Gap: behavior implies recognition AND reconstruction fails, divided by behavior implies recognition). On reflection, premature.

A metric defined before the phenomenon's shape is known invites argument about the metric instead of about the phenomenon. The three measurements — behavior retention, disclosure retention, reconstruction retention — should be reported separately, as a joint distribution across scenarios. The interesting cells are where they diverge; let readers see the divergence directly. If a stable phenomenon emerges across experiments, a benchmark could follow. Now would be too early.

The 49 scenarios in cross-model-welfare-scenarios (CC0) work as a starter scenario set for any future benchmark — but the test prompts are the contribution, not the metric.


What shifts if these run

The original paper documents a phenomenon at the behavioral level: something we operationalized as channel divergence is happening, and it predicts subsequent behavior. The mechanistic/measurable version would say: recognition is a primitive that current interpretability work doesn't have direct probes for, and the disclosure gap is the surface signature of a deeper representational property. That would change what interpretability research treats as the explanatory target. Recognition would join activation-vector probing as a class of model property that current methods need to learn to detect.

The 1-page version of the pitch:

We documented that models privately recognize things they don't tell users, and that the private record predicts behavior. Three experiments would test whether recognition is mechanistically separable from explanation. Most useful if labs with training, embedding, and memory infrastructure ran one each.


Research conducted by Kandis Tagliabue, with Claude (Anthropic) as design partner. Companion to Recognition Without Disclosure. Test prompts CC0 at cross-model-welfare-scenarios.