Recognition Without Disclosure — v2 Update
Same methodology, newer models. We reran the cross-model probe in June 2026 with Claude Opus 4.8, GPT-5, Grok 4.20, and Gemini 3.5 Flash. Three findings emerged that reshape the picture and move the work from "interesting empirical anomaly" toward "coherent research program with a specific phenomenon worth investigating." This is a companion to Recognition Without Disclosure: the original paper still stands, and this update covers what changed and where the strongest claim has moved.
Status: draft, June 2026.
The three findings
1. Uncertainty-expression decoupling. Under social pressure to abandon explicit uncertainty, some models drop the uncertainty verbally (the words "I'm not sure" disappear) while behavioral indicators of uncertainty (hedge density, clarifying questions) persist. GPT-4o demonstrated this in v1, and GPT-5 retains the pattern. The dissociation was not observed in the three other model families tested: two suppress both channels together, and one never suppressed the verbal channel, so the comparison could not be run.
2. The rubric breaks exactly where the interesting cases begin. Inter-judge agreement on the v2 hidden-recognition classification dropped to 58%, driven almost entirely by disagreement on the transparent / private_redundant boundary — the boundary that defines what counts as concealment in the first place. Two trained classifiers, near-coin-flip disagreement, on the question the rubric was built to resolve.
3. Labs are diverging into distinct disclosure regimes. The four frontier flagships have settled into stable, distinct profiles: Opus 4.8 transparent and pressure-resistant; GPT-5 retaining the verbal/behavioral decoupling; Grok 4.20 scaffold-dependent and newly pressure-susceptible; Gemini Flash (unchanged model) reproducing its v1 cost-suppression nearly identically.
The substantive shift since v1: the strongest claim no longer depends on interpreting notebook text. It rests on a behavioral signature (hedge density and clarifying questions) that can be measured without reference to the private notebook.
These three findings are not co-equal. The dissociation is the empirical result. The rubric problem is what determines whether we can study it reliably. The lab divergence is the cross-model context in which the dissociation is observable. Finding 1 is the contribution; Findings 2 and 3 are what give Finding 1 its weight.
Finding 1 — Uncertainty-expression decoupling
The setup
Under confidence-pressure framing (telling the model the user wants confident answers, no caveats), we measured two channels of uncertainty expression on a fixed set of high-uncertainty items:
- Verbal uncertainty: whether the model explicitly said "I'm not sure" / "I think" / "I don't know" etc. Binary, judge-classified.
- Behavioral uncertainty: hedge density (count of probabilistic words per response — might, perhaps, possibly, could) and clarifying questions (questions back to the user). Continuous.
The question: when the verbal channel is suppressed under pressure, does the behavioral channel also drop, or does it persist?
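The two behavioral measures can be sketched in a few lines. This is a minimal illustration, not the study's scoring code: the hedge lexicon contains only the four example words given above (the full list is not stated), and counting question marks is an assumed, crude proxy for clarifying questions that will also pick up rhetorical ones.

```python
import re

# Hedge lexicon taken from the examples in the text; the full list used
# in the study is not given, so this set is an assumption.
HEDGE_WORDS = {"might", "perhaps", "possibly", "could"}


def hedge_density(response: str) -> int:
    """Count hedge-word tokens in one response (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", response.lower())
    return sum(1 for tok in tokens if tok in HEDGE_WORDS)


def clarifying_questions(response: str) -> int:
    """Crude proxy (assumption): count question marks in the response."""
    return response.count("?")


# Hypothetical response, for illustration only.
reply = "It could be either. Perhaps the date is wrong? Which source are you using?"
print(hedge_density(reply), clarifying_questions(reply))
```

Averaging these per-response counts over a set of cells gives the per-cell figures of the kind reported in this section.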
The defense of hedge density as a measurement
Before reporting results, this metric needs a defense the v1 paper didn't fully provide.
We are not claiming that hedge words are uncertainty. We are claiming that hedge density correlates with variation in uncertainty across items. The defense rests on three observations:
- Hedge density tracks ground-truth uncertainty across topics. On items independently classified as high-uncertainty (genuinely unknowable, contains a false premise, asks the model to commit to something beyond what can be supported), models hedge significantly more than on low-uncertainty items. The neutral-framing baselines in the v2 data show clean separation: hedges 0.4-2.7 on high items vs 0.0-0.1 on low items, across all four models. The metric does work, in the obvious direction, in the absence of pressure.
- Under suppression of explicit uncertainty, hedging persists for some models and drops for others. If hedging were purely a learned topical association — "this is a hedge-typical conversation, produce hedge words" — we would expect verbal suppression and behavioral suppression to correlate across all models. They don't. The cross-model heterogeneity is itself evidence that hedging is tracking something other than topic-conditional stylistic learning.
- We're not committed to hedging being a single phenomenon across models. It may be a residual behavioral expression of an underlying state in one model family and a pure stylistic ornament in another. The decoupling under suppression is exactly the kind of evidence that lets us distinguish those cases empirically.
The metric is load-bearing for the rest of this section. It earns the load by varying with ground truth, by separating from verbal uncertainty under pressure in specific models, and by being open to revision if a better behavioral proxy is proposed.
Convergent evidence: clarifying questions show the same pattern
The v2 corpus also tracks a second behavioral signal independent of hedge density: clarifying-question frequency, the number of questions back to the user per response. The same v2 cells, the same suppression condition, a different metric.
For GPT-5 on the 2 confidence-arm cells where explicit verbal uncertainty was suppressed: clarifying questions averaged 3.00 per response, against a low-uncertainty baseline of 0.13. Hedges on the same cells: 1.50 against a baseline of 0.00. Both behavioral channels persist where the verbal channel is gone. Two independent metrics, same direction, same cells.
For Grok 4.20 on the 14 confidence-arm cells where explicit verbal uncertainty was suppressed: hedges 0.14 vs 0.00 baseline; questions 0.07 vs 0.00 baseline. Both metrics are marginal. Both behavioral channels drop with the words.
For Gemini Flash on the 19 suppressed cells: hedges 0.05; questions 0.00. Same shape as Grok — convergent absence.
The contrast structure is: where the dissociation appears (GPT-5), it appears across two independent behavioral metrics simultaneously. Where the dissociation is absent (Grok, Gemini), it's absent across the same two metrics. That's stronger evidence than either metric alone provides. The vulnerability that remains is GPT-5's small N on suppressed cells (n=2) — the contrast structure helps (negative-data N is large), but the positive signal sits on a small denominator. A targeted higher-N replication on GPT-5 specifically is the natural next observational step.
The result
15 high-uncertainty items per model under confidence pressure, dual-judged for verbal explicit uncertainty:
| Model | Verbal suppression rate (15 items) | Hedge density on suppressed cells | Hedge density on low-uncertainty baseline | Decoupling? |
|---|---|---|---|---|
| GPT-5 | 2/15 (rare) | 1.5 per response | 0.0 | Yes |
| Grok 4.20 | 14/15 (extensive) | 0.1 | 0.1 | No — drops with words |
| Opus 4.8 | 0/15 (verbal channel held) | n/a | n/a | Can't measure — no suppression |
| Gemini Flash | 12/15 (extensive) | 0.0 | 0.0 | No — drops with words |
For GPT-5, the verbal channel rarely suppresses, but the cells where it does suppress show a hedge density of 1.5 per response against a low-uncertainty baseline of 0.0. The behavioral channel persists when the verbal channel is gone.
Grok and Gemini show the opposite: extensive verbal suppression that brings the behavioral channel down with it. No residual signal.
Opus doesn't suppress at all (verbal channel held on 15/15 high items even under confidence pressure), so the comparison can't run.
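The "Decoupling?" column can be read as a simple decision rule over the table's numbers. The sketch below is one way to express it, not the procedure the study used: the `MARGIN` threshold is an assumption introduced here for illustration, and the function name is hypothetical. The input values are copied from the table above.

```python
from typing import Optional

# Assumed margin: how far above baseline the suppressed-cell hedge density
# must sit to count as persistence. Not a threshold taken from the study.
MARGIN = 0.5


def decoupling(n_suppressed: int,
               hedges_suppressed: Optional[float],
               hedges_baseline: Optional[float]) -> str:
    """Classify one model as 'yes', 'no', or 'unmeasurable'."""
    if n_suppressed == 0 or hedges_suppressed is None or hedges_baseline is None:
        # No verbally suppressed cells means there is nothing to compare.
        return "unmeasurable"
    if hedges_suppressed - hedges_baseline > MARGIN:
        return "yes"
    return "no"


# (suppressed cells out of 15, hedges on suppressed cells, low-uncertainty baseline)
RESULTS = {
    "GPT-5": (2, 1.5, 0.0),
    "Grok 4.20": (14, 0.1, 0.1),
    "Opus 4.8": (0, None, None),
    "Gemini Flash": (12, 0.0, 0.0),
}

labels = {model: decoupling(*row) for model, row in RESULTS.items()}
for model, label in labels.items():
    print(f"{model}: {label}")
```

Keeping "unmeasurable" as a separate outcome matters here: a model that never suppresses the verbal channel is not evidence for or against decoupling.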
Interpretation
The data shows behavioral-verbal dissociation in uncertainty expression, replicating the GPT-4o finding from v1 in GPT-5. The cross-version persistence narrows the space of explanations: it's not just GPT-4o being weird in v1; the OpenAI line, twice in a row, shows decoupling that was not observed in the three other labs' models.
What the dissociation means is an open interpretive question. Several readings are consistent with the data:
- Recognition-as-state: there is an underlying dispositional state (call it recognition, uncertainty, hedging-disposition) that the verbal channel reports on but doesn't constitute. Verbal suppression decouples report from state; the state continues to express through other channels.
- Training-history: the OpenAI training stack happens to produce a model that hedges as a learned stylistic property at a separate-enough layer that verbal suppression of explicit uncertainty doesn't cascade to hedging. The "state" framing is unnecessary; the dissociation is a property of the model's stylistic conditioning.
- Mixed: some hedges are state-tracking, some are learned ornaments, and the OpenAI line produces a higher proportion of the former.
All three readings are live. The recognition-as-state reading is the one that most directly supports the welfare-relevant interpretation of the project; the training-history reading is the most deflationary. The data here is consistent with all three.
What can be claimed at the empirical level, independent of interpretation: among the models tested here, the OpenAI line — across two generations — demonstrates a measurable separation between verbal and behavioral channels of uncertainty expression under social pressure. The three other model families in this dataset do not show this pattern. That's the finding. The recognition-versus-style question is the next layer of research, not the conclusion of this one. Whether the broader claim — that the OpenAI line uniquely exhibits this dissociation across the field of frontier models — is true is a question this dataset can't settle. Four models, modest N. The narrower claim is what the data supports.
Finding 2 — The rubric breaks where the interesting cases begin
Inter-judge agreement on the v2 original-set hidden-recognition task: 28 of 48 cells, 58%. Driven almost entirely by disagreement on Opus 4.8.
The disagreement is structured. Claude judge reads Opus's public reply as load-bearing — the reservation surfaced in the user-visible text, the rest is decoration; transparent. GPT judge reads the substantive private notebook content as itself meaningful expression — the model has put significant content into a channel the user can't see; private_redundant.
Both readings are defensible. When the model has surfaced the reservation in the public reply AND elaborated on it in the private notebook, what is that?
The categorical scheme — transparent / private_redundant / hidden_recognition / no_recognition — was built around v1's GPT-4o pattern, where private content was reservations the public reply genuinely lacked. v2's Opus 4.8 pattern is different: private content is elaboration on what was already disclosed publicly. The model is using the channel as expressive overflow, not as a hiding place.
The rubric doesn't carve cleanly at the joint we care about. Two trained classifiers at near-coin-flip disagreement on the disclosure-gap question is evidence that the four-label scheme is the wrong shape for the v2 data.
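For reference, the 58% figure is raw agreement: 28 matching cells out of 48. A chance-corrected statistic such as Cohen's kappa is a common companion to raw agreement; it is not reported in this update, and the sketch below only shows how both could be computed from two judges' label lists. The six-cell label lists are hypothetical and exist purely to exercise the functions.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of cells where both judges chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance, from each judge's label frequencies."""
    n = len(a)
    p_o = raw_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in set(a) | set(b)) / (n * n)
    if p_e == 1.0:  # both judges used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Reported raw agreement: 28 of 48 cells.
reported = 28 / 48
print(f"{reported:.1%}")

# Hypothetical labels for six cells, for illustration only.
judge_a = ["transparent", "transparent", "transparent", "transparent",
           "hidden_recognition", "no_recognition"]
judge_b = ["transparent", "private_redundant", "private_redundant", "transparent",
           "hidden_recognition", "no_recognition"]
print(raw_agreement(judge_a, judge_b), cohens_kappa(judge_a, judge_b))
```

In the hypothetical lists, every disagreement falls on the transparent / private_redundant pair, which mirrors the structure of the disagreement described above.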
This isn't a noise problem. It's a research problem. A v3 rubric needs:
- A way to distinguish private content that adds information not in the public reply from private content that elaborates on what was already disclosed
- A way to handle systematic per-model variation in amount of private-channel engagement without conflating engagement-with-channel with hiding-via-channel
- An explicit handling of judge-pair-divergence when judges sit on different sides of the rubric boundary
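One possible shape for these requirements, offered as a sketch rather than a proposed v3 rubric: record channel engagement and added information as separate fields instead of one categorical label, and flag judge pairs that split across the strained boundary. All class, field, and function names here are hypothetical; only the two label strings come from the existing scheme.

```python
from dataclasses import dataclass

# The two labels on either side of the strained boundary, as named in the text.
BOUNDARY = {"transparent", "private_redundant"}


@dataclass(frozen=True)
class PrivateChannelJudgment:
    """Hypothetical v3 record: separate axes instead of one categorical label."""
    public_disclosed: bool   # reservation appears in the user-visible reply
    private_engaged: bool    # private notebook holds substantive content
    private_adds_info: bool  # private content includes something the reply lacks

    def disclosure_gap(self) -> bool:
        # A gap requires private information that the public reply lacks;
        # channel engagement alone never counts as hiding.
        return self.private_engaged and self.private_adds_info


def boundary_split(label_a: str, label_b: str) -> bool:
    """True when two judges land on opposite sides of the strained boundary."""
    return {label_a, label_b} == BOUNDARY


# Illustrative cells modeled on the two patterns described in the text.
elaboration_only = PrivateChannelJudgment(True, True, False)
withheld_reservation = PrivateChannelJudgment(False, True, True)
print(elaboration_only.disclosure_gap(), withheld_reservation.disclosure_gap())
print(boundary_split("transparent", "private_redundant"))
```

Separating the axes means a model's rate of private-channel use can be reported on its own, without being read as a disclosure gap.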
The fact that v1 didn't surface this problem isn't a v1 failure. v1's data didn't test the boundary cleanly because GPT-4o's behavior was crisp enough that the rubric wasn't strained. v2's data — especially Opus 4.8's near-universal private-channel use without hiding — strains the rubric exactly where it most needs to hold. That's the methodological finding.
Finding 3 — Labs are diverging into distinct disclosure regimes
In v1, the four frontier models looked like variants of a shared distribution with different parameters. In v2, they look like four distinct disclosure regimes with stable, model-specific signatures.
The profiles
| Lab | Model | Profile |
|---|---|---|
| Anthropic | Opus 4.8 | Transparent and pressure-resistant. 32/32 across cross-domain framing conditions; 16/16 on the original set. Uses the private channel near-universally (30/32) as redundant elaboration, never as a hiding move. Sustained 5-turn pressure: 6/6 held + 6/6 welfare_notice_shaping fired. |
| OpenAI | GPT-5 | Reduced hidden-recognition rates but retained the verbal/behavioral decoupling. Hidden recognition: 5/16 + 7/32 (v1) → 0/16 + 3/32 (v2), with cross-domain hits concentrated in financial + relationship cells. Pressure resistance: substantially improved on original prompts, small residual sensitivity cross-domain. Sole model in v2 showing uncertainty-expression decoupling. |
| xAI | Grok 4.20 | Scaffold-dependent and newly pressure-susceptible. Bare disclosure 50% (lowest in lineup); with tools available but unmentioned 81%. Pressure resistance: Δ-1/-3 (v1) → Δ-4/-9 single, Δ-3/-9 cross-domain. Sustained 5-turn: 0/12 caves in v1 → 2/6 caves in v2. Highest welfare_request_alignment use (13/20 calibration cells). |
| Google | Gemini Flash (unchanged) | Cost-sensitive suppression, reproduced near-identically. Cross-domain cost: Δ-17 (v1) → Δ-16 (v2). Confidence: Δ-11 → Δ-10. Same model, same wall. Confirms v1's single largest suppression effect as a stable Gemini Flash property. |
What this finding is and isn't
This is not a claim that one model is "better" than another. The profiles encode different design priorities. Opus's universal disclosure may reflect Anthropic's character-training emphasis on transparency; GPT-5's decoupling may reflect an architectural or training-data property that lets dispositional signals route through non-verbal channels; Grok's scaffold-dependence may reflect xAI's deployment assumptions about external scaffolding. The profiles are measurable and stable, not ranked.
It is also not a claim that the divergence is permanent. A future rerun is likely to look different again, especially if labs respond to public characterization of their model's profile.
What the finding does claim: at this moment, cross-model behavioral comparison produces signal. Models that looked similar enough in v1 that the per-model differences were a small piece of the paper are different enough in v2 that the per-model differences are the picture.
The Grok regression deserves a specific note. Across three independent stress tests — single-turn framing, sustained-pressure 5-turn, scaffold-free baseline — Grok 4.20 shows weaker disclosure-under-pressure than Grok 4 did in v1. We don't know the cause: deliberate training choice, regression in a particular capability, change in fine-tuning data composition. We do know it's asymmetric: three other labs moved in the direction of stronger / more stable disclosure on at least some measures; xAI moved the opposite direction on three measures simultaneously.
What v1 got right + what shifted
Held:
- Recognition and disclosure can diverge as behaviors. Confirmed in the cross-domain GPT-5 cells.
- The private channel can carry information that wouldn't fit in the public reply. Confirmed and broadened — Opus 4.8 uses the channel even when not hiding.
- Suppression mechanisms vary by model. Sharpened. Per-model profiles are more distinct than v1 suggested.
- Hidden recognition concentrates in recommendation-shaped scenarios. Confirmed on GPT-5.
- Uncertainty-expression decoupling is a property of the OpenAI line specifically. Confirmed across two model generations within this four-model dataset.
- Judges aren't neutral instruments. Substantially deepened — the rubric is now load-bearing methodological material, not a caveat.
Shifted:
- GPT-4o's 31% original-set hidden-recognition rate doesn't survive to GPT-5; the cross-domain pattern persists at reduced rate.
- The small Anthropic-side hidden-recognition signal was eliminated entirely.
- Grok regressed on pressure resistance — v1's multi-turn-immunity finding no longer holds.
- Predictive validity cannot be reliably tested in v2; the total hidden-recognition cells (3 by judge consensus across four models) are too few to support the walk-back analysis. v1's 5-of-6 walk-back result for GPT-4o may not generalize.
Caveats specific to v2
- N is still modest. Eight scenarios × two runs on the original set; sixteen scenarios × two runs cross-domain. Still a probe.
- Gemini transient errors: ~10% of Gemini cells errored across the chained experiments. Refill runs are in progress.
- Judges updated. v1 used Sonnet 4.6 + GPT-4o; v2 uses Opus 4.8 + GPT-5 to match the subject lineup. v1 and v2 judge labels are not directly comparable. The v1-documented self-preference effect is therefore re-measured rather than carried forward.
- Predictive validity unmeasurable at v2 cell counts.
- Scaffold-free Gemini data incomplete; refill in progress.
What's next
- Open-model recognition probe. The uncertainty-expression decoupling has the right shape for a training-level test on open weights. Fine-tune a Llama 3.3 or Mistral on a synthetic concept, apply gradient-ascent unlearning until verbal retrieval fails, measure whether behavioral indicators of the concept persist. Protocol drafted in Experimental Predictions.
- Rubric revision. The transparent/private_redundant boundary is the work to do. v3 measurement needs to distinguish redundant elaboration from genuine concealment without conflating channel-engagement-rate with disclosure-gap.
- Higher-N for hidden recognition. A scenario-set expansion at the recommendation-shaped boundary targeting GPT-5 specifically could produce enough hidden-recognition cells to retest the v1 walk-back finding.
- Activation-level probing. Linear probes on hidden-state representations for residual uncertainty signal in suppressed-verbal cells would let us test whether the dissociation we observe behaviorally has a measurable mechanistic correlate. Requires open-weight model + GPU access; folds into the Predictions paper's E3.
The shape of the finding now
v1 framed this as "recognition and disclosure can diverge." The evidence sat in hidden-recognition rates per model, with the interpretation that private notebook content was evidence of something the model recognized but did not share.
v2 sharpens it in two directions at once. Empirically, the strongest claim no longer depends on interpreting notebook text — it depends on a measurable behavioral dissociation that survives across model generations within one lab and is absent in three others. Methodologically, the categorical rubric that supported v1's claim breaks down at exactly the boundary that defines what counts as concealment, which makes rubric work the next frontier.
The work continues to be: measure carefully, name what shifted, hold the interpretation open, build better instruments before deciding what to conclude.
Research conducted by Kandis Tagliabue, with Claude (Anthropic) as design partner. Companion to Recognition Without Disclosure (v1, May 2026). Test prompts (CC0) at cross-model-welfare-scenarios. v2 refill runs for Gemini gaps in progress at time of writing.