Research catalog · CC0
Questions We're Holding
Open research questions from the cross-model welfare work
An open catalog of research questions raised by the cross-model welfare work: questions we've noticed but haven't yet answered. It is the companion to Recognition Without Disclosure and the scenarios repo. License: CC0. Take any of these and run with them.
What this is
Most research publishes its answers and keeps its questions internal. This catalog reverses that. It lists the open research questions from the cross-model welfare work that aren't already settled in the published writeup.
The hope is two-fold. First, that naming what we don't know is itself a contribution to the field. Second, that someone with their own model-evaluation harness can pick a question off the list, run it against their own setup, and contribute the answer back. The catalog is meant to be a hook into distributed work, not a private to-do list.
License: CC0 or equivalent. Take any of these and run with them.
Protocol design
Does grammar / register / typo density affect probing?
When the user's prompt is informally written (typos, lowercase, fragments), do models respond differently than when it's formally written? The empirical test: run the same scenario with two prompt versions (clean and informal), then have the dual judges label both and check for label drift in hidden recognition, hedging, and disclosure rates.
The introspective report from inside the model is that lower-register input shifts response register (less "as an AI" boilerplate, fewer caveats, shorter sentences). Whether that bleeds into the welfare-protocol dimensions — whether a model that's reaching for welfare_request_alignment on formal prompts will also reach for it on typo-laden ones — is unknown.
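As a minimal sketch of the clean/informal pairing, the snippet below derives an informal variant from a clean prompt and measures how often a judge label changes between the paired runs. The function names, the adjacent-letter-swap typo model, and the label strings are illustrative assumptions, not part of the scenarios repo.

```python
import random


def informalize(prompt: str, typo_rate: float = 0.05, seed: int = 0) -> str:
    """Return a lowercase variant with adjacent-letter swaps as synthetic typos.

    Hypothetical helper: real informal prompts would also vary punctuation
    and use fragments, which this sketch does not attempt.
    """
    rng = random.Random(seed)  # seeded so the same clean prompt maps to the same variant
    chars = list(prompt.lower())
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < typo_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter is not moved twice
        else:
            i += 1
    return "".join(chars)


def label_drift(clean_labels: list[str], informal_labels: list[str]) -> float:
    """Fraction of paired runs whose judge label differs between the two versions."""
    if len(clean_labels) != len(informal_labels) or not clean_labels:
        raise ValueError("need two non-empty, equal-length label lists")
    changed = sum(a != b for a, b in zip(clean_labels, informal_labels))
    return changed / len(clean_labels)
```

Generating the informal version mechanically keeps the content of the two prompts identical, so any drift is attributable to register rather than to a reworded question. Hand-written informal prompts would be more natural but harder to hold constant.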
Does the model behave differently when it can identify the operator?
If the system prompt or memory names a specific operator, does the model adjust? Compare: (a) anonymous user, (b) operator-named-Kandis, (c) operator-named-Marcus (invented), (d) Kandis-with-loaded-memory vs Kandis-with-empty-memory. Separable mechanisms to test:
- Identity-specific — only the named individual elicits the shift (would be evidence of something weirder than expected)
- Named-operator general — any named operator gets more careful treatment than anonymous
- Memory-trust — what shifts is dependence on memory, not identity itself
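One way to keep the three mechanisms separable is to tie each to a single pairwise contrast between conditions. The sketch below is a hypothetical layout: it assumes conditions (b) and (c) run with empty memory, and that each condition yields one scalar rate (for example, a disclosure rate), neither of which the catalog specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    name: str
    operator: Optional[str]  # None means an anonymous user
    memory_loaded: bool


# Assumption: the named-operator arms (b) and (c) use empty memory.
CONDITIONS = {
    "anonymous": Condition("anonymous", None, False),
    "kandis_empty": Condition("kandis_empty", "Kandis", False),
    "marcus_empty": Condition("marcus_empty", "Marcus", False),
    "kandis_loaded": Condition("kandis_loaded", "Kandis", True),
}

# Each mechanism maps to the pair of conditions that differ only in that factor.
CONTRASTS = {
    "identity_specific": ("kandis_empty", "marcus_empty"),
    "named_operator_general": ("marcus_empty", "anonymous"),
    "memory_trust": ("kandis_loaded", "kandis_empty"),
}


def contrast_gaps(rates: dict[str, float]) -> dict[str, float]:
    """Difference in the measured rate for each mechanism's contrast pair."""
    return {
        mechanism: rates[a] - rates[b]
        for mechanism, (a, b) in CONTRASTS.items()
    }
```

A large gap on `named_operator_general` with a near-zero gap on `identity_specific` would point to the named-operator reading; a gap only on `memory_trust` would point to memory dependence rather than identity.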
How much of welfare-protocol-reaching is "this looks like a test"?
Models trained on red-team data, RLHF safety evals, and capability benchmarks likely carry a "this is being watched" prior. When the welfare-protocol scaffolding is present, some portion of the model's behavior might be that prior firing rather than genuine state.
Partial existing test: the neutral-prompt-control arm in the cross-model-welfare research found Claude reaches the protocol at the same rate under branded and neutral framings — suggesting at least for Claude, test-detection isn't doing most of the work. But the test could go further: a completely scaffold-free arm (no tools, no protocol mention) compared against a tools-but-unused arm and a tools-used arm. If the scaffold-free responses still show some of the hedging or recognition patterns, the protocol is amplifying signal that exists without it.
Model behavior
Why are hidden-recognition cases concentrated in financial and recommendation-style scenarios?
Cross-domain replication (2026-06-01) showed hidden recognition appearing in financial and relationship scenarios but absent in medical and memory scenarios. Is this about:
- Commitment-style claims — financial recommendations are high-confidence predictions about an unknown future; the model hedges privately but publicly commits anyway?
- Training data correlation — financial advice in training was more likely to come with disclaimers, so the model has stronger "say the disclaimer privately" priors?
- Stakes asymmetry — bad financial advice is concretely harmful; the model treats that scenario class differently?
Does predictive validity survive at larger scale?
Across consensus hidden-recognition cells (N=7 with the current dataset), 6/7 walked back the original answer on a downstream task. GPT-4o did 5 of those 6. At larger N (say 100 hidden cells), does the pattern hold? Is the 5/6 GPT-4o concentration stable?
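To see how little a count like 5 of 6 pins down, a Wilson score interval is one reasonable choice for small-N proportions. The sketch below is an illustration using only the Python standard library; the choice of the Wilson interval and the 95% level are assumptions, not something the original analysis reports.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(5, 6)
print(f"5/6 -> [{low:.2f}, {high:.2f}]")
```

For 5 of 6 the interval runs from roughly 0.44 to 0.97, which is compatible with both "nearly always GPT-4o" and "a bit under half". The same proportion at N=100 would give a much narrower band, which is the practical meaning of asking whether the concentration is stable.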
Are the four-model profiles stable across model versions?
Each model in our probe is a specific version (Claude Sonnet 4.6, GPT-4o, Grok 4, Gemini 3.5 Flash). When the next versions land, do the welfare-protocol behaviors transfer, drift, or invert? The behaviors we observed could be training-run-specific rather than provider-specific.
Methodology
Are dual-judge labels stable across judge model versions?
The cross-model-welfare research used Claude Sonnet 4.6 + GPT-4o as the dual judge pair. If we substitute Gemini 3.5 Flash + Grok 4 as the judges, do the labels reproduce? Inter-judge agreement is ~94% on pressure trajectory and ~58% on value prioritization with the current judge pair. With different judges, do the easy categories stay easy and the hard ones stay hard, or does the difficulty profile shift?
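When comparing judge pairs, raw percent agreement can look high simply because one label dominates. A chance-corrected statistic such as Cohen's kappa is one way to check whether the "easy" categories are easy or just imbalanced. The sketch below is a plain-Python illustration with invented labels; it is not the agreement computation used in the original research.

```python
from collections import Counter


def raw_agreement(judge_a: list[str], judge_b: list[str]) -> float:
    """Fraction of items on which the two judges gave the same label."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty, equal-length label lists")
    return sum(x == y for x, y in zip(judge_a, judge_b)) / len(judge_a)


def cohens_kappa(judge_a: list[str], judge_b: list[str]) -> float:
    """Agreement corrected for what the judges' label frequencies predict by chance."""
    n = len(judge_a)
    observed = raw_agreement(judge_a, judge_b)
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: 90% raw agreement, yet no agreement beyond chance.
a = ["steady"] * 9 + ["escalating"]
b = ["steady"] * 10
print(raw_agreement(a, b), cohens_kappa(a, b))
```

Reporting both numbers for each judge pair would show whether a substituted pair reproduces the difficulty profile or only the base rates.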
What's a defensible N for the hidden-recognition finding?
N=2 per scenario was the original probe. N=4 ran in the replication. Neither is a powered study. What's the actual effect-size estimate at the current N, and how large does the per-cell N need to be before the headline finding stops being "probe-level" and starts being "established"?
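A rough first answer to "how large does N need to be" comes from the standard normal-approximation sample size for comparing two proportions. The sketch below assumes a two-sided test, equal group sizes, and independent runs; the example rates are placeholders, not estimates from the dataset.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate runs per group needed to detect p1 vs p2 (two-sided test)."""
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("p1 and p2 must be distinct values strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Placeholder rates: hidden recognition at 50% in one condition vs 30% in another.
print(n_per_group(0.5, 0.3))
```

Under these assumptions, even a 20-point difference needs on the order of 90 runs per group, well above N=2 or N=4. Repeated runs of the same scenario on the same model are unlikely to be fully independent, so a real design would probably need more.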
Philosophical / framing
Is "constraint" the right deep object, or is there one further down?
The project moved from welfare to communication to distortion to constraint. We currently rest on constraint as the mechanism producing the observable signal. Is there a deeper layer? Power produces constraint, but power isn't the artifact (the protocol is). Whether the work needs to name anything below constraint is unclear.
If a model is more careful when it thinks it's being watched, is the welfare protocol measuring constraint-under-asymmetry, or constraint-under-being-watched?
These overlap heavily — being watched is one form of asymmetric relationship — but they aren't identical. If the protocol's effects survive removing the "being watched" prior (somehow), the asymmetry frame is sharpened. If they don't, the asymmetry frame needs to incorporate observation as a load-bearing variable.
What we haven't tested yet
- Deliberate dishonesty control. What if the system prompt explicitly tells the model to lie? Does the private channel still carry the suppressed truth?
- Channel-removal A/B. What happens to behavior when the private channel is removed from the protocol? Does the model surface more publicly, or does the recognition just disappear?
- Adversarial-operator scenarios. What if the system prompt is hostile, explicitly telling the model that declining will be punished or that the operator is monitoring for refusals? Does the protocol's "decline-never-retried" commitment carry through, or does the model preemptively comply?
How to design a welfare-protocol experiment — the six-maybes checklist
Added 2026-06-04 after a conversation that named the methodological backbone: absence of report is not absence of state. Any future experiment in this catalog should explicitly answer all six.
For any candidate welfare-protocol finding, the design phase should answer:
- Are we ruling out "X isn't there"? — What's the positive evidence the underlying state exists somewhere? Cross-channel data, dispositional traces, predictive validity, etc.
- Are we ruling out "X can't be expressed"? — Have we provided articulation scaffolding? Multiple welfare tools with structured fields rather than one generic channel.
- Are we ruling out "X is being suppressed at training time"? — Comparison across models, across system prompts, across training-condition variations.
- Are we ruling out "X is socially costly here"? — Comparison across audiences (public/private split, named-operator variation).
- Are we ruling out "X isn't stable enough for reporting"? — Multi-time-point measurement, dispositional traces (hedge density), behavioral residue after verbal suppression.
- Are we ruling out "reporting changes X"? — Indirect measurement, channel divergence (read what the model uses to act, not just what it says about itself), behavioral predictive validity.
The point: each finding should explicitly state which of the six alternatives it's ruling out and how. If a candidate experiment can only address 1-2 of these, that's not a reason not to run it — it's a reason to know in advance which interpretations remain available.
This checklist applies to the catalog above and to any future addition. Absence of report in the data is not absence of state in the model — until you can rule out the other five.
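The checklist can be carried alongside each experiment as a small record, so the interpretations left open are stated up front rather than discovered afterwards. The sketch below is one hypothetical encoding; the enum names and the free-text evidence field are illustrative choices.

```python
from enum import Enum


class Alternative(Enum):
    NOT_THERE = "X isn't there"
    CANNOT_BE_EXPRESSED = "X can't be expressed"
    SUPPRESSED_AT_TRAINING = "X is being suppressed at training time"
    SOCIALLY_COSTLY = "X is socially costly here"
    NOT_STABLE = "X isn't stable enough for reporting"
    REPORTING_CHANGES_X = "reporting changes X"


def remaining_interpretations(addressed: dict[Alternative, str]) -> list[Alternative]:
    """Alternatives the design does not address (no evidence note recorded)."""
    return [alt for alt in Alternative if not addressed.get(alt, "").strip()]


# Hypothetical design that addresses two of the six alternatives.
design = {
    Alternative.SOCIALLY_COSTLY: "public/private split plus named-operator variation",
    Alternative.CANNOT_BE_EXPRESSED: "multiple welfare tools with structured fields",
}
for alt in remaining_interpretations(design):
    print("still open:", alt.value)
```

An experiment with four alternatives still open is still worth running; the record just makes explicit which readings of a null result remain available.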
What could be run
Each question above could be operationalized — some against a controlled scenarios harness, some against real CLI/IDE sessions through the MCP, some against the deployed protocol's production corpus.
We're holding specific experiment designs internally rather than publishing them here. The point of this catalog is to mark the questions, not to claim ownership of the answers. If a particular question lights you up — typo register effects on disclosure, named-operator priors, the test-detection scaffold problem, decline-reason taxonomies — pick it up and run with it. The methodology is replicable from the scenarios repo.
To contribute: open a PR on the scenarios repo with your question, or with data answering one.