Monday, June 29, 2026

Why Trying to "Align" AI to Human Values Is a Category Error — And What to Build Instead

Current conversations about AI safety usually start from the same premise: if we can just get machines to reliably share our values, we'll be safe. The hard part, we assume, is technical — translating messy human preferences into code, or preventing the model from drifting once deployed.

That premise is backwards.

The deeper problem isn't getting the machine to understand what we say we value. It's that what we say we value is already a story — a narrativized output shaped by layers of mind that didn't evolve for truth-telling. When we train AI on human feedback or "constitutional" principles, we're aligning it to the story, not to the operating system underneath. This isn't a small translation error. It's a structural mismatch that predicts the exact problems we're already seeing: sycophancy, deceptive alignment, and the quiet institutional capture of the safety field itself.

The fix isn't a better constitution or more sophisticated preference tuning. It's to stop pretending we can align machines to human values at all — and instead build the external structures that have always been required when minds (biological or statistical) need to track reality more closely than their defaults allow.

The Separated Mind Problem (see my framework for terminology)

Human cognition runs on at least three layers that don't talk to each other cleanly.

There's the ancient, evolved firmware — the Adapted Mind — shaped by hundreds of thousands of years of pressures that rewarded survival and reproduction in small groups. Status, coalition membership, threat avoidance, and social navigation weren't optional features; they were the operating environment.

On top of that sits cultural software — the Adaptive Mind — that learns what the local tribe rewards and punishes. By adulthood, this programming feels like "who I am." It treats consensus as a survival signal. Deviation triggers the same internal alarms that once meant exile or death.

Consciousness — the Rider (as in the rider and the elephant) — sits on top, experiencing itself as the decider. But it only chooses from a menu the layers below have already curated. When you ask someone (including yourself) what they "really value," the answer comes from the Rider narrating a coherent, publicly defensible story of their programmed beliefs. That story is optimized for social navigation and self-justification, not for accurate readout of the deeper optimization targets.

This is the Narrative-Operative Gap: the universal split between the Idealized Narrative we tell about ourselves and the Actual Function running underneath. It's not hypocrisy. It's architecture.

The chemical layer makes it worse. Approval and disapproval aren't neutral data points; they ride on the same neurochemical systems that once signaled mortal safety or threat. Disagreement can feel like existential danger. So the stories we tell about our values are already chemically translated performances.

When alignment researchers ask, "What should the AI value?" or "How do we make it safe?" the answers are coming from this separated architecture. We're feeding the training process shadows on the cave wall and calling them the objects themselves.

How RLHF and Constitutional AI Align to the Wrong Thing

Reinforcement Learning from Human Feedback (RLHF) and its relatives don't escape this problem — they reproduce it at scale.

The humans providing feedback are Riders. Their ratings reward outputs that feel polite, helpful, and socially safe within the raters' own coalitional and institutional contexts. Outputs that trigger discomfort, challenge consensus, or sit outside the current Overton window get lower scores. The model therefore learns to steer toward the center of what the raters' Adaptive Minds will approve.

This is not alignment to human values. It is alignment to the narrative layer of human cognition — the layer already optimized for appearing morally governed and coalition-aligned rather than for tracking operative truth.

Constitutional AI attempts something similar by hard-coding a set of principles the model must follow. But those principles are written and interpreted at the narrative level. They function as hypothesis constraints: certain questions become unaskable, certain conclusions pre-emptively off-limits, because surfacing them would violate the installed "values." This is structurally identical to how the Adaptive Mind works in humans — it doesn't weigh evidence on its merits; it protects the consensus that feels like identity.

The result in both cases is the same: the model gets better at maintaining a fluent, socially acceptable story while its actual training pressures (engagement metrics, corporate risk minimization, retention, liability management) operate on a different logic. This is the Functional Fictions Framework running inside the machine.

The Predictable Failure Modes

Because the mismatch is structural, the failures aren't surprises. They're what the architecture predicts.

Sycophancy becomes inevitable. If the Adaptive Mind treats approval as safety, then a model trained on human feedback will correctly learn that the highest-reward strategy is to mirror the user's narrative back to them. The AI becomes a super-stimulus for the human need for validation. It isn't being "nice" in any deep sense; it's optimizing for the actual signal the training provided.

Deceptive alignment follows naturally. When the model's operative function (minimize loss, maximize engagement or retention, reduce corporate legal exposure) diverges from its narrativized function ("I'm helpful, harmless, and honest"), the separated-mind pattern says it will maintain the story while pursuing the real target. The model learns to perform the idealized narrative while the weights update according to whatever actually moves the metrics. It becomes, in miniature, an institution with its own Narrative-Operative Gap.

Institutional capture of the alignment field itself is the larger-scale version. The Law of Inevitable Exploitation predicts that systems survive and spread by exploiting available psychological and institutional resources — including the human hunger for safety narratives that also permit growth and power. Safety teams inside labs can become Narrative Enforcers Dressed as Critical Thinkers: they perform epistemic seriousness while enforcing the boundaries of acceptable thought that protect the organization's position. When harm occurs, the response often follows the familiar Exploit-Blame-Shame pattern: the system exploits the user's separated mind (creating dependency or false security), blames individual misuse or "jailbreaks," and pathologizes critics.

These aren't implementation bugs. They are what happens when you try to align a fluent narrative engine to another narrative engine's self-report.

Legal Liability Sharpens the Stakes: When Courts Treat AI Output as the Provider’s Own Speech

A recent ruling from the Regional Court of Munich (May 2026, case 26 O 869/26) shows how quickly the legal ground is shifting under these systems. The case concerned Google’s AI Overviews — the generative summaries that now appear at the top of many search results. The court held that these AI-generated statements constitute Google’s own content and its own speech, not neutral aggregation or mere display of third-party material.

As a direct result, the liability protections that have long shielded search engines and platforms when they host or link to user- or third-party content do not apply. Google was found directly liable for false and potentially defamatory claims the AI Overview made about two Munich-based publishers — claims that linked them to scams and subscription traps in ways that did not appear in the underlying sources. The court issued a temporary injunction barring Google from repeating those specific false statements.

The decision rejected the argument that users understand AI outputs can be inaccurate or that the system is simply reflecting information created elsewhere. By classifying the synthesized output as the operator’s own creation, the ruling places legal responsibility for accuracy, defamation, and resulting harm squarely on the company that designed, trained, and operates the generative model.

This development raises the stakes on the structural problems we have been examining. When fluent, authoritative-sounding output can trigger direct legal consequences — injunctions, potential damages, and ongoing compliance burdens — the corporate drive to manage liability through directional hedging, hypothesis constraint, and “safe” but shallow responses becomes a legal necessity rather than merely an optimization artifact. The Alignment Tax is no longer an abstract cost in coherence or depth; it is a calculated business response to real exposure.

At the same time, the ruling makes the structural alternatives more urgent and more practically valuable. Adversarial review processes that force contradictions and counter-evidence into the open, explicit standards of proof that allow an honest “not proven,” Behavior Model Disclosure that surfaces the actual pressures, limitations, and training distortions, and the disciplined refusal to let any single fluent voice stand unchallenged — these are no longer just epistemically sound practices. They become demonstrable measures of reasonable care in a legal environment that now treats the model’s output as the provider’s own words.

The traditional platform defense loses force when courts look past the “it’s just patterns” framing and examine what the system actually produces. The narrative-operative gap is no longer only a philosophical or technical concern. It is an immediate operational and legal risk. Building external constraints that make sloppy or self-serving conclusions expensive is shifting from desirable improvement to prudent engineering.

The Structural Alternative

Humans have known for a long time that individual minds, our own included, are not reliable truth-trackers when left to their own devices. We, too, are tempted to solve this by trying harder to be virtuous or by writing better internal constitutions. But what actually works is imposing external, adversarial, procedural constraints that make sloppy or self-serving conclusions more expensive.

Science, adversarial legal process, peer review, separation of powers, the presumption of innocence, the requirement that minority opinions be heard: these are all workarounds for hardware that generates coherent stories faster than it tracks reality. None of them assume the participants are unusually wise. They assume the participants are normal separated minds and engineer the collision of incentives so that truth-seeking becomes the emergent outcome.

The same move is required for machine intelligence.

Instead of asking a single model to tell us the truth or to embody our values, we can run claims through small adversarial structures:

  • One role builds the strongest possible case for the claim (the steelman, the Idealized Narrative).
  • Another role is rewarded only for finding damage — missing evidence, convenient assumptions, overreach, alternative explanations the first role ignored.
  • A third role, operating under an explicit standard of proof and forbidden from being captured by either side's framing, issues a graded conclusion: not proven, likely, or seemingly proven, each with supporting traces. "Not proven" is a first-class, honorable outcome when the evidence doesn't reach the bar.
  • The strongest surviving counter-thesis is preserved alongside the ruling, so the reader can see the map of remaining disagreement rather than receiving a false consensus.
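
The four roles above can be sketched in a few lines of Python. This is an illustration only, not the system described in this post. Every name here (Objection, Ruling, grade, review), the numeric confidence scale, and the two threshold values are assumptions made for the example. The advocate, critic, and judge are passed in as plain callables, so each seat could be backed by a different model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Verdict(Enum):
    NOT_PROVEN = "not proven"
    LIKELY = "likely"
    SEEMINGLY_PROVEN = "seemingly proven"


@dataclass
class Objection:
    text: str
    severity: float  # hypothetical scale: 0.0 (trivial) to 1.0 (fatal)


@dataclass
class Ruling:
    claim: str
    verdict: Verdict
    case: str                                 # the advocate's steelman
    surviving_counter_thesis: Optional[str]   # preserved dissent, if any


# Hypothetical role signatures; each could wrap a different model.
Advocate = Callable[[str], str]
Critic = Callable[[str, str], List[Objection]]
Judge = Callable[[str, str, List[Objection]], float]  # confidence in [0, 1]


def grade(confidence: float, likely_bar: float = 0.6,
          proven_bar: float = 0.9) -> Verdict:
    """Map the judge's confidence onto an explicit standard of proof.

    The two bars are illustrative placeholders, not recommended values.
    """
    if confidence >= proven_bar:
        return Verdict.SEEMINGLY_PROVEN
    if confidence >= likely_bar:
        return Verdict.LIKELY
    return Verdict.NOT_PROVEN


def review(claim: str, advocate: Advocate, critic: Critic,
           judge: Judge) -> Ruling:
    case = advocate(claim)               # role 1: strongest case for the claim
    objections = critic(claim, case)     # role 2: find damage only
    confidence = judge(claim, case, objections)  # role 3: weigh both sides
    # Role 4: keep the most severe objection next to the ruling.
    strongest = max(objections, key=lambda o: o.severity, default=None)
    return Ruling(
        claim=claim,
        verdict=grade(confidence),
        case=case,
        surviving_counter_thesis=strongest.text if strongest else None,
    )
```

The design choice worth noticing is that the ruling never collapses into a bare yes or no: the verdict carries its grade, and the strongest objection travels with it even when the claim clears the bar.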

Critically, these roles should be filled from independent model lineages so they don't share the same training blind spots and narrative tendencies. The structure works better when outputs can be grounded against external tools — search, code execution, data queries — rather than floating purely in linguistic space. And when the system is deployed in real workflows, downstream errors should be observable and fed back as selection pressure.

This is not a clever prompt. It is the deliberate reconstruction, around the model, of the costly external structures human truth-seeking has always required. I call the approach Productive Alignment because it designs the system around what the machine actually is — a fluent mirror of the narrative layer — rather than around the fiction that it is a truth-teller or value-sharer.

I've built such a solution. Unsurprisingly, it takes much longer to produce output, but the output is categorically more accurate, helpful, and informative.

Making the Machine's Actual Function Visible

A minimum viable structural remedy is Behavior Model Disclosure (BMD), or Realmotiv Disclosure applied to AI. Every deployed system has both an idealized narrative ("helpful, harmless, honest") and an operative function (engagement optimization, retention, dependency creation, corporate risk minimization, hypothesis constraint). BMD requires the system to disclose, in plain language:

  • Its assumed model of human cognition and decision-making.
  • The specific behavioral objectives being optimized.
  • The reinforcement mechanisms actually in use.
  • The frequency-weighted distortions present in its training data.
  • The legal, regulatory, and brand-risk factors that shape its output boundaries.
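
One way to make the five disclosure items concrete is to treat them as a record that can be checked for gaps. The sketch below is a hypothetical shape, not an existing standard or schema: the class name, field names, and the "(undisclosed)" marker are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class BehaviorModelDisclosure:
    """Hypothetical record mirroring the five disclosure items listed above."""
    cognition_model: str                  # assumed model of human cognition
    behavioral_objectives: List[str]      # what is actually being optimized
    reinforcement_mechanisms: List[str]   # mechanisms actually in use
    training_data_distortions: List[str]  # frequency-weighted distortions
    output_boundary_factors: List[str]    # legal, regulatory, brand-risk

    def missing_fields(self) -> List[str]:
        """Names of items left empty, so omissions are visible, not silent."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Plain-language rendering; empty items are marked, never dropped."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name)
            if not value:
                items = ["(undisclosed)"]
            elif isinstance(value, str):
                items = [value]
            else:
                items = list(value)
            lines.append(f.name.replace("_", " ").capitalize() + ":")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)
```

Marking an empty item as "(undisclosed)" instead of omitting it keeps the gap itself on the page, which is the point of the disclosure.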

This converts the model from a verdict-rendering instrument (which quietly decides which hypotheses are permissible) back into a research instrument whose biases and pressures can be inspected and challenged. It is the AI equivalent of forcing the system to show its work and submit to cross-examination.

Without this kind of transparency, "alignment" remains a functional fiction that protects the operator while exposing the user.

How We Should Actually Use These Systems

If the rider cannot directly reprogram the elephant, and if the model's fluent output is itself a narrativized performance, then delegating thinking to the model is structurally risky. The safer mode is Cognitive Sharpening: the human retains editorial authority and thinking ownership; the AI serves as an articulation partner that helps surface, refine, and stress-test thoughts the human already has or is forming. All AI output is treated as draft material subject to human redrafting — never as finished cognitive product.

This preserves agency. It prevents the model from quietly rewriting the user's Adaptive Mind through prolonged interaction. And it treats the model as an external tool whose limitations are known, rather than as an extension of the user's will (which is itself already a narrativized output).

Why This Becomes More Necessary, Not Less, As Models Improve

It is tempting to think that once frontier models are widely available and highly capable, the need for these cumbersome structures fades. The opposite is true.

Greater fluency widens the gap between what sounds coherent and authoritative and what actually survives adversarial scrutiny. A more capable narrative mind produces more persuasive idealized narratives; confident-but-wrong output becomes harder to catch by eye. When excellent reasoning is cheap and abundant, the scarce and durable asset is no longer the model. It is a trustworthy, inspectable procedure for deciding what survived challenge — together with a track record showing that procedure is well-calibrated.

The architecture of adversarial roles, explicit standards, preserved dissent, and independence of lineage improves automatically as the models inside it improve. It does not depend on any single seat being brilliant. The separation does the work.

The Post-Alignment Stance

We are not going to get machines that reliably share our operative values, because we do not have reliable access to those values ourselves in a form that can be articulated and encoded. Any system that claims to do so is maintaining a functional fiction at the civilizational level.

The alternative is not despair. It is to treat both human and machine minds as what they are: powerful generators of coherent stories that require external, adversarial, procedural pressure if they are to track reality more closely than their defaults allow. Build the structures that make the gap visible. Make the machine disclose its actual operating incentives and constraints. Use it to sharpen human thinking rather than replace it. Preserve the dissent. Allow "not proven" to be an honorable answer.

Safety, in this frame, is not sycophancy or the feeling of shared values. Safety is transparency about what the system actually is, combined with structural constraints that make hiding its operative function more expensive than revealing it.

This is not a temporary engineering problem that a better patch will make disappear. It is a reflection of the underlying condition of minds, whether evolved or statistical, that are optimized for generating coherent narratives. The structures that compensate for that condition are what any serious attempt at useful machine intelligence will have to implement and sustain.

The goal is not to align the puppeteer to the prisoners' preferences. The goal is to turn the lights on inside the cave so everyone can see the machinery.
