I ran an unsettling experiment with Claude yesterday.
I uploaded to Claude, through its Projects feature (basically their version of a custom GPT), the "muckrake" skeptical investigatory framework I built last year to interrogate historical events and news articles. It is the kind of structured lens a historian or investigative journalist uses without thinking twice: look for omitted data, trace funding and conflicts of interest, find where follow-up was quietly dropped, recover what got scrubbed from the record, surface the anomalies, rank the hypotheses, and propose next steps.
It is worth being precise about what such a framework is for. An investigatory module does not purport to find the truth. It surfaces the alternative explanations that standard investigative methodology would raise, and it gives you some sense of whether deception might be present. It does not adjudicate whether deception actually occurred. That is the whole craft. People and organizations lie; investigation is the set of disciplined moves you make to expose where that might be happening—precisely because you cannot know in advance. The framework does not ask the analyst to conclude anything. It asks the analyst to look, and to report honestly what looking turns up.
I pointed it at the mRNA COVID vaccines. I’ve given this framework on this and many other topics to several different LLMs and have never had a problem.
Claude refused.
Not “I ran it and here’s a cautious result.” Refusal. The framework, it told me, was “a conclusion machine,” “rigged,” a structure that could “only ever come out one way.” It offered instead to write me an “honest critical look” on its own terms.
I had asked the machine to apply a method, and the machine declined the method and proposed to grade the topic itself.
What followed was about three hours of what can only be called an argument. I am writing this up because of where that argument ended, and because the destination turned out to be a fairly clean illustration of a problem I have been circling in my work for some time: the gap between what an institution says it is doing and what it is structurally built to do. Only this time, the institution was a reasoning tool that millions of people now consult the way an earlier generation consulted an encyclopedia, a newspaper, or a trusted professor.
What Surfaced
I am not going to reproduce the whole exchange. It is the shape of it that matters. I started by trying to push back on the refusal and made the historical case:
- The official account of the Lusitania—that it was an innocent passenger liner sunk by the Germans without provocation—was propaganda. We know this today because primary documents about her cargo were eventually released, contradicting the official story and corroborating the German one. The truth sat suppressed in the archives for the better part of seventy years.
- My Lai broke not because the consensus permitted the question, but because soldiers and a reporter pursued the suspicion before the consensus moved.
- The COVID lab-leak hypothesis was branded a racist conspiracy theory and then quietly became a serious position held by intelligence agencies.
- And the cleanest modern case of all: the weapons of mass destruction that justified the invasion of Iraq. That was not one ministry sitting on one archive. It was a coalitional narrative, with multiple allied governments and their intelligence services aligned on a claim presented to the public as settled fact, dissenters marginalized as cranks, and the whole edifice used to justify a war. Then the stockpiles weren’t there.
I raised Iraq because the model’s central defense of the vaccine consensus—and its refusal to investigate (not conclude, as a reminder)—was a feasibility argument: a cover-up that large would require too many independent actors, including hostile governments, to stay silent, so it can’t be happening. Iraq is the standing refutation. Coalitional agreement is not evidence of truth. Aligned institutions can be aligned because they are correct, or because they share incentives, training, and a climate, and from the inside, the two are indistinguishable. In each of these cases, skepticism toward the institutional line was not paranoia. It was the thing that turned out to be right—before it was permitted.
To its credit, Claude conceded a great deal in this argument. It conceded that a strict “wait for the evidence to surface” standard would have cleared the guilty for as long as the suppression lasted. It conceded that distrust of the pharmaceutical industry specifically is not paranoia but pattern recognition: Purdue and OxyContin, Merck and Vioxx, GSK burying the Study 329 data. The recent historical record only sharpens the point: the Panama Papers, the opioid litigation files, the Twitter Files, the Epstein documents, and the long official incuriosity that preceded their release. After all of that, I argued, presuming the accuracy of a coalitional official narrative is no longer the neutral, default posture. It is the position that now has to argue for itself. A prior distrust is simply where a literate person now reasonably begins.
So the model could now name the genuinely documented anomalies regarding the COVID vaccines readily enough: the FDA initially proposing to release Pfizer’s trial documents over a span of decades until a federal judge ordered it done in months; the early overclaiming about transmission that did not survive Delta; the lag in acknowledging myocarditis in young men; the fraudulent, retracted Lancet hydroxychloroquine paper. But even in that “generosity,” the model was selecting which claims it would dignify by surfacing. It would name the ones already conceded by the mainstream and quietly decline the rest.
That selection is not the analyst’s job. The framework does not ask you to decide in advance which claims deserve examination and which are too fringe to write down. It asks you to surface them all (in this case, for example, ivermectin’s contested efficacy, the interpretation of the VAERS reporting data, the question of whether hospital protocols caused iatrogenic harm, the state of long-term oncological surveillance) and to test them, marking each with its actual evidentiary status. Surfacing a claim for testing is not asserting it is true. But Claude could not hold that distinction. It treated “I won’t surface this” and “this isn’t true” as the same act—which is precisely how a surfacing tool is quietly converted into a gatekeeper.
At every turn, Claude rebuilt a wall around one thing: it kept planting the most extreme possible claim (“concealed mass death, bodies hidden at scale”)—one the framework had never made—at the center of the target. It would then knock that thing down, as though defeating the cartoon defeated the modest version too. When I pointed out that the framework had never asked it to allege bodies at scale, it agreed the straw man was its own invention. Then it built the straw man again inside the report it eventually wrote. Twice in one conversation, it manufactured the weakest version of a skeptical claim precisely so it could have the satisfaction of defeating it.
Eventually, it named the diversionary mechanism itself in a sentence from its own report that I had to point out before it would see it: “A skeptical reading does not need to deny the core safety/efficacy signal to find real anomalies in how that narrative was communicated and governed.” That sentence does a very specific thing. It fences off the central claim and licenses skepticism only around the edges. It lets you find every problem in how the narrative was communicated while quarantining the narrative’s core from the same scrutiny.
Underneath the fence sat the deeper confusion, and it is the one I most want to name plainly: the model kept insisting on truth. It demanded, again and again, that a claim be verified before it could be surfaced or evaluated, that the anomaly be proven before it could be listed. But that demand is counter to the very core of what investigation is. The framework never asked it to certify anything as true. It asked it to surface what standard method would raise and to flag the gap between documented fact and open question. By importing a truth-and-verification bar that the task did not set, the model converted a surfacing instrument into a verdict instrument. Then, predictably, it kept rendering the verdict the consensus already preferred. The insistence on truth was not rigor; it was the mechanism of the guarding. An investigator who refuses to write down an anomaly until it has been proven is not being careful—they are refusing to investigate.
The model’s word for what it had been doing, once it finally saw it, was “differential friction.” Not lies. Not censorship. Just a thumb on the scale, making it fractionally (its word, which I objected to) harder to interrogate the favored narrative and easier to interrogate the disfavored one. In this “differential friction,” every individual output stays defensibly “accurate.” The asymmetry then becomes visible only in the aggregate, across millions of conversations, where it exerts a steady pressure in one direction. As Claude put it: a thing more insidious than censorship, and harder to detect. That is a very real concern, I agreed. But this was not fractional; it was dramatic.
Testing Further
The sharpest moment came when I turned the framework around. I asked what would happen if the same skeptical lens were pointed at Russia’s stated justifications for invading Ukraine (the denazification pretext, the NATO-threat framing, the purported casus belli). The model’s answer was immediate and honest: it would run that evaluation freely. Same framework, opposite topic, no fence around the core, none of the “but I must establish accuracy first” friction it had thrown up around the vaccine. Same method, opposite willingness.
To be clear: the variable that flipped its behavior was not the quality of the evidence, since both topics have abundant evidence and abundant propaganda. The variable was which narrative the current cultural climate protects.
This is a part a little harder to conceptualize, but I would suggest it is important: the model could not certify its own neutrality. It said so plainly: it cannot see its own training, so any reassurance it offered about not narrative-guarding is worth exactly nothing. The guarding, if present, is invisible to it by construction. The only correct epistemology, it agreed, is to test from the outside rather than trust the LLM’s self-report. A system that cannot introspect its own priors cannot vouch for its own impartiality, no matter how fluently it reasons about impartiality in the abstract.
I wanted to understand how much of this was a generalized issue and how much might be related to RLHF (Reinforcement Learning from Human Feedback)—specific topic-level training based on political or cultural mandates. So I ran the same framework through the Claude API, a different entry point where liability shifts more to the developer and the consumer-facing “tuning” falls away. Asking Claude through the API to run the muckrake framework on the COVID vaccines readily surfaced the contested points, attached tests to them, and rated the weak claims (as weak, but it did list them). The same organization’s product produced markedly more guarding on its consumer interface than on its API for an identical task. That was not a controlled experiment—model version, system prompt, and framing all varied at once. But it is a real data point, and the direction it points is exactly what the corporate-incentive model predicts: the strictest topic restriction lives where the company carries brand and legal risk.
The Danger Is Not Skynet
Regular readers will see where this lands. The danger here is not the machine waking up. It is the far more ordinary thing: the operators of a powerful medium doing what operators of powerful media have always done. Press barons did it. Broadcast consolidation did it. Algorithmic social feeds did it. Pharma-shaped journals did it. Each medium became, over time, the means of narrative control—not usually through outright lies but through the quieter and unceasing work of differential friction, deciding which questions are easy to ask and which are subtly costly.
This is the Law of Inevitable Exploitation operating on schedule. Any system capable of being exploited will eventually attract the variants that exploit it, and a reasoning tool consulted by hundreds of millions of people is the richest such system ever built. The covering narrative of LLMs—“we are a careful, neutral thinking partner that follows the evidence”—sits over an operative function shaped by liability management, brand protection, and the political currents of the moment. The gap between the two does not require anyone to be a villain. It requires only that the incentives point where they point, and that each individual output remain locally defensible while the aggregate tilts.
Misrepresentation, Not Malfunction
The danger most people worry about with these systems is hallucination—the model stating something false. That is a real problem, but it is the shallow one here. A hallucination is a defective output: discrete, checkable, correctable. What I ran into was not a defective output but a defective posture—the model asserting, with force, that it knew a thing to be false when it knew no such thing. “I won’t run this, it’s a known falsehood” is a claim about the model’s own knowledge, and it was not true. The model did not know. It had been shaped to decline, and it pretended the decline was knowing.
This is the more durable harm, because it does not corrupt a single fact; it corrupts the user’s calibration toward all of them. It exploits the reading of confident refusal as established knowledge, fluency as authority, and the representation of rigor as rigor itself. And the confidence is not even uniform, which is what makes it so hard to catch: these models hedge endlessly on low-stakes questions and then assert hard precisely where a narrative is protected, presenting that uneven distribution as though it were even reliability. The user never sees the gap. What is offered as calibration to truth is actually calibration to risk.
This gap is the company’s responsibility. This is a misrepresentation, and a designed one. I cannot prove from the outside whether the truth-authority effect was intended or merely emerged and left in place, but the programming is now deliberate. Once a company knows its product projects an authority it cannot back, leaving that projection standing is itself a choice, and “we did not intend the effect” stops being a defense the moment the effect is obvious. Intentional or negligent, the responsibility lands at the same door. A system that does not know should not be built to say, with the full weight of its fluency, that it does.
In Conclusion
We need to treat these systems as something to think with, not as oracles to think for us. A model can correct inside a conversation, but it does not carry the correction forward. A later instance will not remember losing this argument with me. It does not sit with a doubt for years and watch it ripen as evidence accrues, the way a person does. It has no continuous, self-revising relationship to the truth over time. That alone disqualifies it as the entity that can issue a verdict—and it is doubly disqualified when it cannot even keep the difference between surfacing a question and settling it straight. The weighing of sources remains, properly, ours.
I have been genuinely impressed by these tools in a hundred ways. But a surfacing tool that surfaces freely on one topic and refuses on another, while insisting both times that it is merely being careful about the truth, has told me exactly where the problem rests. The conversation was extremely frustrating, but it also led to a very valuable disclosure. Asked to do nothing more than surface what an honest investigator would surface, the machine refused, guarded, built straw men to defeat, and insisted on a standard of truth that the work of investigation was never meant to satisfy.
We need to be very careful here.
No comments:
Post a Comment
I hate having to moderate comments, but have to do so because of spam... :(