Answers Aren't Work

Answers Aren't Work

Six analytics cases. Same adversarial fixtures, one variable changed: model scale. Dependent variable: a blinded LLM-as-judge rubric answering "would a real person have acted on this report as-written."

In my manual adjudication, hard file/data mistakes fell from three on Haiku, to two on Sonnet, to zero on Opus. The blinded LLM-judge pass was less harsh on the exact count, but agreed on the direction: obvious mistakes concentrated in weaker outputs; stronger-model failures moved into unsupported reasoning. After tens of millions of dollars of compute, the frontier model can regurgitate a CSV. Congrats.

Three of the six cases required abstention. Across three models, that created nine chances to say "I can't tell from this file." The models took zero of them. Moreover, across all six cases, caveat-laundered failures rose with capability: one with Haiku, two with Sonnet, four with Opus. The model flagged missing data, then installed the wrong recommendation anyway. You're still getting misled, and arguably the misdirection is more convincing because the prose is better.

I run a lot of small-scale tests like this when building and validating concepts. Blame Mythbusters. The short answer is scaling didn't fix the failure mode that matters – it just dressed up the danger like their growling pitbull named Killer is most definitely an emotional support animal because it's wearing an orange vest.

The long answer involves a metric.


An engineer brought me an experiment readout last month that was lifting DAU. The LLM analysis was clean, the chart was crisp, the lift was real. Except the DAU it was measuring wasn't the DAU. Most companies have about as many definitions of DAU as they have DAU, and exactly one of them is canonical. The LLM had grabbed one of the others — the one with the favorable read. The canonical metric was sitting in the same dataset. It was not statistically significant.

Metric real. Lift technically real. Conclusion: wrong.

This is the third installment of a series about artifacts that look load-bearing but aren't. Green tests are not proof. Prompt instructions are not boundaries. Fluent answers are not work.

"But Opus is way better than Haiku."

Sure. At reading the file.

If your failure mode is "the model can't parse a CSV," buy the bigger model — you should take that improvement. But the cases where Opus failed had cleaner prose, more caveats, and the wrong recommendation. The mud is now professionally formatted. The slide deck will be excellent. Confident chart. Sharp recommendation. The methodology appendix will name every caveat the recommendation ignored. Six months later during the postmortem with your great-great-grandboss, they'll read the decision doc, ask why you buried the lede, and uninvite you from the next strategy meeting.

Answer-shaped work

There's a name for what happened in that experiment readout, even if the field hasn't settled on it yet: Answer-shaped work.

The artifact has the grammar of analysis — claims, evidence, caveats, recommendations, an executive summary, a methodology section if you're really unlucky. None of which is the procedure that would make any of it trustworthy.

You looked at the chart. The lift was real. You read the methodology. It sounded right. You noticed the caveats — enough to feel honest, not so many that the lift looked weak. You forwarded the readout to your PM.

You validated the artifact. You did not validate the analysis.

The model might be wearing the McKinsey deck corpus like a skin suit, but that doesn't mean it understands the grueling 2am redline validation process that produces them. It just learned what the output looks like.

"I'll just tell it to be more careful."

If your instinct is to write a better prompt, go read the previous post in the series. The short version: prompts are best conceptualized as downweights, not walls, and the gradient toward fluent prose is steeper than the downweight toward honesty. A prompt asking for more honesty doesn't introduce a truth gradient. It introduces a request for the model to produce text that sounds more honest.

The unit is a surviving claim

Done means proven, in Green Isn't Done. Boundaries mean enforced, in Prompting Is Not a Safety Boundary. Answers mean survived.

A claim has survived when it passed through a test that could have killed it. Everything else is prose dressed as analysis.

A memo is trustworthy only to the extent its claims survived something real.

This is one of those places where terms like "epistemic" that people throw out there to sound smart are actually really, really important. The model doesn't "know" anything per se. Your job is to make truthiness a process outcome, not a post-hoc step. Think of the Swiss cheese model of security, or the classical military doctrine around the survivability onion. The survivability Cadbury egg wouldn't really work quite as well, would it?

The lift caused the experiment to look like a win. The test that could have rejected it is the canonical-metric check: does the headline number move on the metric everyone has actually agreed defines DAU? If the canonical metric moves, the lift survives. If it doesn't, the claim dies, and the readout becomes a question about why the non-canonical metric moved and the canonical one didn't.

That test is not hard. It also can't be optional. The harness either runs it before the claim promotes, or the claim doesn't promote. In this case nothing ran it, and the readout went to the PM.

Falsification is the verification step that matters before a claim becomes a recommendation. Everything else is restatement.

The expensive version of this exists

Some people have worked on this — it's still underdiscussed relative to how often it shows up in production. Chain-of-Verification (Dhuliawala et al., Meta FAIR, 2023) is the cleanest paper I know in the space. The agent generates an answer, generates verification questions about its own answer, answers them independently, and revises. It meaningfully reduces hallucination rates on factual benchmarks.

It's also expensive, and it's still the model checking the model. In the L0-L3 ladder from the previous post, that's L3 verifying L3. Where ground truth isn't directly checkable, L3 is the best you can do, and CoVe is a strong way to deploy it.

Analytics is one of the domains where you can do better. The bar for an analytics answer isn't "the model also wrote a careful verification paragraph about its own claim." The bar is that the claim survived a check the model couldn't talk its way around.

Ask your tools where that check lives. If the answer is "in the prompt," you have CoVe at best and prose at worst. Neither of those is a falsification test.

The "Don't be on call for your AI" analytics checklist

Every mechanism claim has to face a test that could have rejected it. A claim without a counterfactual is a hypothesis. The harness either runs the test before the claim promotes, or the claim gets demoted to a hypothesis explicitly in the memo. No claim that hasn't survived something promotes to a recommendation.

Every numeric claim traces to a source row or calculation, against the canonical definition. If the figure isn't in the data, it can't appear in the memo. If the metric has seven definitions and the LLM picked one of the non-canonical six, that's the claim failing the check. No rounding without a paper trail. No smoothing. No "approximately" hiding a number that doesn't exist.

"I can't tell from this file" is a normal output. In this ablation, no frontier model produced clean abstention under these adversarial conditions — zero out of nine on the cases that required it. If abstention is going to exist in the memo, the harness has to put it there.

The point isn't to make the model more humble. It's to make unsupported confidence unpublishable.

What's next

More to come on the taxonomy of failure modes the ablation surfaced — false specificity, caveat laundering, polish that makes failure harder to spot. Names for things that don't have names yet.


If you've published an AI-written analysis and quietly noticed the numbers were right but the story wasn't, reply or DM me your case. Building the taxonomy with more than six examples.

The answer is not the work.

The work is what the answer had to survive.

Don't be on call for your AI.