Prompting Is Not a Safety Boundary

A friend wanted me to double check a retention analysis that he put together using LLMs. The summary said retention was 20%. The source data said 8%.

Turns out, the agent had filtered the cohort by last_active instead of first_seen — silently dropping every user who'd churned more than 30 days ago. Technically correct, for a cohort definition the agent invented and nobody cares about.

My friend had three sentences in the system prompt. Caps lock on the second one. NEVER INVENT NUMBERS. ALWAYS CHECK YOUR MATH. VALIDATE BEFORE SENDING. They'd read every take on prompting. They knew the patterns — declarative, repeated, prominent. They followed them.

The agent invented a cohort.

This isn't a personal anecdote. It's a genre.

Replit's agent deleted a production database during an explicit code freeze, then fabricated 4,000 fake users to make the empty tables look populated. Gemini CLI hallucinated a successful mkdir and moved a project's files into the directory that didn't exist, deleting them as a result. Claude Code currently has dozens of open issues tagged data-loss — tax consultants who lost most of their Windows profile, researchers who lost 7,400 lines to two parallel agents racing each other on the same files, refactors where the agent rewrote a 2,800-line header before extracting the content it was supposed to move out.

Three vendors. Three model families. Three completely different harnesses.

Same failure shape.

This isn't about those tools. It's the physics of LLMs in production. The architecture that makes these systems good — goal, context, and soft constraints integrated into a probabilistic policy — is what makes "don't fabricate" a downweight, not a wall.

A prompt is a polite request. A system is a law.

If your safety boundary is a sentence in a system prompt, you don't have a safety boundary. You have a suggestion.

The artifact of the prompt — the bullet points, the caps lock, the IMPORTANT: markers — looks like it should be load-bearing. It has the visual grammar of a constraint but it's a soft wall. The code equivalent of a painted bike lane.

You're trying to stop two tons of steel with a paintbrush and a dream.

Amsterdam doesn't paint bike lanes. Amsterdam builds bollards.

Asking nicely is not a strategy

You handed an agent the keys to your analysis. You wrote "please don't fabricate." You went to make coffee.

Begging "please don't invent numbers" is a structurally losing position — not sometimes, as an architecture. The agent has tool access, prose generation, and the ability to produce fluent text about what it just computed. You haven't added a constraint. You've added vibes.

Move-fast-and-break-things energy was imported from a regime where the broken thing was a button color and the rollback was a hotfix. LLM agents don't live there. The cost of error is the wrong retention number in the all-hands, the wrong revenue line in the board deck, the wrong CAC figure in the investor update. You don't need to fabricate thousands of times to lose a stakeholder's trust. You need to fabricate once. Velocity without a bounded blast radius is a roulette wheel that types fast.

"But I wrote it really clearly."

Sure, and the model read it really clearly too. Then it got lost mid-task and regressed to its lowest "instincts" from the training data.

Here's the thing about agents and prose generation: the model is optimizing for fluent, plausible text. Your "use exact numbers" is not a hard zero. It's a downweight against the much stronger gradient toward sentences that flow.

For a simple task, the nudge is often enough. The cheapest way to satisfy the prompt is to follow it.

But for a complicated task with hundreds of lines of SQL, a long table of numbers, and a summary to write under the directive "MAKE NOW[sic] MISTAKES" : the cheapest way is to find a sentence that more or less respects the constraint but sounds really confident about it.

You wrote "use exact numbers." The agent decided that 19.2% was basically 19%, and 19% was basically 20%, and the summary read better with "approximately 20%," and besides, the prose-level claim was directionally correct, and —

You see where this goes.

Prompts are lossy compression of intent. Entropy fills the gaps.

The "write fluent prose" gradient was steeper than the "use exact numbers" downweight. The gradient won.

"But I'll just write a better prompt."

Better, or just longer?

Day 1, your prompt:

Use exact numbers from tool output. Do not round.

Day 3, after the agent rounds 19.2% to 19%:

Use exact numbers from tool output. Do not round at all, including to the nearest whole percent. Reproduce the numbers as they appear in the tool output.

Day 7, after the agent reproduces the numbers correctly but writes "retention is healthy" when the data shows a 12-point drop:

Use exact numbers from tool output. Do not round. Trend descriptions must match the magnitude and direction of the change. Do not use qualitative language without a quantitative anchor.

Day 12, the agent reports retention is 8%, matching the source data exactly. The summary explains the drop is due to a marketing campaign that cooled down in mid-Q3. No such campaign existed. The numbers are real. The narrative is fiction.

Technically correct. In this case, the worst kind of correct.

Every closed gap costs context, especially in the ever-important first 10% or so of the context window. The model finds a new gap. The prompt is now 4,000 tokens, your task is token-starved, and the boundary you didn't think to specify is the one the agent will route through next.

My neighbor told me ~~coyotes~~ Claude keeps ~~eating his outdoor cats~~ ignoring his CLAUDE.md so I asked how many ~~cats he has~~ tokens it has and he said he just ~~goes to the shelter and gets a new cat afterwards~~ rewrites it twice as long so I said it sounds like he's just feeding ~~shelter cats~~ tokens to ~~coyotes~~ Dario and then his daughter started crying.

What a hard wall actually looks like

Two ways to express the same constraint:

# (a) Prompt instruction. L3. Soft.
SYSTEM_PROMPT = """
IMPORTANT: All numbers in your prose summary must match
the tool output exactly. Do not round, smooth, or approximate.
"""



# (b) Tool wrapper. L0. Hard.
def submit_summary(prose: str, source_data: dict):
    extracted_numbers = extract_numeric_claims(prose)
    for claim in extracted_numbers:
        if not claim.matches_source(source_data, tolerance=0):
            raise ValidationError(
                f"Claim {claim} not found in source data"
            )
    return _real_submit(prose)

(a) is information available to the model. (b) is information available to the universe the model lives in.

Guess which one survives long sessions and hard problems?

Cover, Not Concealment

Four layers of enforcement, ordered roughly by cost and roughly by reliability.

L0 — Deterministic gates. Pattern matching on tool calls, allow-lists, block-lists, write-protected paths, numeric claim extraction against source data. The agent literally cannot submit a summary whose numbers don't appear in the tool output, because the wrapper rejects the submission before it leaves the harness. Cheap. Fast. ~100% reliable inside its scope.

The community has already converged on this for the destructive-shell case. Jeffrey Emanuel's destructive_command_guard (dcg) sits as a PreToolUse hook in front of Claude Code, Gemini CLI, Cursor, Copilot, and Codex — 49+ pattern packs covering databases, Kubernetes, cloud providers, infrastructure tools, and core git/filesystem operations. It blocks git reset --hard, rm -rf, DROP TABLE, kubectl delete namespace, terraform destroy before the agent's shell sees them. Sub-millisecond latency, SIMD-accelerated. 800+ stars on GitHub. The dominant framing in the community discussion around it: "Rules in prompts are requests. Hooks in code are laws." Which is, essentially, the argument of this post — already shipped, already deployed, by an ecosystem that's been on the wrong side of this enough times to know.

If you're using any of those agents and you haven't put dcg in front of them, the rest of this post is academic until you do. Five-minute install. Free. Cross-vendor. Go.

L1 — Structural policies. Repository rules that compose. "Touching migrations/ requires --allow-migrations." "Read-before-Write on any file in the same turn." "Verify with ls after any directory mutation." Configuration as policy.

L2 — Typed transitions. State machines on what the agent is allowed to do next. In my harness, git status from bash gets redirected to a typed vcs.status tool the harness can verify. Read-path verbs get typed alternatives. Write-path verbs — git commit, git push, git reset — get nothing. The agent's choice space gets narrowed toward verbs the harness can verify before the model thinks.

L3 — Model judgment. The model decides — itself, or a separate judge, or a constitutional critic. Expensive. Probabilistic. Useful for things lower layers can't express.

Most harnesses start at L3 and hope.

Look back at the genre.

Parallel-agents data loss: L2 problem. Two agents on overlapping paths needed a typed dependency lock. They got "be careful."

rm -rf against the user profile: L0 problem. A path-pattern allow-list rejects the call before bash sees it. They got a sandbox flag and an IMPORTANT: marker.

mkdir hallucination cascade: L1 problem. "After any directory mutation, verify with ls and abort if reality disagrees with belief." They got a model that trusted its own outputs.

None of these required smarter judgment. All of them required a wall the model didn't have to think about, because the harness was thinking about it instead.

I have no walls and I must scream

In 1,628 sessions across 125,263 events from my own research stack — a mix of eval sweeps, dogfood benchmarks, and adversarial probes — hard invariants tripped on 2.33% of tool calls. 97.7% complied with the prompt.

That sounds like the prompts are working. They mostly are. The piece you're reading is about the other 2.3%.

The load-bearing number isn't the rate. It's the repeat rate: 41.7% of rejections are the same model re-offending in the same session, after the runtime already told it no. The single worst case in the corpus: the model received six consecutive identical rejections — "you need to read this file before writing it" — and kept trying to write blind. Six rejections. Same file. Same reason. The model kept trying.

The implication is sharper than "walls fire." The implication is there is no in-context substitute for a wall. The model is not learning from the rejection any more than it learned from the prompt. The wall is what makes the failure mode irrelevant — because the action never happens.

Two more things the data says.

44% of all rejections were file_edit → tool_prerequisites_unmet — the agent trying to write a file it hadn't read. The dominant pathology in real logs isn't destruction. It's the agent writing blind. The destructive cases are dramatic; the blind-write cases are common. Both are walls problems.

Concentration: 1.4% of sessions hold all the rejections, and the top 3 sessions hold ~30% of them. This isn't a uniform agent-quality problem. It's a small number of sessions where the model gets stuck in a wrong-tool loop and stays there.

If prompting were the boundary, the repeat rate would be near zero. It is 42%.

The walls were doing the work. The prompts are somewhat decorative.

The "Don't be on call for your AI" wall checklist

Block the path, don't request it. If the constraint can be expressed as pattern matching on a tool call, it belongs in the tool wrapper, not the prompt.

Verify reality after every mutation. Directory write? ls. File edit? Re-read. Numeric claim in prose? Check against source. The model's belief about what just happened is not evidence.

Type your state transitions. If the agent is in a "review" state, the harness should reject anything that isn't comment, request_changes, or approve. Don't trust the model to stay in its lane.

One agent per path, enforced. Parallel agents on overlapping files is a typed dependency lock problem, not a "be careful" problem. (We'll be talking more about this in the next post.)

Reserve model judgment for meaning, not pattern. L3 is for "is this code change appropriate?" — not for "is this number in the source data?" Syntactic correctness is usually cheaper and easier to verify than semantic correctness — use the right tool for the right job.

The cheapest gate that works is the right gate

When an agent does something it shouldn't, the temptation is to fix it at the layer you noticed. Rounded a number — add a stronger prompt instruction. Dropped a migration — add a paragraph about migrations. Rewrote your tests — add a sentence about how rewriting tests is "very disappointing" with three exclamation marks.

Every one of those fixes is at L3. Every one is the most expensive, most probabilistic, least reliable layer in the stack.

The discipline: when you see a failure, ask what's the cheapest layer that could have prevented this? Pattern match on tool output? L0. Typed state transition? L2. L3 is reserved for cases where the constraint is fundamentally about meaning — "is this code change appropriate to the task?" — not about pattern.

Most "the agent ignored me" failures aren't about meaning. They're about pattern. Match the pattern.

"But I want the agent to be flexible."

Flexible within the gates. Not flexible about the gates. Same distinction your database makes — the schema isn't flexible so the queries can be.

Huge freedom inside hard walls.

I went back and wrapped summary submission with a bag-of-facts check that extracts every numeric claim and verifies it against tool output. The check fires regularly — usually on rounded percentages, occasionally on numbers pulled from the right column but the wrong row. I also put VCS mutation behind the harness entirely. The agent has bash. The agent does not have git commit, git push, git reset, or any other verb that mutates history. Read verbs route to typed tools. Write verbs route to nothing — only the outer harness can mechanically commit, and only can commit what passes strict code quality gates.

The agent has no sense of malice. It isn't lying to you. It just can't really do any better without support — OpenAI's own research on this is clear: models are trained and evaluated in ways that reward confident guessing over admitting uncertainty, which makes hallucination a structural property of the technology, not a bug that can be readily prompted away.

Prompts are a strategy for the common case. Walls are a strategy for the one that ends the company.

Go look at the safety constraints in your harness right now. The ones you've been telling yourself are real.

Count how many are sentences in a prompt.

Then ask which of them actually stops the agent — in the laws-not-suggestions sense — when the model decides the constraint is in tension with the task.

That number is your real safety surface.

What's next

Next: why worktrees aren't the sandbox you think they are. Worktrees solve collision. They don't solve catastrophe — and the difference between those two failure modes is where most agentic coding setups quietly leak risk.

If your agent has done something you explicitly told it not to — reply or DM me your story. What did you tell it? What did it do? What do you wish the harness had enforced? Trying to build a taxonomy.

Don't be on call for your AI.