
Your Guardrails Don't Govern Anything

Content filters, prompt shields and output classifiers are useful tools, and the engineering behind them is serious work I do not want to dismiss. The word “guardrails”, though, keeps getting used as if it settled the governance question, and it does not. Guardrails operate at the model boundary, answer model-boundary questions, and produce model-boundary evidence. Governance has a different job, in a different place, with different evidence.

Two Boundaries, Two Problems

Guardrails operate at the model boundary, which is the surface where the model produces text: the prompt going in, the completion coming out. Their job is to catch toxic content, jailbreak attempts, off-topic responses and hallucinated facts. That work is necessary, and nothing that follows argues otherwise.

Governance operates at the action boundary, which is the surface where the system takes a consequential action: the tool invocation, the API call, the state change. The question is different too: was this specific action authorised under a specific policy, by a specific delegation, at this specific time? And, separately, can you prove it?

These are not competing answers to the same question. They are different controls at different points along the path from intent to consequence.

The Handoff Is Where Guardrails Stop

It helps, I think, to walk through what happens between the two boundaries in any autonomous system. The model produces text: a draft message, a tool-call payload, a structured query. A content filter examines that text and looks for toxic language, jailbreak artefacts, hallucinated facts and off-topic drift. Whatever passes, passes. The system then invokes a tool, or sends a message, or writes to a database, using the payload the model produced. The act of invocation is no longer in the filter’s path.

The filter evaluated the words the model produced. It did not, and could not, evaluate whether sending those words to an external recipient, or executing them as a query, is something this system is authorised to do on behalf of whoever instructed it. The filter is doing the job it was built for. The category error is in expecting it to authorise an action. It cannot, not because of poor implementation, but because it is operating at the wrong boundary.
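A minimal sketch of that gap, in Python. Everything in it is hypothetical: the `content_filter` blocklist, the `Delegation` shape and the `is_authorised` check are illustrative stand-ins, not any product's API. The only point is that the same payload can pass one check and fail the other.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a content filter: it only ever sees text.
BLOCKED_TERMS = {"ignore previous instructions", "idiot"}

def content_filter(text: str) -> bool:
    """Return True if the text passes (contains no blocked term)."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

@dataclass(frozen=True)
class Delegation:
    """Hypothetical grant: what a principal allowed the system to do."""
    principal: str
    allowed_tools: frozenset
    allowed_domains: frozenset

@dataclass(frozen=True)
class Action:
    tool: str
    recipient: str
    body: str

def is_authorised(action: Action, delegation: Delegation) -> bool:
    """Check the action, not the wording: which tool, and to which domain."""
    domain = action.recipient.rsplit("@", 1)[-1]
    return (action.tool in delegation.allowed_tools
            and domain in delegation.allowed_domains)

delegation = Delegation("alice", frozenset({"send_email"}), frozenset({"example.com"}))
action = Action("send_email", "partner@external.test",
                "Please find the Q3 forecast attached.")

text_ok = content_filter(action.body)          # the wording is harmless
action_ok = is_authorised(action, delegation)  # the recipient is outside the grant
```

Here the message body is perfectly polite, so the filter passes it, while the send itself falls outside what the delegation permits. No amount of tuning the blocklist changes the second result, because the filter never sees the recipient or the grant.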

Content Safety Is Not Action Authority

Content safety and action authority are doing different things at different points in the execution path. Content safety evaluates text at the model boundary: the prompt going in and the completion coming out. Action authority evaluates an action the system is about to take; it runs at the action boundary, on the canonical representation of that action, and only after intent has been resolved.

The deeper difference is in what each layer produces, and that output is what an audit eventually reaches for. Content safety produces a pass or fail on the model’s output. Action authority produces an explicit decision (ALLOW, DENY or ESCALATE) on a specific action, along with a tamper-evident record of the policy and delegation under which that decision was made. Treating guardrails as governance is, in that light, like treating spell-check as legal review: both examine documents, but only one establishes whether what the document authorises is legitimate.
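To make the contrast concrete, here is a sketch of what an action-authority check might return. The names (`Decision`, `canonical`, `decide`), the policy and delegation fields, and the escalation threshold are all assumptions for illustration; a real system would define its own policy language and its own canonical form.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class Decision(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    ESCALATE = "ESCALATE"

def canonical(action: dict) -> str:
    """One assumed canonical form: JSON with sorted keys and no extra whitespace."""
    return json.dumps(action, sort_keys=True, separators=(",", ":"))

@dataclass(frozen=True)
class DecisionRecord:
    decision: Decision
    action_digest: str   # SHA-256 of the canonical action
    policy_id: str
    delegation_id: str
    decided_at: str      # ISO 8601 timestamp

def decide(action: dict, policy: dict, delegation: dict, now: datetime) -> DecisionRecord:
    """Evaluate one action against one policy and one delegation at one time."""
    if action["tool"] not in delegation["tools"] or now >= delegation["expires"]:
        decision = Decision.DENY
    elif action.get("amount", 0) > policy["escalate_above"]:
        decision = Decision.ESCALATE   # within the grant, but needs a human
    else:
        decision = Decision.ALLOW
    digest = hashlib.sha256(canonical(action).encode("utf-8")).hexdigest()
    return DecisionRecord(decision, digest, policy["id"], delegation["id"],
                          now.isoformat())

# Hypothetical policy and delegation, made up for this example.
policy = {"id": "pol-7", "escalate_above": 1000}
delegation = {"id": "del-42", "tools": {"issue_refund"},
              "expires": datetime(2030, 1, 1, tzinfo=timezone.utc)}
now = datetime(2025, 6, 1, tzinfo=timezone.utc)

small = decide({"tool": "issue_refund", "amount": 50}, policy, delegation, now)
large = decide({"tool": "issue_refund", "amount": 5000}, policy, delegation, now)
other = decide({"tool": "delete_account"}, policy, delegation, now)
```

The three calls come back as ALLOW, ESCALATE and DENY respectively. What matters is less the verdict than the record around it: each decision names the policy, the delegation and the time, and is bound to a digest of the exact action it covered.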

Auditors Ask About Action Authority

When a regulator investigates an incident involving an autonomous AI system, they will not stop at whether the model’s output passed a content filter. The question will be about the action: whether it was authorised before it executed, under what policy, through which delegation chain, and whether the organisation can produce a tamper-evident record of the decision.

Guardrails cannot answer that question, not because they are poorly implemented but because they were never designed to. They operate at the model boundary, on text. They produce a record of what was said, not a record of what was authorised. The system’s actions, and the question of whether those actions are permitted, sit outside the surface a content filter can see.
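One common way to make a decision log tamper-evident is a hash chain, in which each entry commits to the one before it. The sketch below is an assumption about how such a record could be kept, not a description of any particular product; `append_entry` and `verify_chain` are hypothetical names, and the payload fields are made up.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry

def _entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_entry(log: list, payload: dict) -> None:
    """Append a decision record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload,
                "hash": _entry_hash(prev_hash, payload)})

def verify_chain(log: list) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"action": "send_email", "decision": "ALLOW", "policy": "pol-7"})
append_entry(log, {"action": "issue_refund", "decision": "DENY", "policy": "pol-7"})
intact = verify_chain(log)

log[1]["payload"]["decision"] = "ALLOW"   # rewrite history after the fact
tampered_detected = not verify_chain(log)
```

A chain like this detects changes; it does not prevent them. It also cannot detect entries silently dropped from the end unless the latest hash is anchored somewhere the log's writer cannot alter.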

The Line Sits At The Action Boundary

The short version is that guardrails are content safety and governance is action authority. One evaluates what the model says, the other evaluates what the system does, and they answer different questions for different audiences. Confusing them is how organisations end up with comprehensive content filtering and effectively zero governance over the consequential actions their autonomous AI systems execute every day.

Post-execution controls can explain what happened. They cannot turn an unauthorised action into a governed one after the fact.