AI Engineering Needs Audit, Not Faith
Useful AI in engineering keeps the same property useful tests have: a human can read what happened, where it went wrong, and decide what to do next. The argument is not that AI is dangerous. It is that anything shipping to production has to be inspectable when it fails.
Most introductions to AI in engineering start with capabilities. A new model handles longer context. A coding assistant ships pull requests. A retrieval system answers questions over a knowledge base. The pitch is always forward: what is now possible. This is the wrong starting point. The right starting point is the property useful tests have: a human can read what happened, where it went wrong, and decide what to do next. An AI system that ships to production is a piece of software like any other. Software that cannot be inspected after it fails is software that cannot be operated.
The discipline is not new. The same engineering tradition that produced unit tests, integration tests, logging, tracing, and post-incident review is the tradition AI engineering has to inherit. The vocabulary is slightly different — evaluation harness instead of test suite, prompt instead of input, eval instead of assertion — but the principles are identical. A system without tests is a system whose behavior is unknown. A system without observability is a system whose failures are mysteries. A system without permissions is a system whose damage is unbounded. None of this changes because the inner mechanism is a language model.
The audit question
Two questions decide whether an AI workflow is fit for production. What does it do? — the capability question that everyone asks. And what happens when it fails? — the question almost nobody asks until production demands the answer.
The first question can be answered with a demo. The second cannot. A demo shows the workflow succeeding on a few hand-picked inputs. Production exposes the workflow to inputs the demo never considered, at volumes the demo never reached, under conditions the demo never simulated. The difference between an AI feature that holds up and an AI feature that does not is whether the team built the infrastructure to answer the second question before the first stopped impressing anyone.
The second question breaks down into four. How does the failure manifest: silently, loudly, partially, as drift, or as a regression? Who notices: the user, the operator, a downstream system, or nobody? How is it diagnosed: through logs, through evals, through user reports, or through guessing? How is it fixed: by changing the prompt, the model, the data, or the system around the model? None of these questions has an answer if the audit infrastructure was not built.
An AI workflow that cannot be audited after the fact is a vendor demo, not a production system. The distinction is not academic. Production systems are operated by humans who need to know what is happening and why. Vendor demos are designed to impress humans who do not need to operate them.
Evals are tests
The most useful reframing for any team starting AI work is that evaluation harnesses are not separate “AI infrastructure.” They are tests. They run on a set of inputs, they produce a verdict against expected behavior, they catch regressions, they document what the system is supposed to do. Read this way, the investment justification becomes the same as for any other testing investment. The team is not building something exotic. The team is building the test suite for a piece of software whose internals happen to be opaque.
The structure of an eval is identical to the structure of a test. A defined input. A defined expected behavior, expressed either as a deterministic check or as a rubric a judge applies. A defined verdict — pass, fail, or some graduated score. A defined run cadence — every change, every deploy, every night, every quarter. The implementation differs from a unit test only in that the expected behavior may need a model to evaluate (an LLM-as-judge for open-ended outputs) and the input set may need to be larger to capture stochastic variance. Both are technical details. The discipline is the same.
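The structure described above can be sketched in a few lines. This is a minimal illustration, not a reference harness: the names EvalCase, run_evals, and fake_system are hypothetical, the checks are deterministic only, and a judge-based rubric would replace the check function with a call to a grading model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str                    # the defined input
    check: Callable[[str], bool]   # the defined expected behavior


def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the system and return a per-case verdict."""
    return {case.case_id: case.check(system(case.prompt)) for case in cases}


# Stand-in for the model call (assumption); a real harness would call the
# deployed workflow here.
def fake_system(prompt: str) -> str:
    return prompt.upper()


CASES = [
    EvalCase("shout-1", "hello", lambda out: out == "HELLO"),
    EvalCase("non-empty", "x", lambda out: len(out) > 0),
    EvalCase("mentions-refund", "policy", lambda out: "REFUND" in out),
]

verdicts = run_evals(fake_system, CASES)
for case_id, passed in verdicts.items():
    print(case_id, "pass" if passed else "FAIL")
```

The run cadence is whatever invokes run_evals: a CI job on every change, a deploy hook, or a nightly schedule.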
What makes an eval suite earn its place is the same property that makes a test suite earn its place. Catching regressions cheaply. Drawing a line between what the system is supposed to do and what it actually does. Surfacing failures in a form a developer can act on. An eval that returns a single average score across a thousand cases is not actionable. An eval that flags the seven specific cases where behavior changed is. The diff matters more than the aggregate.
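A small sketch shows why the diff carries more information than the aggregate. The function names and the verdict dictionaries are made up for illustration; the two runs below have the same pass rate, yet two cases changed behavior.

```python
from typing import Optional


def pass_rate(verdicts: dict[str, bool]) -> float:
    """The aggregate view: one number for the whole run."""
    return sum(verdicts.values()) / len(verdicts)


def changed_cases(
    baseline: dict[str, bool], current: dict[str, bool]
) -> dict[str, tuple[Optional[bool], Optional[bool]]]:
    """The diff view: only the cases whose verdict differs between two runs."""
    changes = {}
    for case_id in sorted(baseline.keys() | current.keys()):
        before = baseline.get(case_id)  # None means the case was absent from that run
        after = current.get(case_id)
        if before != after:
            changes[case_id] = (before, after)
    return changes


# Hypothetical verdicts from two runs of the same eval suite.
baseline = {"c1": True, "c2": True, "c3": False, "c4": True}
current = {"c1": True, "c2": False, "c3": True, "c4": True}

print(pass_rate(baseline), pass_rate(current))  # identical aggregates
print(changed_cases(baseline, current))         # c2 regressed, c3 was fixed
```

A reviewer reading only the pass rate would see no change; a reviewer reading the diff sees one regression and one fix.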
Permission boundaries are poka-yoke
AI workflows that take action — write code, call APIs, modify databases, send communications — have a much wider damage surface than AI workflows that only return text. The damage surface is bounded by permissions, and permissions are poka-yoke.
The principle is the manufacturing one. A line is safer when the worker cannot make the mistake, not when the worker is asked to remember not to. The same applies to AI tool use. An agent that has read-only database access cannot drop a table. An agent whose code-modification scope is confined to a feature branch cannot push to main. An agent that can only send messages to a sandboxed test channel cannot email a customer. None of these constraints depend on the model behaving correctly. They depend on the system around the model being designed correctly.
Permissioned tool calls are the structural pattern that makes this enforceable. A tool exposed to an agent has a defined schema, a defined scope, and a defined audit trail. Calls outside that schema fail. Calls outside the scope fail. Calls inside the scope are logged. The agent does not need to be trusted with the full surface of the underlying system. It is trusted with exactly the surface a human reviewer would have approved for it.
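One way to sketch the pattern is a wrapper that checks schema and scope before the underlying function ever runs. Everything here is an assumption made for illustration: the PermissionedTool class, the send_message tool, and the channel names are hypothetical, and the choice to log rejected calls as well as allowed ones goes slightly beyond what the text requires.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised when a call falls outside the tool's schema or scope."""


@dataclass
class PermissionedTool:
    name: str
    handler: Callable[..., Any]   # the real action, reached only after the checks
    schema: dict[str, type]       # argument name -> required type
    scope_arg: str                # the argument the scope restriction applies to
    allowed_values: frozenset     # the values a human reviewer approved
    audit_log: list = field(default_factory=list)

    def call(self, **args: Any) -> Any:
        # Schema check: exactly the declared arguments, each of the declared type.
        if set(args) != set(self.schema) or not all(
            isinstance(args[key], kind) for key, kind in self.schema.items()
        ):
            self._reject(args, "schema")
        # Scope check: the target must be on the approved list.
        if args[self.scope_arg] not in self.allowed_values:
            self._reject(args, "scope")
        self.audit_log.append({"tool": self.name, "args": args, "outcome": "allowed"})
        return self.handler(**args)

    def _reject(self, args: dict, reason: str) -> None:
        self.audit_log.append(
            {"tool": self.name, "args": args, "outcome": f"rejected:{reason}"}
        )
        raise ToolCallRejected(f"{self.name}: call outside {reason}")


sent = []  # stand-in for a real messaging backend
tool = PermissionedTool(
    name="send_message",
    handler=lambda channel, text: sent.append((channel, text)),
    schema={"channel": str, "text": str},
    scope_arg="channel",
    allowed_values=frozenset({"#sandbox-test"}),
)

tool.call(channel="#sandbox-test", text="hi")
try:
    tool.call(channel="customer@example.com", text="hi")
except ToolCallRejected as err:
    print(err)
```

The second call never reaches the handler, regardless of what the model intended. The constraint lives in the wrapper, not in the prompt.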
Teams that skip this step end up with workflows that work in development and produce expensive incidents in production. The mistake is treating the model as if it were a careful engineer. The model is not a careful engineer. The model is a probability distribution over next tokens. The job of the surrounding system is to make sure that even the least likely of those tokens cannot do harm.
The Problem → Prompt → Eval → Review → Ship loop
The discipline has five stages. Each produces an artifact the next consumes.
Problem. Stated in user-facing terms, with the failure mode that matters and the acceptance criteria. Not “improve the summary feature.” Rather, “summaries miss the lead in 18% of investor-call transcripts; reduce to under 5% without increasing length over a defined threshold.” A problem stated this way is a problem that has a measurable answer.
Prompt. The system instruction, retrieval setup, tool schemas, and few-shot examples that constitute the configuration the model receives. Treat it as a configuration artifact, not as a string. Version it. Diff it. Review changes the same way code is reviewed.
Eval. The dataset and the verdict logic that decide whether the prompt meets the problem’s acceptance criteria. The dataset is curated, not improvised, and grows with every failure encountered in production. The verdict logic is reproducible — a human and a machine should reach the same verdict on the same input.
Review. Human inspection of the eval results, the prompt diff, and the dataset diff. Not every change requires the same review depth, but every change going to production crosses a review threshold. Reviews focus on the dimensions an eval cannot judge — tone, safety, brand register, edge cases that earn future eval coverage.
Ship. Deploy the new prompt with rollback ready. Sample production traffic into the eval set. Watch the live metrics that correlate with the offline eval verdict. If the correlation breaks, the eval set is not capturing the real distribution; expand it.
Production traffic feeds back into the eval set, the eval set guards the next change, and the next change ships through the same five stages. Teams that compress this into “tweak the prompt and see if it looks better” end up with a system whose behavior nobody can characterize.
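The Prompt stage's advice to version and diff the configuration can be sketched with the standard library alone. The PromptConfig fields and the example values are assumptions for illustration; a real configuration would also carry retrieval settings and full tool schemas.

```python
import difflib
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptConfig:
    system_instruction: str
    tool_names: tuple[str, ...]
    few_shot_examples: tuple[tuple[str, str], ...]

    def canonical(self) -> str:
        """Stable text form, so identical configs always serialize identically."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def version(self) -> str:
        """Short content hash that can be logged next to every model call."""
        return hashlib.sha256(self.canonical().encode("utf-8")).hexdigest()[:12]


def diff_configs(old: PromptConfig, new: PromptConfig) -> list[str]:
    """Unified diff between two configs, labeled with their version hashes."""
    return list(
        difflib.unified_diff(
            old.canonical().splitlines(),
            new.canonical().splitlines(),
            fromfile=old.version(),
            tofile=new.version(),
            lineterm="",
        )
    )


v1 = PromptConfig(
    system_instruction="Summarize the call.",
    tool_names=("fetch_transcript",),
    few_shot_examples=(("transcript A", "summary A"),),
)
v2 = PromptConfig(
    system_instruction="Summarize the call. Lead with guidance changes.",
    tool_names=("fetch_transcript",),
    few_shot_examples=(("transcript A", "summary A"),),
)

print("\n".join(diff_configs(v1, v2)))
```

Because the version is derived from content, two deployments with the same hash are guaranteed to carry the same configuration, which is what makes the hash worth logging.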
Output diffing as regression detection
The most cost-effective production safeguard against silent regression is output diffing. The principle is the same as snapshot testing in conventional software. A defined set of inputs is run against the production system. The outputs are compared to a baseline captured at a known-good point. Differences are flagged for human review.
Diffing works because most useful AI work has long stretches of stable behavior interrupted by occasional drift. Drift is the failure mode that traditional alerting misses: no error, no exception, no metric crossing a threshold — just outputs that, today, are subtly different from yesterday’s outputs in a way that matters. A diff that surfaces the change while it is still small is the difference between a five-minute review and a three-week incident response.
The implementation is mundane. A representative sample of inputs is replayed on a cadence: daily, hourly, whatever the change rate justifies. Outputs are stored. Each run is compared against the stored baseline, not only against the previous run, so that slow drift cannot hide in a series of small steps. Tooling presents the diffs to a reviewer in a form that surfaces what changed and not just whether something changed. The discipline turns the model from a black box into something with a Git-style history.
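A minimal sketch of the comparison step, assuming outputs are stored as plain strings keyed by an input identifier. The function name, the identifiers, and the sample outputs are hypothetical, and a word-level diff is only one of several reasonable ways to present a change.

```python
import difflib


def diff_outputs(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Return a word-level diff for every replayed input whose output changed."""
    report = {}
    for input_id, old in baseline.items():
        new = current.get(input_id, "")  # a missing output counts as a change
        if new != old:
            report[input_id] = [
                token
                for token in difflib.ndiff(old.split(), new.split())
                if token.startswith(("- ", "+ "))  # keep removals and additions only
            ]
    return report


# Hypothetical stored outputs: a known-good baseline and today's replay.
baseline = {
    "q1": "Revenue grew 4% year over year.",
    "q2": "The refund window is 30 days.",
}
current = {
    "q1": "Revenue grew 14% year over year.",
    "q2": "The refund window is 30 days.",
}

for input_id, changes in diff_outputs(baseline, current).items():
    print(input_id, changes)
```

Exact string comparison is the strictest possible check. For workflows whose outputs legitimately vary from run to run, the equality test would need to be replaced with a similarity threshold or a judge, which is a design decision this sketch does not make.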
Logs that get read
Observability in AI systems has the same failure mode it has in conventional systems: teams log everything and read none of it. The same remedy applies. Logs that get read are logs that answer the questions an operator is going to ask during an incident. Logs that nobody reads are a storage bill with no upside.
The questions that matter during an AI incident are predictable. Which prompt was active. Which model version. What retrieval context was returned. What tools were called and with what arguments. What the user actually saw. Each of these has to be reconstructible from the logs, with the same input identifier joining them. A trace ID propagated through every step is the cheap way to do this. The expensive way is to discover, mid-incident, that the data is in three places that cannot be correlated.
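Trace-ID propagation can be sketched as structured log lines that all carry the same identifier. The field names (prompt_version, model_version, doc_ids, and so on) and the in-memory sink are assumptions for illustration; a production system would write to its own logging pipeline.

```python
import json
import uuid


def log_step(sink: list[str], trace_id: str, step: str, **fields) -> None:
    """Append one JSON line; every record carries the trace ID."""
    sink.append(json.dumps({"trace_id": trace_id, "step": step, **fields}))


def reconstruct(sink: list[str], trace_id: str) -> list[dict]:
    """Return every record for one interaction, in the order it was logged."""
    return [rec for rec in map(json.loads, sink) if rec["trace_id"] == trace_id]


sink: list[str] = []  # stand-in for a log file or log pipeline
trace_a, trace_b = uuid.uuid4().hex, uuid.uuid4().hex

# Two interactions interleaved, as they would be under concurrent traffic.
log_step(sink, trace_a, "prompt", prompt_version="p-41", model_version="m-2")
log_step(sink, trace_b, "prompt", prompt_version="p-41", model_version="m-2")
log_step(sink, trace_a, "retrieval", doc_ids=["kb-7", "kb-19"])
log_step(sink, trace_a, "tool_call", tool="lookup_order", args={"order_id": "A1"})
log_step(sink, trace_b, "response", shown_to_user="Sorry, I could not find that.")
log_step(sink, trace_a, "response", shown_to_user="Your order shipped Tuesday.")

timeline = reconstruct(sink, trace_a)
for record in timeline:
    print(record["step"], {k: v for k, v in record.items() if k not in ("trace_id", "step")})
```

Filtering on one identifier answers all five incident questions for a single interaction without joining across systems.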
The corollary is that AI systems should expose a “last hundred calls” view to the team that operates them. Not a dashboard of averages — averages hide outliers. A sampled feed of real interactions, in reading order, with everything that contributed to each one. Most production AI issues are diagnosed by reading ten of these, not by querying a dashboard.
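The simplest form of such a view is a bounded buffer that keeps the most recent interactions in order. This sketch retains the latest calls rather than sampling them, which is a simplifying assumption, and the RecentCalls class and the record fields are hypothetical.

```python
from collections import deque


class RecentCalls:
    """Keeps the most recent interactions, oldest first, up to a fixed capacity."""

    def __init__(self, capacity: int = 100) -> None:
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._calls: deque = deque(maxlen=capacity)

    def record(self, interaction: dict) -> None:
        self._calls.append(interaction)

    def feed(self) -> list[dict]:
        """Return the retained interactions in reading order."""
        return list(self._calls)


recent = RecentCalls(capacity=100)
for n in range(250):
    # Each record would hold everything that contributed to the interaction.
    recent.record({"n": n, "prompt_version": "p-41", "output": f"response {n}"})

feed = recent.feed()
print(len(feed), feed[0]["n"], feed[-1]["n"])
```

At high traffic volumes, recording every Nth call instead of every call would turn the same buffer into a sampled feed.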
The cultural artifact
The discipline assumes a culture that treats AI work as engineering rather than as demos. Cultures that reward demos punish anyone who slows them down to build evals, permission boundaries, and logging. Cultures that treat AI features as products with operational obligations reward that same work. The difference is not the model. The difference is the reward structure.
Teams that get this right ship AI features more slowly at first and accelerate over time, because each feature builds eval infrastructure the next feature reuses. Teams that get it wrong ship faster at first and stall, because each new feature lands on top of a system nobody can characterize. The pattern is identical to the trajectory of teams adopting CI, observability, or any other production-engineering practice. The technologies are negotiable. The discipline is not.
What it adds up to
An AI workflow that ships to production is software. Software needs tests — evals. Software with action surface needs permissions — bounded tool scopes. Software whose behavior can drift needs regression detection — output diffing. Software that fails in production needs logs that get read. None of this is novel. All of it is the same body of practice software engineering has accumulated since CI became standard.
The reframing that helps is to treat AI engineering as a special case of the existing discipline, not as a new discipline. The model is opaque, the inputs are noisy, the outputs are stochastic — and none of that exempts the system from the same operational obligations as the rest of the production estate. Tests, permissions, observability, review. The vocabulary changes. The discipline does not.
An AI workflow that cannot be audited after the fact is a vendor demo, not a production system. Anything else is faith — and faith is not an engineering practice.