bluecloud.dev

Evals Are Tests

The reluctance to call AI evaluation harnesses tests costs more than the renaming saves. Tests get reviewed, run in CI, and have an owner. Evals deserve the same treatment.

The reluctance to call AI evaluation harnesses tests costs more than the renaming saves.

When a team builds an eval harness for an LLM-powered feature, the harness usually ends up shaped like a test suite. It has fixtures, assertions, a runner, a way to compare expected output against actual, and a place where someone reads the failures and decides what to do. The only difference from a unit test is that the assertion is softer — did the output meet this rubric? rather than was the return value 42?

Calling that harness a test puts it where it belongs: in the engineering practice that already exists. It gets reviewed. It runs in CI. The team that breaks it knows they broke it. The team that adds a feature adds an eval the way they would add a test.

Calling it “evaluation infrastructure” or “the AI eval layer” puts it somewhere else. Now it lives in a separate quadrant on a slide. It has separate ownership. It runs on a separate cadence. The team that breaks it does not find out until the AI team runs their eval suite next Tuesday.

The renaming is not neutral. It is a hiding.

The lineage here is also a borrowing from manufacturing. Statistical process control did not invent quality assurance as a separate role — it embedded measurement in the line. The line that measures itself does not need a separate inspection team. The line that does not measure itself needs one.

An LLM-powered feature that does not have an eval harness running in CI is unmeasured. The team running it does not know if the feature is regressing. They will find out when a user reports something strange, which is the worst time to find out.

The cost of calling evals tests is that the team has to take them as seriously as tests. Which is the entire point.

  • quality
  • feedback-loops