If your AI feature has no evaluations, it has no quality.
A short, opinionated take on the discipline we apply to every AI build.

The single highest-leverage practice we've adopted across AI delivery is also the most under-sold: evals. Not demos. Not screenshots. Not “our model's really good now.” Evaluation suites, written down, runnable, and run.
What we mean
An eval is a programmatic check: given this input, does the AI feature produce an output we'd accept? It's a test, but the assertions are usually softer than a unit test's (does it cite the right document? is the answer factual against the source? does it hit the right schema?), and they are often produced or graded by another model or a human.
Evals don't exist to prove the model is perfect. They exist to make change safe. A new prompt, a new model version, a new retrieval setting: without evals, all of those are leaps of faith. With evals, they're changes you can ship with confidence.
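To make the definition concrete, here is a minimal sketch of an eval as a programmatic check. Everything named in it is hypothetical: `EvalCase`, `run_evals`, the `fake_feature` stand-in for a real model call, and the `policy.md` document are illustrations, not part of any particular framework. The checks are plain string assertions; in practice some of them would be graded by another model or a human.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One input plus a check that decides whether the output is acceptable."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_evals(feature: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the feature and record pass/fail per case."""
    return {case.name: case.check(feature(case.prompt)) for case in cases}


def fake_feature(prompt: str) -> str:
    """Hypothetical stand-in for the AI feature (a model call in a real build)."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days [policy.md]"
    return "I'm not sure."


cases = [
    EvalCase("cites_policy_doc", "What is the refund window?",
             lambda out: "[policy.md]" in out),
    EvalCase("states_30_days", "What is the refund window?",
             lambda out: "30 days" in out),
    EvalCase("admits_unknown", "What is the CEO's shoe size?",
             lambda out: "not sure" in out.lower()),
]

results = run_evals(fake_feature, cases)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'} {name}")
```

Swap `fake_feature` for the real call and the same cases can be rerun after each prompt, model, or retrieval change, which is what makes the change safe to ship.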
Why teams skip them
- They're boring to build. A good eval suite is meticulous work: gold sets, edge cases, regression cases. The reward isn't a shippable demo; it's a number that quietly moves up over time.
- They feel optional during the build. The feature appears to work. The model seems fine. It's only weeks later, when something quietly regresses, that the absence becomes a fire.
- They require discipline the team hasn't built. Unlike a unit test, an eval needs ongoing curation. The first version is the easy part.
What to evaluate first
Start with the questions you'd ask in production if a user complained.
- Did the model retrieve the right context?
- Did it cite its source, and was the citation correct?
- Did the answer match the source factually?
- Did it follow the required output shape (JSON schema, structured fields)?
- Did it stay inside the guardrails (no PII, no policy violations, no out-of-scope responses)?
- Did the cost and latency stay within budget?
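Several of the questions above can be checked mechanically. The sketch below assumes a hypothetical response format (a JSON object with `answer` and `source` string fields), a deliberately crude email regex as the only PII guardrail, and a 2-second latency budget; none of these come from a specific project. Factual agreement with the source is left out because it usually needs a model grader or a human.

```python
import json
import re

# Crude email pattern used as a stand-in for a real PII detector (assumption).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_response(raw: str, allowed_sources: set[str], latency_s: float,
                   max_latency_s: float = 2.0) -> dict[str, bool]:
    """Return one pass/fail flag per production question for a single response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None

    # Output shape: a JSON object with string "answer" and "source" fields.
    shape_ok = (
        isinstance(data, dict)
        and isinstance(data.get("answer"), str)
        and isinstance(data.get("source"), str)
    )
    return {
        "valid_shape": shape_ok,
        # The citation must point at a document that was actually retrieved.
        "cites_retrieved_source": shape_ok and data["source"] in allowed_sources,
        # Guardrail: no email address anywhere in the raw output.
        "no_email_pii": EMAIL_RE.search(raw) is None,
        "within_latency_budget": latency_s <= max_latency_s,
    }


good = check_response(
    '{"answer": "Refunds within 30 days.", "source": "policy.md"}',
    allowed_sources={"policy.md"},
    latency_s=0.8,
)
bad = check_response(
    '{"answer": "Email bob@example.com", "source": "faq.md"}',
    allowed_sources={"policy.md"},
    latency_s=3.5,
)
print(good)
print(bad)
```

Returning one flag per question, rather than a single boolean, keeps the answer to "what went wrong?" visible when a case fails.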
“Evals don't replace human judgement. They replace end-of-week panic.”
What good looks like
- A small gold set, hand-curated, that covers the most common inputs and the most painful edge cases.
- A larger synthetic set, generated to cover input variations the gold set won't.
- A scoring pipeline that produces a single composite number per run, and lets you drill into individual cases when it moves.
- A CI gate that blocks deploys when the composite drops below an agreed threshold.
- A regular cadence (weekly or monthly) for adding cases the team encounters in production.
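The scoring pipeline and CI gate from the list above could look something like the following. This is a sketch under assumptions: the composite is a weighted pass rate, and the function names, case names, weights, and the 0.9 threshold are all hypothetical. Teams may reasonably combine scores differently.

```python
from typing import Optional


def composite_score(results: dict[str, bool],
                    weights: Optional[dict[str, float]] = None) -> float:
    """Weighted pass rate in [0, 1]; cases without an explicit weight count as 1.0."""
    weights = weights or {}
    total = sum(weights.get(name, 1.0) for name in results)
    if total == 0:
        return 0.0
    passed = sum(weights.get(name, 1.0) for name, ok in results.items() if ok)
    return passed / total


def failing_cases(results: dict[str, bool]) -> list[str]:
    """Names of failed cases, sorted, for drilling in when the number moves."""
    return sorted(name for name, ok in results.items() if not ok)


def gate(results: dict[str, bool], threshold: float,
         weights: Optional[dict[str, float]] = None) -> int:
    """Return a process exit code: 0 if the composite meets the threshold, else 1."""
    score = composite_score(results, weights)
    print(f"composite={score:.3f} threshold={threshold:.3f}")
    if score < threshold:
        for name in failing_cases(results):
            print(f"FAIL {name}")
        return 1
    return 0


# Hypothetical results from one eval run.
example = {
    "cites_policy_doc": True,
    "valid_json_shape": True,
    "hallucinated_citation": False,
    "within_latency_budget": True,
}
exit_code = gate(example, threshold=0.9)
```

In a CI job, the script would pass the return value of `gate` to `sys.exit`, so a composite below the agreed threshold fails the pipeline step and blocks the deploy.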
The honest verdict
Most AI features you see in the wild don't have proper evals. You can tell because they regress unpredictably, change behaviour when models update, and require their team to be constantly available to firefight. The teams whose features feel reliable are almost always the teams who've invested in this work.
If you're commissioning an AI build, asking “how will we measure quality?” before “which model?” will tell you a lot about the team you're considering hiring.


