Evals, not demos - the AI quality habit that separates products from prototypes

If your AI feature has no evaluations, it has no quality. A short, opinionated take on the discipline we apply to every AI build.

The single highest-leverage practice we've adopted across AI delivery is also the most under-sold: evals. Not demos. Not screenshots. Not “our model's really good now.” Evaluation suites, written down, runnable, and run.

What we mean

An eval is a programmatic check: given this input, is the AI feature producing an output we'd accept? It's a test, but the assertions are usually softer than a unit test's (does it cite the right document? is the answer factual against the source? does it hit the right schema?), and they are often produced or graded by another model or a human.
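As a minimal sketch of that idea: the `answer_question` function below is a hypothetical stand-in for the AI feature (a stub returning a canned response), and the `EvalCase` fields are illustrative names rather than a prescribed format. Each case pairs an input with properties an acceptable output should have, and each property is recorded as its own check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One input plus the properties an acceptable output must have."""
    question: str
    expected_source: str   # document the answer should cite
    required_phrase: str   # fact that must appear in the answer


def answer_question(question: str) -> dict:
    """Hypothetical stand-in for the AI feature under test (a canned stub)."""
    return {
        "answer": "Refunds are available within 30 days of purchase.",
        "source": "refund-policy.md",
    }


def run_case(case: EvalCase, feature: Callable[[str], dict]) -> dict:
    """Run the feature on one case and record each soft assertion separately."""
    output = feature(case.question)
    answer = output.get("answer", "")
    checks = {
        "cites_right_document": output.get("source") == case.expected_source,
        "states_expected_fact": case.required_phrase.lower() in answer.lower(),
    }
    return {"question": case.question, "checks": checks, "passed": all(checks.values())}


case = EvalCase(
    question="How long do I have to request a refund?",
    expected_source="refund-policy.md",
    required_phrase="30 days",
)
result = run_case(case, answer_question)
print(result["passed"], result["checks"])
```

A substring match is about the crudest grader available. A model-graded or human-graded check could slot into the same `checks` dictionary without changing the shape of the result.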

Evals don't exist to prove the model is perfect. They exist to make change safe. A new prompt, a new model version, a new retrieval setting: without evals, all of those are leaps of faith. With evals, they're changes you can ship with confidence.
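One way to picture "making change safe" is to compare per-case results before and after a change. The sketch below assumes each eval run has already been reduced to a mapping from case ID to pass or fail; the case IDs and results are invented for illustration.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed on the baseline but fail on the candidate."""
    return sorted(
        case_id
        for case_id, passed_before in baseline.items()
        if passed_before and not candidate.get(case_id, False)
    )


# Hypothetical per-case results from two eval runs (old prompt vs new prompt).
baseline_run = {"refund-window": True, "late-fee": True, "multi-currency": False}
candidate_run = {"refund-window": True, "late-fee": False, "multi-currency": True}

regressions = find_regressions(baseline_run, candidate_run)
pass_rate_before = sum(baseline_run.values()) / len(baseline_run)
pass_rate_after = sum(candidate_run.values()) / len(candidate_run)

print(regressions)  # ['late-fee']
print(pass_rate_before == pass_rate_after)  # True
```

In this made-up example the overall pass rate is identical across the two runs, yet one previously passing case now fails. That is the kind of quiet regression a demo would not surface.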

Why teams skip them

They're boring to build. A good eval suite is meticulous work: gold sets, edge cases, regression cases. The reward isn't a shippable demo; it's a number that quietly moves up over time.
They feel optional during the build. The feature appears to work. The model seems fine. It's only weeks later, when something quietly regresses, that the absence becomes a fire.
They require discipline the team hasn't built. Unlike a unit test, an eval needs ongoing curation. The first version is the easy part.

What to evaluate first

Start with the questions you'd ask in production if a user complained.

Did the model retrieve the right context?
Did it cite its source, and was the citation correct?
Did the answer match the source factually?
Did it follow the required output shape (JSON schema, structured fields)?
Did it stay inside the guardrails (no PII, no policy violations, no out-of-scope responses)?
Did the cost and latency stay within budget?
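Several of these questions can be answered by plain deterministic checks before any model-graded scoring is involved. The sketch below covers output shape, one narrow guardrail, and budget. The field names, the email regex and the budget limits are all assumptions chosen for illustration; the regex in particular is nowhere near a complete PII detector.

```python
import json
import re

# Illustrative only: catches simple email addresses, not PII in general.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Hypothetical output contract: field name -> expected type.
REQUIRED_FIELDS = {"answer": str, "source": str}


def check_output_shape(raw: str) -> bool:
    """True if raw parses as a JSON object with the required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(name), kind) for name, kind in REQUIRED_FIELDS.items())


def check_no_email(raw: str) -> bool:
    """True if the output contains nothing that looks like an email address."""
    return EMAIL_PATTERN.search(raw) is None


def check_within_budget(latency_s: float, cost: float,
                        max_latency_s: float = 3.0, max_cost: float = 0.01) -> bool:
    """True if one call stayed inside the (hypothetical) latency and cost limits."""
    return latency_s <= max_latency_s and cost <= max_cost


good = '{"answer": "Refunds are available within 30 days.", "source": "refund-policy.md"}'
bad = '{"answer": "Email jane.doe@example.com for a refund."}'

print(check_output_shape(good), check_no_email(good))  # True True
print(check_output_shape(bad), check_no_email(bad))    # False False
print(check_within_budget(latency_s=1.2, cost=0.004))  # True
```

Retrieval quality, citation correctness and factual agreement with the source usually need a labelled gold set or a grader, so they are left out of this sketch.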

“Evals don't replace human judgement. They replace end-of-week panic.”

What good looks like

A small gold set, hand-curated, that covers the most common inputs and the most painful edge cases.
A larger synthetic set, generated to cover input variations the gold set won't.
A scoring pipeline that produces a single composite number per run, and lets you drill into individual cases when it moves.
A CI gate that blocks deploys when the composite drops below an agreed threshold.
A regular cadence (weekly or monthly) for adding cases the team encounters in production.
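The scoring pipeline and CI gate from the list above might look roughly like this. It is a sketch under stated assumptions: the check names, the weights and the 0.85 threshold are hypothetical, and a weighted mean of pass rates is only one of several reasonable ways to build a composite.

```python
# Hypothetical weights and threshold; a real team would agree its own.
WEIGHTS = {"retrieval": 0.3, "citation": 0.2, "factuality": 0.3, "schema": 0.2}
THRESHOLD = 0.85


def composite_score(case_results: list[dict[str, bool]]) -> float:
    """Weighted mean of per-check pass rates across all cases."""
    total = 0.0
    for check, weight in WEIGHTS.items():
        pass_rate = sum(result[check] for result in case_results) / len(case_results)
        total += weight * pass_rate
    return total


def failing_cases(case_results: list[dict[str, bool]]) -> list[int]:
    """Indices of cases with at least one failed check, for drilling in."""
    return [i for i, result in enumerate(case_results) if not all(result.values())]


def gate(score: float, threshold: float = THRESHOLD) -> int:
    """Exit code for CI: 0 lets the deploy proceed, 1 blocks it."""
    return 0 if score >= threshold else 1


# Invented results for four cases.
results = [
    {"retrieval": True, "citation": True, "factuality": True, "schema": True},
    {"retrieval": True, "citation": False, "factuality": True, "schema": True},
    {"retrieval": True, "citation": True, "factuality": True, "schema": True},
    {"retrieval": False, "citation": True, "factuality": True, "schema": True},
]

score = composite_score(results)
exit_code = gate(score)
print(round(score, 3), failing_cases(results), exit_code)
```

In a CI job the script would typically finish by passing `exit_code` to `sys.exit`, so that a score below the threshold fails the pipeline step.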

The honest verdict

Most AI features you see in the wild don't have proper evals. You can tell because they regress unpredictably, change behaviour when models update, and require their team to be constantly available to firefight. The teams whose features feel reliable are almost always the teams who've invested in this work.

If you're commissioning an AI build, asking “how will we measure quality?” before “which model?” will tell you a lot about the team you're considering hiring.
