When AI earns its place, and when it doesn't

Most AI features fail in production not because the model was wrong, but because the surrounding system was unbuilt. A short field guide to noticing the difference, and what to do about it.


We've now built AI features into enough products to spot the pattern reliably. The teams whose AI launches succeed and the teams whose AI launches stall don't differ in their model choice. They differ in everything around the model.

The wrong question

Most AI strategy conversations open with “which model should we use?” That's almost never the question worth starting with. Models change every few months. The decisions that matter more, and that decay more slowly, are about evaluation, retrieval, observability, and the boundary between the model and everything else.

The questions worth asking

How will we measure quality? If you don't have evals before launch, you'll be debugging in production via Slack screenshots.
Where does the context come from? Most interesting AI features are retrieval problems wearing model clothing. In real-world use, retrieval quality will usually matter more than model quality.
What can we see when something goes wrong? Logs, traces, replay. If a user complains, you need a reproducible record of what the system saw, what it sent, what came back.
Where does the model's authority stop? The model proposes; a deterministic layer disposes. That matters most for anything that touches user data, money, or business state.
How do we swap the model? A new model release shouldn't be a leap of faith. With proper evals, it's a controlled change.
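
To make "the model proposes, a deterministic layer disposes" concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the ProposedRefund shape, the MAX_AUTO_REFUND_PENCE limit and the dispose function are hypothetical names, not part of any particular product. The point is only that the model's output is treated as a proposal, and plain code with no model in it decides what happens next.

```python
from dataclasses import dataclass


# Hypothetical shape for an action the model is allowed to propose.
@dataclass(frozen=True)
class ProposedRefund:
    order_id: str
    amount_pence: int
    reason: str


# Assumed policy limit, chosen for illustration only.
MAX_AUTO_REFUND_PENCE = 5_000


def dispose(proposal: ProposedRefund, order_totals: dict[str, int]) -> str:
    """Deterministic gate: returns 'approve', 'review' or 'reject'."""
    total = order_totals.get(proposal.order_id)
    if total is None:
        # The model referred to an order the system of record does not have.
        return "reject"
    if proposal.amount_pence <= 0 or proposal.amount_pence > total:
        # An amount that cannot be valid for this order.
        return "reject"
    if proposal.amount_pence > MAX_AUTO_REFUND_PENCE:
        # Plausible, but above the limit for automatic approval.
        return "review"
    return "approve"
```

The gate only ever reads the system of record; it never asks the model whether its own proposal is acceptable.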

The work that's usually missing

On every AI engagement we've run, the same parts of the system are the ones the original team hadn't prioritised. Evals. Observability for cost, latency, and behaviour. Prompt regression suites. A clear separation between model-generated content and system-of-record state. None of this is exotic. None of it is hard to learn. It's just missing in the parts of the industry that are still working out what production AI looks like.
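
As a rough illustration of how small the observability piece can start, the sketch below wraps a model call and writes one replayable record per call. The names (CallRecord, traced_call, the sink list) are hypothetical, and the model is passed in as a plain function so that no particular vendor SDK is assumed. Token counts and cost would be further fields on the same record.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class CallRecord:
    call_id: str
    model: str
    prompt: str             # what the system sent
    context_ids: list[str]  # which retrieved documents the prompt was built from
    response: str           # what came back
    latency_ms: float


def traced_call(
    model_name: str,
    model_fn: Callable[[str], str],
    prompt: str,
    context_ids: list[str],
    sink: list[str],
) -> str:
    """Call the model and append a JSON record of the call to `sink`."""
    start = time.perf_counter()
    response = model_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    record = CallRecord(
        call_id=str(uuid.uuid4()),
        model=model_name,
        prompt=prompt,
        context_ids=list(context_ids),
        response=response,
        latency_ms=latency_ms,
    )
    # In a real system the sink would be a log pipeline or a table, not a list.
    sink.append(json.dumps(asdict(record)))
    return response
```

With records like these, a user complaint can be traced to a specific call and replayed against the same prompt and context.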

“The difference between a working AI product and a clever prototype isn't the model. It's the rest of the iceberg.”

What “earning its place” looks like

AI earns its place in a system when it does work the rest of the system can't reasonably do, and when its outputs flow back into a structured layer the rest of the system can act on. Free text is dangerous; typed extraction is useful. A black-box prediction is suspicious. A prediction with confidence, provenance, and a deterministic safety check is operable.
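
One way to picture typed extraction with confidence, provenance and a safety check is the sketch below. It assumes, purely for illustration, that the model has been asked to return JSON with four fields; InvoiceTotal, parse_extraction, the currency list and the 0.8 threshold are all hypothetical. Anything that fails a check is discarded rather than passed downstream.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InvoiceTotal:
    amount_pence: int
    currency: str
    confidence: float  # model-reported, expected between 0.0 and 1.0
    source_span: str   # provenance: the text the value was read from


ALLOWED_CURRENCIES = {"GBP", "EUR", "USD"}  # assumed for illustration
MIN_CONFIDENCE = 0.8                        # assumed threshold


def parse_extraction(raw: str, document_text: str) -> Optional[InvoiceTotal]:
    """Turn raw model output into a typed value, or None if any check fails."""
    try:
        data = json.loads(raw)
        result = InvoiceTotal(
            amount_pence=int(data["amount_pence"]),
            currency=str(data["currency"]),
            confidence=float(data["confidence"]),
            source_span=str(data["source_span"]),
        )
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, missing fields, or wrong types
    if result.currency not in ALLOWED_CURRENCIES or result.amount_pence < 0:
        return None
    if result.confidence < MIN_CONFIDENCE:
        return None
    # Provenance check: the quoted span must actually occur in the document.
    if not result.source_span or result.source_span not in document_text:
        return None
    return result
```

The provenance check is deliberately crude, but it catches the case where the model quotes text that was never in the document.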

If you can articulate what your AI feature is doing in terms of “turning shape X into shape Y, with the following error modes,” you're in good shape. If the answer is “making things feel smart,” the surface area is still too vague to build well.

One last test

Ask whoever's proposing the feature, “how would we know this regressed?” If the answer is anything other than “we run evaluations and the scores would drop,” the feature isn't ready to ship, even if it currently looks fine.
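
A hedged sketch of what "we run evaluations and they would drop" can mean at its smallest: a fixed set of labelled cases, a score, and a comparison against the last known-good score. The golden cases, the two stand-in classifiers and the 0.02 tolerance are invented for illustration; in practice the classifiers would call whichever model is under test.

```python
from typing import Callable

# Hypothetical golden set: (input text, expected label).
GOLDEN_CASES = [
    ("Refund for order A1 please", "refund"),
    ("Where is my parcel?", "tracking"),
    ("Cancel my subscription", "cancellation"),
    ("I want my money back", "refund"),
]


def accuracy(classify: Callable[[str], str]) -> float:
    """Fraction of golden cases the classifier labels correctly."""
    hits = sum(1 for text, expected in GOLDEN_CASES if classify(text) == expected)
    return hits / len(GOLDEN_CASES)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True if the candidate scores lower than the baseline by more than `tolerance`."""
    return candidate < baseline - tolerance


# Stand-ins for "the current model" and "the proposed change".
def baseline_model(text: str) -> str:
    t = text.lower()
    if "refund" in t or "money back" in t:
        return "refund"
    if "parcel" in t:
        return "tracking"
    if "cancel" in t:
        return "cancellation"
    return "other"


def candidate_model(text: str) -> str:
    t = text.lower()
    if "refund" in t:
        return "refund"
    if "parcel" in t:
        return "tracking"
    return "other"


if __name__ == "__main__":
    base, cand = accuracy(baseline_model), accuracy(candidate_model)
    print(f"baseline={base:.2f} candidate={cand:.2f} regressed={has_regressed(base, cand)}")
```

Run in CI, a check like this turns a model or prompt change into a pass or fail before it ships.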

That's most of the discipline. There's craft inside the details, but the high-level shape is unspectacular: define what good looks like, measure it, and build the surrounding system that the model needs to do its job. AI earns its place when those things exist. It doesn't when they don't.
