Most AI features fail in production not because the model was wrong, but because the surrounding system was unbuilt. A short field guide to noticing the difference, and what to do about it.

We've now built AI features into enough products to spot the pattern reliably. The teams whose AI launches succeed and the teams whose AI launches stall don't differ in their model choice. They differ in everything around the model.
The wrong question
Most AI strategy conversations open with “which model should we use?” That's almost never the question worth starting with. Models change every few months. The decisions that matter more, and that decay slower, are about evaluation, retrieval, observability, and the boundary between the model and everything else.
The questions worth asking
- How will we measure quality? If you don't have evals before launch, you'll be debugging in production via Slack screenshots.
- Where does the context come from? Most interesting AI features are retrieval problems wearing model clothing. In real-world performance, the quality of the retrieval usually outweighs the quality of the model.
- What can we see when something goes wrong? Logs, traces, replay. If a user complains, you need a reproducible record of what the system saw, what it sent, what came back.
- Where does the model's authority stop? The model proposes, a deterministic layer disposes. Especially for anything that touches user data, money, or business state.
- How do we swap the model? A new model release shouldn't be a leap of faith. With proper evals, it's a controlled change.
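To make the observability question concrete, here is a minimal sketch of the "reproducible record" idea: one structure per model call that captures what the system saw, what it sent, and what came back. The names `TraceRecord` and `traced_call`, the `call_model` callable, and the model identifier are all hypothetical, chosen for illustration rather than taken from any particular library.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class TraceRecord:
    """One model call, captured with enough detail to replay it later."""
    request_id: str
    model: str                    # hypothetical model identifier
    retrieved_context: list[str]  # what the system saw
    prompt: str                   # what it sent
    response: str                 # what came back
    latency_ms: float
    timestamp: float = field(default_factory=time.time)

    def prompt_hash(self) -> str:
        # Short, stable fingerprint for grouping identical prompts.
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]

    def to_json(self) -> str:
        # Serialised form suitable for a log line or a replay store.
        return json.dumps(asdict(self), sort_keys=True)


def traced_call(
    request_id: str,
    model: str,
    context: list[str],
    prompt: str,
    call_model: Callable[[str], str],
) -> tuple[str, TraceRecord]:
    """Run call_model(prompt) and return the response with its trace."""
    start = time.perf_counter()
    response = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    record = TraceRecord(request_id, model, context, prompt, response, latency_ms)
    return response, record


# Usage with a stub in place of a real model client (an assumption).
response, record = traced_call(
    "req-1", "example-model", ["doc A"], "Summarise doc A",
    call_model=lambda p: "stub: " + p,
)
print(record.to_json())
```

Storing the retrieved context next to the prompt is the design choice that matters here: when a user complains, the record shows whether retrieval or the model was at fault.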
The work that's usually missing
On every AI engagement we've run, the same parts of the system are the ones the original team hadn't prioritised. Evals. Observability for cost, latency, and behaviour. Prompt regression suites. A clear separation between model-generated content and system-of-record state. None of this is exotic. None of it is hard to learn. It's just missing in the parts of the industry that are still working out what production AI looks like.
“The difference between a working AI product and a clever prototype isn't the model. It's the rest of the iceberg.”
What “earning its place” looks like
AI earns its place in a system when it does work the rest of the system can't reasonably do, and when its outputs flow back into a structured layer the rest of the system can act on. Free text is dangerous; typed extraction is useful. A black-box prediction is suspicious. A prediction with confidence, provenance, and a deterministic safety check is operable.
If you can articulate what your AI feature is doing in terms of “turning shape X into shape Y, with the following error modes,” you're in good shape. If the answer is “making things feel smart,” the surface area is still too vague to build well.
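As an illustration of "turning shape X into shape Y, with the following error modes," the sketch below parses raw model output into a typed value that carries confidence and provenance, then applies deterministic checks before anything downstream can act on it. The invoice-total example, the `InvoiceTotal` and `ExtractionError` names, the allowed currencies, and the 0.8 confidence threshold are all assumptions made for the sake of the example.

```python
import json
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class InvoiceTotal:
    amount: Decimal
    currency: str
    confidence: float  # model-reported, expected in [0, 1]
    source_span: str   # provenance: the text the value was read from


class ExtractionError(ValueError):
    """The model output cannot be trusted as an InvoiceTotal."""


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumption for this sketch
MIN_CONFIDENCE = 0.8                        # assumption for this sketch


def parse_invoice_total(raw_model_output: str, document_text: str) -> InvoiceTotal:
    """Turn raw model text (shape X) into an InvoiceTotal (shape Y), or raise."""
    # Error mode 1: the output is not the JSON object we asked for.
    try:
        data = json.loads(raw_model_output)
        amount = Decimal(str(data["amount"]))
        currency = str(data["currency"])
        confidence = float(data["confidence"])
        source_span = str(data["source_span"])
    except (ValueError, KeyError, TypeError, InvalidOperation) as exc:
        raise ExtractionError(f"malformed output: {exc}") from exc

    # Error mode 2: values fail deterministic business checks.
    if not amount.is_finite() or amount <= 0:
        raise ExtractionError("amount must be a finite positive number")
    if currency not in ALLOWED_CURRENCIES:
        raise ExtractionError(f"unsupported currency: {currency}")

    # Error mode 3: confidence is too low or out of range (NaN fails too).
    if not (MIN_CONFIDENCE <= confidence <= 1.0):
        raise ExtractionError(f"confidence too low or out of range: {confidence}")

    # Error mode 4: the claimed provenance is not in the source document.
    if not source_span or source_span not in document_text:
        raise ExtractionError("source_span not found in the document")

    return InvoiceTotal(amount, currency, confidence, source_span)
```

Each named error mode gives the rest of the system a definite outcome to handle: it receives either a valid `InvoiceTotal` or an `ExtractionError`, never an unchecked string.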
One last test
Ask whoever's proposing the feature, “how would we know this regressed?” If the answer is anything other than “we run evaluations and they would drop,” the feature isn't ready to ship, even if it currently looks fine.
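One way to make that answer mechanical is a small regression gate: run the same eval cases against the current model and a candidate, then compare pass rates. The sketch below is a minimal version under stated assumptions. The `EvalCase` shape, the stub models, and the 0.02 tolerance are hypothetical, and a real suite would need far more cases and richer scoring than a pass/fail predicate.

```python
from typing import Callable

# Hypothetical eval case: a prompt plus a predicate on the model's output.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check passes on the model's output."""
    if not cases:
        raise ValueError("an empty eval set cannot detect a regression")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True if the candidate scores below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


# Stub "models" backed by lookup tables, standing in for real clients.
cases: list[EvalCase] = [
    ("2+2", lambda out: out.strip() == "4"),
    ("capital of France", lambda out: "Paris" in out),
    ("3*3", lambda out: out.strip() == "9"),
    ("capital of Japan", lambda out: "Tokyo" in out),
]
current_answers = {
    "2+2": "4",
    "capital of France": "Paris",
    "3*3": "9",
    "capital of Japan": "Tokyo",
}
candidate_answers = {**current_answers, "3*3": "6"}  # one deliberate failure

baseline_score = pass_rate(lambda p: current_answers[p], cases)
candidate_score = pass_rate(lambda p: candidate_answers[p], cases)
print(baseline_score, candidate_score, regressed(baseline_score, candidate_score))
# prints: 1.0 0.75 True
```

The same gate is what turns a model swap into a controlled change: the candidate ships only if `regressed` returns `False` on the agreed eval set.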
That's most of the discipline. There's craft inside the details, but the high-level shape is unspectacular: define what good looks like, measure it, and build the surrounding system that the model needs to do its job. AI earns its place when those things exist. When they don't, it doesn't.


