From proof of concept to production AI: the hard middle

AI proofs of concept are easy to be impressed by. A small clean dataset, a controlled demo, a few hand-picked examples, and the future feels close enough to touch. Production is a different animal. Production has messy data, impatient users, outages, permissions, cost limits, edge cases nobody anticipated, and someone eventually asking how a particular answer was reached.

This gap is where most business AI projects actually die, and it’s worth being precise about why. The proof of concept succeeded because it quietly excluded everything hard: the documents were the good ones, the questions were the expected ones, the user was the person who built it, and nothing had to survive a fortnight unattended. None of that was cheating. It’s what proofs of concept are for. The mistake is reading the demo as “90% done” when the honest reading is “possible”. The remaining work isn’t a residue of polish; it’s most of the engineering, and businesses that budget the production phase as a rounding error on the demo are the ones writing the whole thing off two quarters later.

The hard middle is turning that demo into a production AI system people can rely on day after day.

Data pipelines replace manual setup

A proof of concept usually runs on hand-picked files or a one-off export. Production needs the data to flow on its own. Documents get indexed on a schedule. When a file is deleted, it disappears from the answers too. Permissions update. And when a source system is slow or down, the whole thing has to cope rather than fall over.

The deletion case deserves a second look, because it’s the one that turns into an incident. A staff member is terminated, their access is revoked everywhere, and six weeks later the assistant cheerfully quotes from a document that was removed as part of the same process, because nobody built re-indexing. Or a contract is superseded, the old version is archived, and the system keeps answering from it because the index only ever grows. In demo-land these are edge cases. In production they’re Tuesday, and each one converts a user from “this is handy” to “check everything it says”, which kills the time savings the project was justified on.
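One way to keep the index from only ever growing is to reconcile it against the source system on every scheduled run. The sketch below is a minimal illustration, not a description of any particular product: `SourceDoc`, `plan_sync` and the version numbers are hypothetical names, and it assumes the source system can list its current documents with some kind of version marker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    version: int  # assumed: bumped by the source system whenever the file changes


def plan_sync(source: dict[str, SourceDoc], indexed: dict[str, int]) -> dict[str, list[str]]:
    """Compare the source of truth with the index and return the work to do.

    source:  doc_id -> document currently present in the source system
    indexed: doc_id -> version currently held in the search index
    """
    # Anything indexed but no longer in the source must leave the index.
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in source)
    # Anything in the source but not yet indexed must be added.
    to_add = sorted(doc_id for doc_id in source if doc_id not in indexed)
    # Anything present in both but with a different version must be re-indexed.
    to_update = sorted(
        doc_id
        for doc_id, doc in source.items()
        if doc_id in indexed and indexed[doc_id] != doc.version
    )
    return {"delete": to_delete, "add": to_add, "update": to_update}


# The superseded contract is still indexed; the handbook has changed since indexing.
source = {
    "contract-v2": SourceDoc("contract-v2", 1),
    "handbook": SourceDoc("handbook", 4),
}
indexed = {"contract-v1": 1, "handbook": 3}

plan = plan_sync(source, indexed)
print(plan)
```

If the source listing fails or times out, the safer choice is to skip that run rather than treat an empty listing as “everything was deleted”.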

This is ordinary data engineering, and it decides whether the AI behaves the same on Monday as it did in the demo.

Evaluation becomes ongoing

Testing a model once tells you almost nothing. Production AI needs a set of evaluation examples, error tracking, a queue for the cases that go wrong, and a way to compare versions over time. Change the prompt, the model, or the source data and you want to know whether quality went up or down before your users find out for you.

Without that, every update is a guess dressed up as progress. And updates are constant; that’s the part teams underestimate. Providers retire models on their own schedule, prices shift, someone wants a prompt tweak to fix one complaint, a new document type starts arriving. Each change is a chance for quality to move in either direction, invisibly, and “it seems fine” is not a measurement. The fix isn’t elaborate: keep the fifty-example test set from the selection phase, grow it with every interesting failure from the review queue, and rerun it before anything ships. That’s an afternoon of discipline per change, and it converts the scariest question in production AI, “did we just make it worse?”, into something you can answer with a table. We’ve written more about building that test set in our article on choosing the right AI model.
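The “rerun it before anything ships” step can be very small. The following is a sketch under stated assumptions: the three test cases, the substring check and the two stand-in systems are all invented for illustration, and a real test set would use whatever scoring suits the task. The point it shows is that an unchanged pass rate can still hide a regression.

```python
from typing import Callable

# Hypothetical test set: a question plus a phrase a correct answer must contain.
TEST_SET = [
    {"question": "What is the notice period?", "must_contain": "30 days"},
    {"question": "Who approves leave?", "must_contain": "line manager"},
    {"question": "What is the expense limit?", "must_contain": "$500"},
]


def run_eval(answer_fn: Callable[[str], str], cases: list[dict]) -> list[bool]:
    """Return one pass/fail per case, using a simple case-insensitive substring check."""
    return [
        case["must_contain"].lower() in answer_fn(case["question"]).lower()
        for case in cases
    ]


def compare(current: list[bool], candidate: list[bool], cases: list[dict]) -> dict:
    """Summarise both pass rates and list the cases the candidate newly breaks."""
    regressions = [
        case["question"]
        for case, was_ok, is_ok in zip(cases, current, candidate)
        if was_ok and not is_ok
    ]
    return {
        "current_pass_rate": sum(current) / len(cases),
        "candidate_pass_rate": sum(candidate) / len(cases),
        "regressions": regressions,
    }


# Stand-ins for the real system before and after a prompt or model change.
def current_system(question: str) -> str:
    answers = {
        "What is the notice period?": "The notice period is 30 days.",
        "Who approves leave?": "Your line manager approves leave.",
        "What is the expense limit?": "I couldn't find this in the connected documents.",
    }
    return answers[question]


def candidate_system(question: str) -> str:
    answers = {
        "What is the notice period?": "Notice is four weeks.",
        "Who approves leave?": "Leave is approved by your line manager.",
        "What is the expense limit?": "The limit is $500 per claim.",
    }
    return answers[question]


report = compare(
    run_eval(current_system, TEST_SET),
    run_eval(candidate_system, TEST_SET),
    TEST_SET,
)
print(report)
```

Here both versions pass two of three cases, but the candidate breaks a question the current version got right, which is exactly what a headline pass rate would hide.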

Users need clear failure paths

The system should be willing to say when it doesn’t know, when it can’t reach something, when its confidence is low, or when a person needs to look at the result before it goes anywhere. A tool that always tries to answer isn’t more helpful. It’s more dangerous, because it never tells you when to stop trusting it.

Good failure behaviour is most of how trust gets built, and it has a shape you can specify. “I couldn’t find this in the connected documents” beats a fluent guess. An answer with citations the user can click beats a confident paragraph with no provenance. A low-confidence flag that routes the case to a person beats silent delivery of a maybe. Design the correction loop too: when a user spots a wrong answer, there should be somewhere for that to go (a button, a queue, a human who triages it), because every caught error is a free evaluation example, and a user whose correction visibly improved the system becomes its advocate. Users don’t expect perfection from these tools. They expect honesty about the boundaries, and they abandon systems that bluff.
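That shape can be written down as a small decision rule that sits between the model and the user. This is a sketch only: the `Draft` structure, the 0.7 threshold and the idea that the system produces a usable 0 to 1 confidence score are all assumptions, and how confidence is estimated is a separate problem.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"
    NOT_FOUND = "not_found"
    NEEDS_REVIEW = "needs_review"


@dataclass
class Draft:
    text: str
    citations: list[str] = field(default_factory=list)
    confidence: float = 0.0  # assumed: a 0..1 score supplied by the system


REVIEW_THRESHOLD = 0.7  # illustrative value; tune against real review outcomes


def decide(draft: Draft) -> tuple[Outcome, str]:
    """Choose what the user sees: an honest refusal, a hand-off, or a cited answer."""
    if not draft.citations:
        # No supporting source: say so instead of delivering a fluent guess.
        return Outcome.NOT_FOUND, "I couldn't find this in the connected documents."
    if draft.confidence < REVIEW_THRESHOLD:
        # Supported but shaky: route to a person rather than deliver a maybe.
        return Outcome.NEEDS_REVIEW, "This answer has been sent to a person for review."
    sources = ", ".join(draft.citations)
    return Outcome.ANSWER, f"{draft.text} (Sources: {sources})"


unsupported = decide(Draft("Probably 30 days.", [], 0.9))
shaky = decide(Draft("Notice is 30 days.", ["handbook.pdf"], 0.4))
solid = decide(Draft("Notice is 30 days.", ["handbook.pdf"], 0.9))
print(unsupported[0], shaky[0], solid[0])
```

Note the order of the checks: a draft with no citations is refused even when its confidence score is high, because fluency without provenance is the failure being guarded against.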

Security and cost need design

Production systems need access controls, logging, rate limits, monitoring, and a handle on cost. A feature that runs fine for ten users can get slow or expensive at a hundred. A tool that’s safe with public data can be a liability the moment it touches confidential records. None of this is a finishing touch you bolt on at the end. It belongs in the plan from the start.
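Access control is easiest to reason about when it is applied before the model sees anything. The sketch below assumes each indexed chunk carries the groups allowed to open its source document; `Chunk`, `retrieve` and the keyword scoring are hypothetical stand-ins for whatever search the real system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # assumed: copied from the source system's permissions


def retrieve(query: str, user_groups: set[str], index: list[Chunk], limit: int = 3) -> list[Chunk]:
    """Return matching chunks the user may open. Filtering happens before ranking."""
    # Drop everything the user cannot open, so it can never reach the model.
    visible = [c for c in index if c.allowed_groups & user_groups]
    # Toy relevance score: how many query terms appear in the chunk text.
    terms = query.lower().split()
    scored = [(sum(t in c.text.lower() for t in terms), c) for c in visible]
    hits = [(score, c) for score, c in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in hits[:limit]]


index = [
    Chunk("payroll-2024", "payroll salary bands for all staff", frozenset({"hr"})),
    Chunk("leave-policy", "annual leave policy and salary sacrifice", frozenset({"hr", "all-staff"})),
]

staff_view = [c.doc_id for c in retrieve("salary", {"all-staff"}, index)]
hr_view = [c.doc_id for c in retrieve("salary", {"hr"}, index)]
print(staff_view, hr_view)
```

Because the filter runs inside retrieval, no prompt wording has to be trusted to keep the payroll file out of an answer.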

Cost has a particular failure mode in AI systems: it scales with enthusiasm. The demo cost nothing worth mentioning because five people used it occasionally. Success means two hundred people using it constantly, plus the batch job someone scheduled, plus the integration that calls it on every record. Per-call pricing means your bill grows with adoption, which is exactly backwards from most software, where success amortises the cost. So the design needs the boring controls up front: caching for repeated questions, cheaper models routed to easier tasks, rate limits, and a monthly number someone actually watches. On the security side, the permission model has to be enforced at retrieval, not by politeness: the system can’t answer from documents this user couldn’t open. And if the data is sensitive enough, the whole thing may belong inside infrastructure you control rather than on a public API.
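The boring controls fit in a few dozen lines. This is a minimal sketch, not a recommendation of specific numbers: `CostGuard`, the per-call prices, the word-count routing rule and the `scope` label are all hypothetical, and real pricing is usually per token and varies by provider.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative flat prices per call, invented for this example.
PRICE_PER_CALL = {"small": 0.001, "large": 0.02}


@dataclass
class CostGuard:
    monthly_budget: float
    spent: float = 0.0
    cache: dict[tuple[str, str], str] = field(default_factory=dict)

    def pick_model(self, question: str) -> str:
        # Crude routing rule for illustration: short questions go to the cheaper model.
        return "small" if len(question.split()) <= 12 else "large"

    def ask(self, question: str, scope: str, call_model: Callable[[str, str], str]) -> str:
        # The cache key includes the permission scope so that a cached answer
        # is never served to someone with different access.
        key = (scope, " ".join(question.lower().split()))
        if key in self.cache:
            return self.cache[key]
        model = self.pick_model(question)
        cost = PRICE_PER_CALL[model]
        if self.spent + cost > self.monthly_budget:
            raise RuntimeError("Monthly AI budget reached; request not sent")
        answer = call_model(model, question)
        self.spent += cost
        self.cache[key] = answer
        return answer


calls = []


def fake_model(model: str, question: str) -> str:
    """Stand-in for a real provider call; records which model was used."""
    calls.append(model)
    return f"[{model}] answer"


guard = CostGuard(monthly_budget=0.0015)
first = guard.ask("What is the leave policy?", "all-staff", fake_model)
second = guard.ask("what is the  leave policy?", "all-staff", fake_model)  # served from cache
print(first, second, calls, guard.spent)
```

A hard stop at the budget is the bluntest possible policy; an alert to the person watching the monthly number may suit better, but either way the limit exists before the bill does.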

Ownership must be named

Who watches the system? Who reviews the cases it got wrong? Who signs off prompt changes? Who deals with the vendor when their API breaks? Who decides the model has had its day and needs replacing? If those questions have no name attached, the system slowly rots while everyone assumes someone else is minding it. AI projects don’t maintain themselves.

The rot is gradual, which is what makes it dangerous. Nothing crashes. The answers just get a little staler as the business drifts away from the indexed documents, the review queue quietly fills with nobody triaging it, and usage declines in a way nobody’s dashboard reports because there is no dashboard. Eighteen months later someone asks whether the AI thing is still used, and the answer is “sort of”, which means no. Budget the custodian role from day one (it’s hours a week, not a headcount) and give that person the authority to change prompts and escalate to whoever maintains the system. A named human with a small standing budget is the difference between a system that compounds and a system that composts.

The production test

A production AI system should keep working on a boring Tuesday. It should handle the usual mess, explain where its answers came from, respect who can see what, keep its costs in check, and give staff a clean way to correct it when it’s wrong. That’s less exciting than the demo, and it’s a lot closer to the part that actually creates value.

If you’ve got a proof of concept that impressed everyone and a nagging sense that the distance to production is longer than the slide deck implied, you’re reading it right. Tell us what the demo does and what it touches, and we’ll map the hard middle for you: what the pipeline, evaluation, permissions and ownership actually require, and whether the value at the end justifies the crossing.

Turn the thinking into a plan.

Send the process, risk or idea. We will help you work out what is worth doing first.
