Choosing an AI Model for Business

Model choice has a way of swallowing a project. Teams pore over leaderboards, argue about brand names and lose sight of the actual job. For a business system, the best AI model often isn’t the newest or the biggest. It’s the one that does your task reliably, on your data, inside your constraints.

Sounds obvious. It’s still where a lot of projects come unstuck, and the reason is that model selection is the fun part. It feels like the decision, the way choosing a car feels like the decision even though the commute is what you actually live with. Meanwhile the parts that determine whether the project works (the task definition, the test set, the review process, the plumbing around the model) are homework. So teams do the fun part first and at length, pick a name-brand model on vibes and a leaderboard screenshot, and then discover in production that the leaderboard never measured anything resembling their supplier invoices.

Define the task first

Summarising meeting notes is not the same job as pulling fields out of supplier documents. A model answering customers carries a different risk profile from one drafting something a person will check before it goes out. Writing code needs a completely different test from classifying emails.

So before you pick anything, write down the task, the error rate you can live with, who reviews the output and what a mistake actually costs. Now the selection has something real to measure against.

Those numbers change everything downstream, which is why they come first. If you’re extracting line items from invoices and a human approves every one before payment, a 95% accurate model might be fine, because the 5% get caught at review and the process still saves hours. If the same extraction feeds payments with no review, 95% means one invoice in twenty is wrong and nobody notices, which isn’t automation, it’s a slow leak with a dashboard. Same model, same accuracy, completely different verdict. Teams that skip this step end up arguing about which model is “better” when they haven’t agreed what better means for the job in front of them.
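To make the invoice example concrete, here’s a minimal Python sketch. The names (TaskSpec, review_catch_rate) and the 1% tolerance are illustrative assumptions, not figures from a real project, and it assumes review catches every model error, as the example above does. It only shows that the same accuracy can pass or fail depending on the review step around it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """What the business can live with, written down before any model is picked."""
    name: str
    max_undetected_error_rate: float  # assumed tolerance for errors nobody catches
    review_catch_rate: float          # share of model errors review catches; 0.0 = no review


def undetected_error_rate(model_accuracy: float, spec: TaskSpec) -> float:
    """Share of all items that end up wrong AND unnoticed."""
    return (1.0 - model_accuracy) * (1.0 - spec.review_catch_rate)


def acceptable(model_accuracy: float, spec: TaskSpec) -> bool:
    """True when the unnoticed error rate stays within the agreed tolerance."""
    return undetected_error_rate(model_accuracy, spec) <= spec.max_undetected_error_rate


# Hypothetical specs: same extraction task, with and without a human approval step.
reviewed = TaskSpec("invoice extraction, human approves", 0.01, review_catch_rate=1.0)
unreviewed = TaskSpec("invoice extraction, straight to payment", 0.01, review_catch_rate=0.0)

for spec in (reviewed, unreviewed):
    verdict = "fine" if acceptable(0.95, spec) else "not good enough"
    print(spec.name, "->", verdict)
```

In practice a reviewer won’t catch everything, so a catch rate below 1.0 is the more honest input once you have data on it.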

Test on your examples

Public benchmarks are fine for getting your bearings, but they won’t tell you whether a model can cope with your invoices, your contracts, your policies, your forms, your acronyms, your product names and your weird edge cases. Build a small test set out of your own material, with the sensitive bits stripped out or handled safely.

Fifty examples is plenty to start, and they should be drawn from the ugly end of reality, not the tidy end: the scanned invoice with handwriting on it, the email that’s three questions in one, the contract with the amended clause stapled on. Include the cases where the right answer is “this needs a human”, because how a model behaves when it should be unsure matters as much as how it behaves when it’s right. Write down the correct answer for each example before you run anything, otherwise you’ll grade on impressions, and impressions grade fluency, not accuracy.

Then put the outputs side by side and compare the models on accuracy, consistency, how each one refuses, how it formats its output, how good its citations are, how fast it runs and what it costs. The winner is sometimes not the one you expected. It’s routine to find a mid-tier model matching the flagship on a narrow extraction task at a fraction of the cost per call, and just as routine to find that the model with the best reputation for prose is mediocre at your specific document layout. You only learn either from your own examples. And keep the test set when you’re done; it’s not scaffolding, it’s an asset. Every future model release gets judged against it in an afternoon instead of a debate.
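A test set like this can be graded with very little code. The sketch below is one way to do it, under stated assumptions: each model is wrapped as a plain function from text to text, the NEEDS_HUMAN label is an invented convention for the "this needs a human" cases, and the two models are toy stand-ins, not real vendor calls.

```python
from typing import Callable

NEEDS_HUMAN = "NEEDS_HUMAN"  # assumed label for "this needs a human"


def score(model: Callable[[str], str], test_set: list[tuple[str, str]]) -> dict:
    """Grade a model against answers that were written down in advance."""
    failures = []
    missed_escalations = 0
    for text, expected in test_set:
        got = model(text).strip()
        if got != expected:
            failures.append({"input": text, "expected": expected, "got": got})
            if expected == NEEDS_HUMAN:
                missed_escalations += 1  # model answered when it should have deferred
    return {
        "accuracy": (len(test_set) - len(failures)) / len(test_set),
        "missed_escalations": missed_escalations,
        "failures": failures,
    }


# Toy test set; a real one would hold around fifty of your own examples.
TEST_SET = [
    ("INV-001 total 120.00", "120.00"),
    ("INV-002 total 75.50", "75.50"),
    ("scan unreadable", NEEDS_HUMAN),
]


def cautious_model(text: str) -> str:
    """Stand-in: returns the last token only if it looks like an amount."""
    last = text.split()[-1]
    return last if last.replace(".", "", 1).isdigit() else NEEDS_HUMAN


def overconfident_model(text: str) -> str:
    """Stand-in: always answers, even when it should defer."""
    return text.split()[-1]


for name, model in [("cautious", cautious_model), ("overconfident", overconfident_model)]:
    result = score(model, TEST_SET)
    print(name, round(result["accuracy"], 2), result["missed_escalations"])
```

Exact string matching is the crudest possible grader. It suits field extraction; free-text tasks would need a looser comparison, but the list of failures is worth keeping either way, because reading it is where most of the learning happens.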

Consider data sensitivity

Some tasks can sit happily on a public API because the information is low risk. Others need private cloud, self-hosted models or local deployment. Sort the privacy question out before you pick the model, not after you’ve already wired it in.

The order matters because sensitivity is a constraint, not a preference, and constraints go first. If the data is privileged client material or under contractual data-handling clauses, then the eligible model list shrinks to what can run inside a deployment you control, and there’s no point benchmarking anything outside it. Teams that pick the model first and discover the constraint later face an ugly choice: rebuild on an eligible model, or quietly convince themselves the data isn’t that sensitive after all. The second option is chosen more often than anyone admits, and it’s how confidential records end up on infrastructure nobody signed off on. If the task touches contracts, personal information, commercial strategy, health records, legal records or sensitive operational data, slow down and choose the deployment deliberately. We’ve laid out the questions in our data sovereignty guide.
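Putting the constraint first can be as simple as filtering the shortlist before any benchmarking happens. A small sketch, with placeholder model names and deployment labels that are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    deployment: str  # e.g. "public_api", "private_cloud", "self_hosted"


def eligible(candidates: list[Candidate], allowed_deployments: set[str]) -> list[Candidate]:
    """Keep only the models that can run somewhere the data is allowed to go."""
    return [c for c in candidates if c.deployment in allowed_deployments]


# Hypothetical shortlist; the model names are placeholders.
shortlist = [
    Candidate("model-a", "public_api"),
    Candidate("model-b", "private_cloud"),
    Candidate("model-c", "self_hosted"),
]

# Assumed policy for privileged client material: controlled deployments only.
to_benchmark = eligible(shortlist, {"private_cloud", "self_hosted"})
print([c.name for c in to_benchmark])
```

Only what survives the filter goes on to the test set, so nobody gets attached to a model the data can never be sent to.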

Cost is more than price per token

A cheap model that needs three retries, heavy review and constant correcting can easily cost more than a dearer one that gets it right first time. A local model carries hardware and maintenance overhead but gives you predictable long-run economics. A cloud model is usually cheaper to prove out and simpler to upgrade.

Work an example. Say a document workflow runs 10,000 items a month. Model A costs half a cent per item and gets 88% right; Model B costs three cents and gets 97% right. On tokens alone, A wins by $250 a month ($50 against $300). But A fails on 1,200 items where B fails on 300, and each of those extra 900 failures needs a human minute or two to catch and fix. Call it 15 to 30 hours of staff time, which at ordinary Australian wages costs more than the token saving, often several times more. The dearer model is the cheaper system. It cuts the other way too: for a low-stakes internal task with light review, the cheap model’s failures might cost nothing worth counting, and paying flagship rates is just vanity. The comparison that matters takes in usage volume, latency, review effort, hosting, integration work and support, not just the sticker on the token.
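The same arithmetic as a short Python sketch, so you can drop in your own numbers. The volumes, prices and accuracies come from the example above; the 1.5 minutes per fix and the $40 hourly staff cost are assumptions picked for illustration, and the sketch ignores hosting, integration and support.

```python
def monthly_cost(
    items: int,
    price_per_item: float,
    accuracy: float,
    minutes_per_fix: float,
    hourly_staff_cost: float,
) -> dict:
    """Model spend plus the staff time spent catching and fixing failures."""
    model_spend = items * price_per_item
    failures = items * (1.0 - accuracy)
    fix_hours = failures * minutes_per_fix / 60.0
    staff_spend = fix_hours * hourly_staff_cost
    return {
        "model_spend": model_spend,
        "failures": failures,
        "fix_hours": fix_hours,
        "total": model_spend + staff_spend,
    }


ITEMS = 10_000
MINUTES_PER_FIX = 1.5     # assumption: midpoint of "a minute or two"
HOURLY_STAFF_COST = 40.0  # assumption: illustrative figure in AUD

model_a = monthly_cost(ITEMS, 0.005, 0.88, MINUTES_PER_FIX, HOURLY_STAFF_COST)
model_b = monthly_cost(ITEMS, 0.03, 0.97, MINUTES_PER_FIX, HOURLY_STAFF_COST)

for name, result in (("Model A", model_a), ("Model B", model_b)):
    print(f"{name}: tokens ${result['model_spend']:.0f}, "
          f"fixes {result['fix_hours']:.1f} h, total ${result['total']:.0f}")
```

With these assumed inputs Model A comes to about $1,250 a month and Model B to about $600. Set minutes_per_fix close to zero, as for a low-stakes task with light review, and the ordering flips back in A’s favour.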

Keep the model replaceable

Models move fast. Unless you’ve got a clear reason, don’t weld your business system to one provider. A well-built applied AI integration keeps prompts, evaluation sets, logging and the provider calls walled off from the rest of the application.

That way you can swap models down the track without tearing the process apart to do it. This isn’t hypothetical caution; it’s the observed rhythm of the last few years. Prices have dropped, capabilities have jumped, and the best-value model for a given task has changed roughly every six months. A business that built for replaceability is a config change and a test run away from banking each improvement. One that welded itself to a provider watches the improvements from behind glass, which is the same vendor lock-in story software buyers already know, replayed at AI speed. Your test set is what makes swapping safe: run the candidate against it, compare, decide on evidence.
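One way to keep the provider walled off is a thin interface that the rest of the application talks to. The sketch below is an assumption about shape, not a prescribed design: ModelProvider, the two stub classes and the config key are all hypothetical, and the stubs return canned text where a real adapter would call a vendor’s client.

```python
from typing import Callable, Protocol


class ModelProvider(Protocol):
    """The only surface the rest of the application is allowed to see."""
    def complete(self, prompt: str) -> str: ...


class StubProviderA:
    """Placeholder for a real vendor adapter; returns a canned answer."""
    def complete(self, prompt: str) -> str:
        return "A: 120.00"


class StubProviderB:
    """A second placeholder, standing in for a different vendor."""
    def complete(self, prompt: str) -> str:
        return "B: 120.00"


# Registry of adapters; adding a provider means adding one entry here.
PROVIDERS: dict[str, Callable[[], ModelProvider]] = {
    "provider_a": StubProviderA,
    "provider_b": StubProviderB,
}

# Prompts live beside the adapters, not scattered through business logic.
PROMPT_TEMPLATE = "Extract the invoice total from:\n{document}"


def load_provider(config: dict) -> ModelProvider:
    """Pick the adapter named in config, so a swap is a config change."""
    return PROVIDERS[config["provider"]]()


def extract_total(provider: ModelProvider, document: str) -> str:
    """Business code depends on the interface, never on a vendor SDK."""
    return provider.complete(PROMPT_TEMPLATE.format(document=document))


config = {"provider": "provider_a"}
print(extract_total(load_provider(config), "INV-001 total 120.00"))
```

Changing "provider_a" to "provider_b" in the config is the whole swap. Rerunning the test set against the new adapter is what tells you whether the swap was a good idea.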

Latency deserves its own paragraph before the wrap-up, because it hides until launch. A model that takes eight seconds to answer is fine for a batch job running overnight and fatal for a tool a staff member uses forty times a day; the humans stop waiting and go back to the old way, and the project dies of impatience rather than inaccuracy. Decide upfront which camp your task is in, interactive or background, and test response times with your real document sizes, not the demo’s three paragraphs.
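Measuring this takes a few lines. Here’s a rough sketch that times a call over a set of documents and checks the slow end against a budget. The two-second budget, the 95th-percentile cut and the stand-in call are assumptions for illustration; swap in your real provider call and real-sized documents.

```python
import time
from typing import Callable


def measure_latency(call: Callable[[str], str], documents: list[str]) -> list[float]:
    """Time one call per document and return the durations in seconds."""
    timings = []
    for document in documents:
        start = time.perf_counter()
        call(document)
        timings.append(time.perf_counter() - start)
    return timings


def within_budget(timings: list[float], budget_seconds: float, quantile: float = 0.95) -> bool:
    """Check a rough upper percentile, since the slow calls are what users remember."""
    ordered = sorted(timings)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index] <= budget_seconds


def stand_in_call(document: str) -> str:
    """Placeholder for a real model call."""
    return document.upper()


# Use documents of realistic size, not three tidy paragraphs.
documents = ["invoice text " * 500, "contract clause " * 2000]
timings = measure_latency(stand_in_call, documents)

INTERACTIVE_BUDGET_SECONDS = 2.0  # assumed budget for a tool used many times a day
print("interactive-ready:", within_budget(timings, INTERACTIVE_BUDGET_SECONDS))
```

An overnight batch job would get a much looser budget, which is the point of deciding which camp the task is in before testing.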

Choose with evidence

A model decision should land on a small table of results, not a gut feeling. Which model did you test, on what examples, what failed, what did it cost, where does a person check the output, and what would make you reassess? A brand-name debate gives you none of that. The test set does.

If you’re staring at a shortlist of models and a stack of vendor claims, tell us what the task actually is and we’ll help you build the test that settles it, including the honest outcome where the winner is a cheaper model than anyone was arguing for.
