LLM Pricing Is Weird: A Practical Guide to Picking the Right Model for the Job

If you are trying to bring AI into real business workflows, the hard part usually is not finding a model that can do something impressive in a demo. The hard part is knowing which model to use for which job without lighting your budget on fire.

That sounds obvious, but it is where a lot of teams get tripped up.

Some teams point every task at the most expensive frontier model because they want quality. That works, but it can turn a promising automation into a scary line item. Other teams go the opposite direction: they pick the cheapest model that looks decent in a benchmark, then wonder why the workflow falls apart when the model needs judgment, nuance, or a little bit of business context.

The better answer is model routing.

Use cheap, fast models for repetitive work. Use long-context models when the main job is reading a mountain of material. Use premium models when the work requires judgment, strategy, synthesis, coding, or customer-facing polish. And if you have enough volume, privacy constraints, or customization needs, consider open-weight/self-hosted models — but do that with clear eyes about the operational cost.

This post is a practical guide to that decision. It is written for builders, operators, founders, and corporate teams implementing automation who know they need LLMs but may not yet have a feel for the model landscape.

A quick caveat before we get into the numbers: prices and capabilities change constantly. Treat the pricing here as a May 2026 snapshot, not procurement gospel. Before you commit budget, re-check provider pricing pages and run your own evals on your real tasks.

[Figure: LLM model routing infographic. Choosing premium, mid-tier, cheap, long-context, or self-hosted AI models based on task risk, visibility, volume, context, and privacy needs.]

A quick visual guide to the model-routing framework in this post: spend premium tokens where errors are expensive or visible, and use cheaper/faster models for repeatable automation steps.

Quick reference

If you only skim one section, skim this one.

| If your task is... | Usually start with... | Why |
| --- | --- | --- |
| Final executive synthesis, strategy, judgment, high-stakes writing | Claude Opus/Sonnet, GPT frontier, Gemini Pro | You are paying for better reasoning, better writing, and fewer embarrassing errors. |
| Everyday business automation | Claude Haiku, GPT Mini/Nano, Gemini Flash/Flash-Lite, Grok Fast, DeepSeek Flash | Most workflow steps do not need the most expensive model. |
| Long document analysis | Gemini Pro/Flash, Claude Sonnet/Opus, Llama Scout-style long-context models | Context length and retrieval strategy matter more than raw benchmark scores. |
| Coding agents and code review | Claude Sonnet/Opus, GPT Codex/GPT frontier, DeepSeek, Qwen Coder | Code requires precision, iteration, and good tool use. |
| Classification, extraction, tagging, routing | Cheap fast models: Gemini Flash-Lite, GPT Nano/Mini, DeepSeek Flash, Llama/Mistral/Qwen hosted cheaply | These are high-volume steps where cost dominates. |
| Private data, data residency, customization, very high volume | Open-weight/self-hosted Llama, Qwen, Mistral, DeepSeek-family models | Token pricing may matter less than control, privacy, and throughput economics. |
| A workflow with many agents | Mix models by task | Do not use one model everywhere. Route based on risk, complexity, and output visibility. |

The simplest rule of thumb:

Spend premium tokens where the output is visible, strategic, or hard to verify. Save money everywhere else.

Why model pricing feels confusing

LLM pricing looks simple at first: input tokens cost one amount, output tokens cost another. But in real workflows, the cost profile is rarely simple.

A few things make it confusing:

  1. Output tokens usually cost more than input tokens. Long generated reports, chain-of-thought-style reasoning modes, verbose agents, and repeated retries can get expensive quickly.
  2. Context size changes the economics. A model with a 1M-token context window can ingest huge documents, but feeding huge prompts is still not free.
  3. Agent workflows multiply calls. One “task” may actually be 20 model calls: planner, researcher, scraper, classifier, summarizer, verifier, writer, reviewer, and formatter.
  4. Benchmarks do not equal workflow performance. A cheap model may look great on a chart but fail at your company’s messy CRM notes or weird PDF contracts.
  5. Provider pricing moves fast. What looked expensive six months ago may be normal now. What looks cheap today may be replaced next quarter.

That is why the decision should not be “which model is best?” It should be:

What is the cheapest model that performs reliably enough for this specific step, given the cost of being wrong?

That last clause matters. If a model misclassifies a low-priority support ticket, the cost is small. If it sends a bad contract summary to a customer or invents a strategic recommendation for an enterprise account, the cost is much higher.
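
One way to make "given the cost of being wrong" concrete is to add an expected error cost to the token cost and pick the minimum. The sketch below is illustrative only: the `Candidate` class, the per-call costs, and the error rates are hypothetical placeholders, and in practice the error rates would come from your own evals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    cost_per_call: float  # USD per call (hypothetical number)
    error_rate: float     # fraction of bad outputs, measured on your own evals


def expected_cost(candidate: Candidate, cost_of_error: float) -> float:
    """Token cost plus the expected cost of the model being wrong."""
    return candidate.cost_per_call + candidate.error_rate * cost_of_error


def pick_model(candidates: List[Candidate], cost_of_error: float) -> Candidate:
    """Return the candidate with the lowest expected cost for this step."""
    return min(candidates, key=lambda c: expected_cost(c, cost_of_error))


candidates = [
    Candidate("cheap", cost_per_call=0.001, error_rate=0.08),
    Candidate("premium", cost_per_call=0.05, error_rate=0.01),
]

# A misrouted low-priority ticket costs little; a bad customer-facing summary costs a lot.
ticket_choice = pick_model(candidates, cost_of_error=0.10)
summary_choice = pick_model(candidates, cost_of_error=500.0)
print(ticket_choice.name, summary_choice.name)
```

With these made-up numbers the cheap model wins the ticket-triage step and the premium model wins the customer-facing step, even though the candidates themselves did not change. Only the cost of an error did.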

Pricing snapshot: May 2026

The table below uses per-1M-token pricing, input/output, in USD. Where available, I use current provider or routing-market pricing as a practical reference. Some providers have discounts, batch pricing, cached-token pricing, regional pricing, promotional pricing, or enterprise terms. Those details can matter a lot at scale.

| Provider / model | Approx. input / output per 1M tokens | Context window | Practical read |
| --- | --- | --- | --- |
| Anthropic Claude Opus 4.7 | $5 / $25 | ~1M | Premium reasoning, writing, and synthesis. Use when quality matters. |
| Anthropic Claude Sonnet 4.6 | $3 / $15 | ~1M | Strong default for serious business automation, coding, and analysis. |
| Anthropic Claude Haiku 4.5 | $1 / $5 | ~200K | Faster/cheaper Claude option for simpler workflow steps. |
| OpenAI GPT-5.5 | $5 / $30 | ~1.05M | Premium general-purpose model with strong ecosystem/tooling story. |
| OpenAI GPT-5.5 Pro | $30 / $180 | ~1.05M | Reserve for very high-value reasoning where cost is secondary. |
| OpenAI GPT-5.4 Mini | $0.75 / $4.50 | ~400K | Useful mid-cost automation model. |
| OpenAI GPT-5.4 Nano | $0.20 / $1.25 | ~400K | Cheap routing, classification, and simple extraction. |
| Google Gemini 2.5 Pro | ~$1.25 / $10, higher at very long prompts in some tiers | ~1M | Excellent long-context economics and document-heavy workflows. |
| Google Gemini 2.5 Flash | ~$0.30 / $2.50 | ~1M | Strong balance for bulk automation, summarization, extraction. |
| Google Gemini 2.5 Flash-Lite | ~$0.10 / $0.40 | ~1M | Very low-cost high-volume tasks. |
| xAI Grok 4.3 | ~$1.25 / $2.50 | ~1M | Attractive output pricing for broad reasoning/synthesis tasks. |
| xAI Grok 4 Fast | ~$0.20 / $0.50 | ~2M | Very inexpensive fast model for bulk workflow steps. |
| xAI Grok Code Fast | ~$0.20 / $1.50 | ~256K | Budget coding-agent option. |
| DeepSeek V4 Pro | ~$0.435 / $0.87 promotional / routed pricing; base pricing may differ | ~1M | Cost-efficient reasoning/coding; good candidate for background agents. |
| DeepSeek V4 Flash | ~$0.14 / $0.28 | ~1M | Very cheap bulk extraction, summarization, and routing. |
| Meta Llama 4 Maverick, hosted | ~$0.15 / $0.60 | ~1M | Open-weight option with attractive hosted economics. |
| Meta Llama 4 Scout, hosted | varies by host; routed examples around ~$0.08 / $0.30 | very large / host-dependent | Long-context open-weight option; especially interesting for private or self-hosted work. |
| Qwen Coder / Qwen large models | varies; often low-cost via hosted routers | 256K–1M depending on model | Strong coding and open-model ecosystem option. |
| Mistral Small / Large / Devstral | varies; often inexpensive | 128K–262K+ depending on model | Good European/open-model option, especially where licensing and data residency matter. |

The headline is not just that some models are cheaper. It is that some are 50x or more cheaper for specific workflow steps.

For example, compare a premium model at $5 input / $25 output to a flash model at $0.10 input / $0.40 output. That is a 50x gap on input and roughly a 60x gap on output. If you are running millions or billions of tokens through extraction, tagging, and summarization, that difference is not a rounding error. It can determine whether the automation is economically viable.
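
To see what that gap means in dollars, here is a small worked calculation. The prices are the two price points quoted above; the monthly volume (1B input tokens, 100M output tokens) is a made-up workload, and `workload_cost` is just a helper name for this sketch.

```python
def workload_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """USD cost for a workload, given per-1M-token input and output prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly volume for a bulk extraction pipeline.
INPUT_TOKENS = 1_000_000_000
OUTPUT_TOKENS = 100_000_000

premium = workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, 5.00, 25.00)
flash = workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, 0.10, 0.40)

print(f"premium: ${premium:,.0f}/month")   # about $7,500
print(f"flash:   ${flash:,.0f}/month")     # about $140
print(f"ratio:   {premium / flash:.0f}x")  # roughly 54x
```

The exact ratio depends on your input/output mix, which is one more reason to plug in your own logged token counts rather than a guess.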

The model matrix

Here is the more useful decision matrix: not just what each model costs, but when you might actually use it.

| Model family | Strengths | Weaknesses | Best use cases | Avoid when |
| --- | --- | --- | --- | --- |
| Claude Opus / Sonnet | Reasoning, writing quality, coding review, structured synthesis, following nuanced instructions | More expensive than budget models; output can be costly | Executive briefs, final reports, account strategy, code review, high-trust workflows | Bulk extraction, simple classification, low-risk repetitive work |
| Claude Haiku | Claude-style instruction following at lower cost | Less capable on hard reasoning than Sonnet/Opus | First-pass summaries, triage, internal drafts, lightweight automation | Final strategic recommendations or high-stakes outputs |
| OpenAI GPT frontier | Broad reliability, ecosystem, multimodal/tooling integrations, agent platforms | Premium models can be expensive, especially in output-heavy workflows | General-purpose automations, tool-using agents, multimodal workflows, polished outputs | Cost-sensitive bulk steps unless using mini/nano variants |
| GPT Mini/Nano | Low-cost general automation, extraction, routing | Less depth and nuance | Classification, field extraction, deduping, quick summaries | Complex reasoning, high-stakes writing, ambiguous tasks |
| Gemini Pro | Long-context analysis, document-heavy work, Google ecosystem | Pricing and behavior can vary by tier/context length | Large document review, research packets, meeting transcript analysis, knowledge-base synthesis | Tiny tasks where a flash model is enough |
| Gemini Flash / Flash-Lite | Excellent economics for high-volume workloads | Less reliable than premium models on nuanced judgment | Bulk summarization, extraction, routing, preprocessing, first-pass analysis | Final customer-facing strategy or subtle reasoning |
| Grok | Competitive price/performance, large-context options, X/search-adjacent ecosystem | Ecosystem and enterprise adoption may vary by org | Research, fast synthesis, coding variants, cost-sensitive analysis | Regulated environments where vendor approval is an issue |
| DeepSeek | Very low cost for reasoning/coding-style tasks | Enterprise risk review, data policy, and consistency should be evaluated carefully | Background agents, code helpers, reasoning at scale, budget-sensitive workflows | Highly regulated or sensitive data without legal/security review |
| Llama / open-weight | Control, privacy, customization, self-hosting, high-volume economics | You own serving, scaling, evals, monitoring, updates, licensing analysis | Private internal automation, predictable high-volume workloads, fine-tuning | Small teams that just need the best model tomorrow morning |
| Qwen / Mistral / other open models | Strong open ecosystem, coding variants, regional/licensing alternatives | Quality varies by model and host | Coding, private workloads, budget inference, experimentation | When frontier reliability is required and model ops maturity is low |

The practical routing framework

Instead of choosing a single “best” model, design your automation around risk and complexity.

1. Use premium models when the task requires judgment

This includes:

  • strategic recommendations
  • final executive summaries
  • nuanced writing
  • customer-facing messages
  • legal or policy interpretation, with human review
  • complex code review
  • ambiguous tradeoff analysis
  • tasks where the model needs to say “I am not sure” correctly

This is where Claude Opus/Sonnet, GPT frontier models, and Gemini Pro-style models earn their keep. You are not paying for tokens; you are paying for a lower probability of a bad answer at a high-leverage step.

2. Use mid-tier models for everyday work

A lot of business automation lives here:

  • summarizing CRM notes
  • drafting internal updates
  • turning meeting transcripts into action items
  • rewriting content for tone
  • generating first drafts
  • answering questions over retrieved context

Mid-tier models are often the sweet spot. They are good enough to feel smart, but not so expensive that every automation needs a CFO approval meeting.

3. Use cheap fast models for mechanical steps

This is where many teams overspend.

Do not use a premium reasoning model to answer questions like:

  • Is this email about support, sales, billing, or legal?
  • Extract company name, role, date, contract value, and renewal term.
  • Deduplicate these contacts.
  • Turn this transcript chunk into five bullet points.
  • Decide which downstream agent should handle this task.

These steps can usually be handled by cheap fast models, especially if you use structured outputs and validate the result.
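
As a rough illustration of "structured outputs plus validation" for the email-triage question above, here is a minimal sketch. It assumes the cheap model was prompted to reply with JSON like `{"category": "billing"}`; `call_model` is a hypothetical stand-in for whatever client you use, not a real library function.

```python
import json
from typing import Callable, Optional

ALLOWED_CATEGORIES = {"support", "sales", "billing", "legal"}


def parse_category(raw: str) -> Optional[str]:
    """Return the category if the reply is valid JSON with an allowed label, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    category = data.get("category")
    if isinstance(category, str) and category in ALLOWED_CATEGORIES:
        return category
    return None


def classify_email(email: str, call_model: Callable[[str], str],
                   max_attempts: int = 2) -> Optional[str]:
    """Ask the cheap model up to max_attempts times; None means 'escalate this one'."""
    for _ in range(max_attempts):
        category = parse_category(call_model(email))
        if category is not None:
            return category
    return None
```

Returning `None` instead of guessing keeps the cheap path honest: anything the validator rejects can go to a stronger model or a human queue.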

4. Use long-context models when reading is the job

If the primary difficulty is that there is a lot of material — 300 pages of policy, years of support tickets, an RFP packet, a giant customer account history — context length becomes strategically important.

But be careful: long context is not magic. A model can accept a million tokens and still miss the one sentence that matters. For high-value workflows, combine long context with retrieval, chunking, citations, and verification.

Use long-context models when:

  • the relevant evidence may appear anywhere
  • you need broad synthesis across many documents
  • splitting documents would lose important relationships
  • the cost of retrieval misses is high

Use retrieval instead when:

  • the corpus is stable
  • queries are narrow
  • you can index the content well
  • you need repeatable citations

5. Use open-weight or self-hosted models when control matters

Open-weight models are not merely “cheap models.” They are a different operating model.

With hosted APIs, you pay per token and outsource serving. With self-hosting, you may reduce or eliminate per-token API fees, but you take on hardware, electricity, orchestration, scaling, monitoring, uptime, model updates, quantization, security, and evals.

Self-hosting can make sense when:

  • you have very high token volume
  • workloads are predictable enough to keep GPUs utilized
  • data residency or privacy requirements block third-party APIs
  • you need domain-specific fine-tuning
  • latency requirements favor local inference
  • you already have platform/ML engineering capacity

It probably does not make sense when:

  • usage is small or spiky
  • your team does not want to run inference infrastructure
  • you need the latest frontier capability immediately
  • the workflow changes every week
  • the model quality gap would cost more than the token savings

A good corporate posture is: start with APIs, measure real usage, then consider self-hosting once volume, privacy, or customization justifies the complexity.

Example workflow: sales/account intelligence agents

Let’s make this concrete.

Imagine you are building an AI sales intelligence workflow for enterprise account teams. The goal is to turn scattered information into a useful account brief before a sales call.

The system might pull from:

  • CRM notes
  • prior emails
  • support tickets
  • product usage data
  • company website
  • recent news
  • SEC filings or investor materials
  • LinkedIn-style profile data
  • call transcripts
  • previous proposals
  • competitive intelligence

The business value is obvious: reps spend less time researching and more time having relevant conversations. Managers get more consistent account planning. Customers get outreach that actually reflects their situation.

But if you use a premium model for every step, this workflow can get expensive fast. A better architecture uses multiple agents, each routed to the right model tier.

Agent 1: Data collection and normalization

Job: Gather raw material and convert it into clean structured records.

Examples:

  • company name
  • industry
  • employee count
  • revenue band
  • recent news headlines
  • known contacts
  • open opportunities
  • current products used
  • renewal date
  • support sentiment

Recommended model tier: cheap fast model or small open model.

Why: This is mostly extraction and formatting. Use structured JSON schemas, validation, and retries. You do not need a premium model unless the source material is extremely messy.

Good candidates: Gemini Flash-Lite, GPT Nano/Mini, DeepSeek Flash, Grok Fast, hosted Llama/Mistral/Qwen.

Agent 2: Source summarization

Job: Summarize each source into compact notes.

  • “Summarize the last 12 months of support tickets.”
  • “Summarize this earnings call section for business priorities.”
  • “Summarize the CRM history for this account.”

Recommended model tier: cheap or mid-tier model, depending on source complexity.

Why: Summarization is high-volume. Use lower-cost models for first-pass summaries, then pass only the compressed signal forward.

Good candidates: Gemini Flash, Claude Haiku, GPT Mini, Grok Fast, DeepSeek Flash.

Agent 3: Signal detection

Job: Identify buying signals, risk signals, and account triggers.

Examples:

  • new executive hire
  • expansion into a new market
  • cost-cutting language in earnings materials
  • competitor mentioned in support tickets
  • product adoption spike
  • renewal risk
  • unresolved support issues

Recommended model tier: mid-tier model, with escalation.

Why: This requires some judgment, but not always premium judgment. Use a mid-tier model first. Escalate to a stronger model when signals conflict or confidence is low.

Good candidates: Claude Haiku/Sonnet depending on risk, GPT Mini/frontier depending on complexity, Gemini Flash/Pro, DeepSeek Pro.

Agent 4: Account strategy synthesis

Job: Turn the signals into an account point of view.

This is where the brief becomes useful:

  • What is likely top of mind for this company?
  • What business problems might they care about?
  • Where do we have credibility?
  • What should the rep avoid saying?
  • What are the best discovery questions?
  • What is the account risk?
  • What is the recommended next step?

Recommended model tier: premium model.

Why: This is a judgment step. Bad synthesis can mislead the team. This is where Claude Sonnet/Opus, GPT frontier, or Gemini Pro makes sense.

Do not cheap out here. You have already saved money by compressing the raw research with cheaper agents. Spend the premium tokens on the step that matters.

Agent 5: Message drafting

Job: Draft a customer-facing email, call prep note, or executive briefing.

Recommended model tier: premium or strong mid-tier, depending on visibility.

If the output goes directly to a customer or executive, use a stronger model and a review step. If it is an internal first draft, use a mid-tier model.

Good candidates: Claude Sonnet/Opus for writing quality, GPT frontier for ecosystem/tooling, Gemini Pro for long-context grounding.

Agent 6: Critic/reviewer

Job: Check the brief before it reaches a human.

The critic should ask:

  • Are claims supported by evidence?
  • Did we confuse facts with guesses?
  • Are there obvious hallucinations?
  • Is the tone appropriate?
  • Did we miss known account risks?
  • Are recommendations specific enough?

Recommended model tier: strong model, often different from the writer.

Why: Reviewer diversity helps. If Claude wrote the account strategy, have GPT or Gemini critique it. If GPT wrote it, have Claude review it. For lower-risk workflows, a cheaper critic can still catch formatting and citation problems.
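
The "different reviewer" rule is simple enough to encode directly. This is a tiny sketch; the family names are just labels, and `pick_reviewer_family` is a hypothetical helper, not part of any SDK.

```python
from typing import Sequence


def pick_reviewer_family(writer_family: str, available: Sequence[str]) -> str:
    """Prefer the first available model family that did not write the draft."""
    for family in available:
        if family != writer_family:
            return family
    # No alternative available: fall back to self-review rather than skipping review.
    return writer_family


print(pick_reviewer_family("anthropic", ["anthropic", "openai", "google"]))
```

Ordering `available` by your own preference (cost, approval status, eval results) turns this into a small policy knob.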

What this workflow optimizes

This architecture gives you quality where it matters and savings where quality is easier to verify.

A bad version of the workflow does this:

Premium model reads everything, summarizes everything, reasons about everything, writes everything, and reviews itself.

A better version does this:

Cheap models clean and compress the data. Mid-tier models detect signals. Premium models synthesize strategy and final messaging. A separate reviewer model checks the output.

That is how you get high-quality results without paying premium prices for every token.
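
To put rough numbers on the two versions, the sketch below prices one account brief both ways. The per-step token counts are invented for illustration. The three price points are taken from the snapshot table (Gemini 2.5 Flash-Lite for "cheap", Claude Haiku 4.5 for "mid", Claude Opus 4.7 for "premium"), and the simplification that both versions process the same tokens is an assumption.

```python
# (input price, output price) in USD per 1M tokens, from the snapshot table.
PRICES = {"cheap": (0.10, 0.40), "mid": (1.00, 5.00), "premium": (5.00, 25.00)}

# (step, input tokens, output tokens, routed tier): token counts are hypothetical.
STEPS = [
    ("collect",   200_000, 10_000, "cheap"),
    ("summarize", 150_000, 15_000, "cheap"),
    ("signals",    20_000,  3_000, "mid"),
    ("strategy",   10_000,  2_000, "premium"),
    ("draft",       5_000,  1_000, "premium"),
    ("review",      8_000,  1_000, "premium"),
]


def step_cost(input_tokens: int, output_tokens: int, tier: str) -> float:
    """USD cost of one step at the given tier."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


routed = sum(step_cost(i, o, tier) for _, i, o, tier in STEPS)
all_premium = sum(step_cost(i, o, "premium") for _, i, o, _ in STEPS)

print(f"routed:      ${routed:.3f} per brief")
print(f"all premium: ${all_premium:.3f} per brief")
```

Under these assumptions the routed pipeline comes out at roughly a tenth of the all-premium cost, with the premium model still handling strategy, drafting, and review.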

A durable way to choose models

Because the model landscape changes so quickly, I would avoid hard-coding a permanent “default stack.” The right answer today may be stale in three months.

Instead, use a routing policy.

Route by task type

  • Extract / classify / tag: cheapest reliable structured-output model
  • Summarize: cheap or mid-tier model, depending on complexity
  • Research: long-context or retrieval-augmented model
  • Reason: premium or specialized reasoning model
  • Write final output: premium model if visible/high-stakes
  • Review: strong model, preferably different from the writer
  • Code: coding-specialized model or frontier model with good tool use

Route by risk

Ask: what happens if the model is wrong?

  • Low risk: use cheap models, validation, and sampling
  • Medium risk: use mid-tier models plus confidence checks
  • High risk: use premium models, citations, human review, and audit logs

Route by volume

Ask: how often will this step run?

  • Rare and important: premium is fine
  • Frequent and simple: optimize aggressively
  • Frequent and complex: consider caching, batching, distillation, fine-tuning, or self-hosting

Route by visibility

Ask: who sees the output?

  • Internal intermediate output: cheap is fine
  • Internal decision support: mid-tier or premium depending on stakes
  • Customer-facing output: premium plus review
  • Regulated/legal/financial output: premium plus human approval
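
A routing policy like this can live in a few lines of code instead of a slide. The sketch below is one possible encoding, not a standard: the tier names, the numeric floors, and the choice to map "research" to mid-tier and "code" to premium are my assumptions, and the volume dimension is left out for brevity.

```python
from typing import Tuple

TIERS = ("cheap", "mid", "premium")

# Minimum tier index each dimension asks for (assumed mapping of the lists above).
TASK_FLOOR = {"extract": 0, "classify": 0, "tag": 0, "summarize": 0,
              "research": 1, "write": 1, "reason": 2, "review": 2, "code": 2}
RISK_FLOOR = {"low": 0, "medium": 1, "high": 2}
VISIBILITY_FLOOR = {"intermediate": 0, "internal_decision": 1,
                    "customer": 2, "regulated": 2}


def route(task: str, risk: str, visibility: str) -> Tuple[str, bool]:
    """Return (model tier, needs human review) for one workflow step."""
    level = max(TASK_FLOOR[task], RISK_FLOOR[risk], VISIBILITY_FLOOR[visibility])
    needs_human = risk == "high" or visibility == "regulated"
    return TIERS[level], needs_human


print(route("classify", "low", "intermediate"))
print(route("write", "low", "customer"))
print(route("extract", "high", "regulated"))
```

Taking the maximum across dimensions means any single red flag (risky, visible, or inherently hard) is enough to upgrade the step, which matches the spirit of the lists above.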

Cost controls that matter more than model choice

Model choice is important, but it is not the only lever.

1. Compress before you reason

Do not send 500 pages to a premium model if a cheap model can extract the 20 relevant facts first.

2. Cache stable context

Policies, product docs, account profiles, and knowledge-base entries often change slowly. Use caching or retrieval instead of repeatedly paying to resend the same text.
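
One simple application-level version of this idea is to cache a derived summary per document and recompute it only when the content changes. This sketch is separate from any provider-side cached-token pricing; `SummaryCache` and the `summarize` callable are hypothetical names for illustration.

```python
import hashlib
from typing import Callable, Dict


class SummaryCache:
    """Caches a derived summary per document, keyed by a hash of its content."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times we actually paid for a model call

    def get(self, document: str) -> str:
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._summarize(document)
        return self._store[key]
```

Keying on the content hash rather than a document ID means an edited policy automatically misses the cache, so stale summaries do not linger.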

3. Use structured outputs

JSON schemas reduce retries, make validation easier, and let cheaper models succeed more often.

4. Add confidence thresholds

If a cheap model is confident and passes validation, continue. If not, escalate.
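
In code, that rule is a short wrapper. The sketch assumes each model callable returns an `(answer, confidence)` pair; how you obtain a confidence score (self-reported, log-probabilities, agreement between samples) is left open, and the 0.8 threshold is an arbitrary placeholder to tune on your own evals.

```python
from typing import Callable, Tuple

ModelFn = Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])


def run_with_escalation(task: str, cheap: ModelFn, strong: ModelFn,
                        validate: Callable[[str], bool],
                        threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, tier used). Escalate when the cheap answer is shaky or invalid."""
    answer, confidence = cheap(task)
    if confidence >= threshold and validate(answer):
        return answer, "cheap"
    answer, _ = strong(task)
    return answer, "strong"
```

Logging which tier answered each task gives you the escalation rate, which is the number that tells you whether the cheap model is actually saving money.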

5. Separate generation from review

A reviewer model can catch unsupported claims, formatting errors, and weak reasoning. This is especially useful for agent workflows.

6. Measure actual token usage

Do not estimate from vibes. Log input tokens, output tokens, retries, latency, and user satisfaction by workflow step.
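
A minimal version of that logging might look like the sketch below. The `CallRecord` fields and `usage_by_step` helper are hypothetical names; user satisfaction is left out because it usually arrives from a separate feedback system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CallRecord:
    step: str
    input_tokens: int
    output_tokens: int
    retries: int
    latency_s: float


def usage_by_step(records: Iterable[CallRecord]) -> Dict[str, Dict[str, float]]:
    """Aggregate calls, tokens, retries, and latency per workflow step."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0,
                 "retries": 0, "latency_s": 0.0}
    )
    for record in records:
        row = totals[record.step]
        row["calls"] += 1
        row["input_tokens"] += record.input_tokens
        row["output_tokens"] += record.output_tokens
        row["retries"] += record.retries
        row["latency_s"] += record.latency_s
    return dict(totals)
```

Multiplying the per-step token totals by the relevant prices shows which steps dominate the bill, which is often not the step you expected.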

7. Run evals on your own tasks

The best benchmark is your messy data. Build a small eval set of real examples, expected outputs, and failure cases. Re-run it whenever you change models.
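
An eval harness does not need to be elaborate to be useful. Here is a bare-bones sketch that assumes exact-match scoring, which suits classification and extraction; free-form outputs would need a different scorer. The `run_eval` name and the case format are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def run_eval(model: Callable[[str], str],
             cases: List[Dict[str, str]]) -> Tuple[float, List[Dict[str, str]]]:
    """Run each case through the model; return (accuracy, list of failures)."""
    if not cases:
        raise ValueError("eval set is empty")
    failures = []
    for case in cases:
        got = model(case["input"])
        if got != case["expected"]:
            failures.append({"input": case["input"],
                             "expected": case["expected"],
                             "got": got})
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures
```

Keeping the failures, not just the score, is the point: reading them is how you find out whether a cheaper model fails on cases you care about or on cases you can tolerate.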

Where self-hosted models fit

The self-hosting conversation usually starts with cost, but cost is only part of it.

Self-hosted/open-weight models can be compelling when you need:

  • privacy and data control
  • predictable high-volume inference
  • custom fine-tuning
  • lower latency inside your own network
  • domain adaptation
  • independence from a single API vendor
  • regional or regulatory control

The open model ecosystem is also getting much stronger. Llama, Qwen, Mistral, DeepSeek-family models, and specialized coding models are good enough for many internal automation steps. In some cases, they are not just “good enough” — they are the right tool because you can control them.

But self-hosting has real costs:

  • GPU capacity planning
  • serving infrastructure
  • autoscaling or queueing
  • monitoring and observability
  • model upgrades
  • security patching
  • evals and regression testing
  • prompt/model compatibility changes
  • licensing review

The break-even point depends on utilization. A GPU sitting idle is expensive. A GPU running predictable high-volume workloads can be very efficient.
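
The utilization effect is easy to see with a back-of-the-envelope calculation. Everything numeric below is hypothetical: a $3,000/month all-in cost for a GPU node, 2,000 tokens/second of peak throughput, and a $1.00 per 1M tokens blended API price to compare against. The sketch also ignores engineering time, which is often the larger cost.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000


def self_hosted_cost_per_million(monthly_fixed_cost: float,
                                 peak_tokens_per_second: float,
                                 utilization: float) -> float:
    """Effective USD per 1M tokens when the hardware is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens = peak_tokens_per_second * SECONDS_PER_MONTH * utilization
    return monthly_fixed_cost / tokens * 1_000_000


def break_even_utilization(monthly_fixed_cost: float,
                           peak_tokens_per_second: float,
                           api_price_per_million: float) -> float:
    """Utilization at which self-hosting matches the API price (above 1 means never)."""
    full_capacity_value = (peak_tokens_per_second * SECONDS_PER_MONTH
                           / 1_000_000 * api_price_per_million)
    return monthly_fixed_cost / full_capacity_value


mostly_idle = self_hosted_cost_per_million(3000, 2000, utilization=0.05)
busy = self_hosted_cost_per_million(3000, 2000, utilization=0.80)
threshold = break_even_utilization(3000, 2000, api_price_per_million=1.00)

print(f"5% utilized:  ${mostly_idle:.2f} per 1M tokens")
print(f"80% utilized: ${busy:.2f} per 1M tokens")
print(f"break-even utilization vs $1.00/1M API: {threshold:.0%}")
```

With these made-up inputs the same hardware costs more than ten dollars per 1M tokens when mostly idle and well under a dollar when busy, and it only beats the API above roughly 58% utilization.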

My practical advice: do not start with self-hosting just because it sounds cheaper. Start by measuring API usage. If you are pushing serious volume through stable, repeatable workflows — or if privacy/regulatory requirements demand it — then evaluate self-hosting with real numbers.

Also, pay attention to licenses. “Open weights” does not always mean “do whatever you want.” Some models use permissive Apache/MIT-style licenses. Others, including Llama-family models, have community licenses with commercial restrictions. That may be totally fine for your use case, but it is not something to discover after rollout.

So which model should you use?

Here is the short version:

  • If the task is strategic, ambiguous, or visible, use a premium model.
  • If the task is routine and easy to validate, use a cheap model.
  • If the task is reading a huge amount of context, use a long-context model or retrieval.
  • If the task is private, high-volume, or customized, evaluate open-weight/self-hosted models.
  • If the workflow has multiple agents, route each agent separately.

The biggest mistake is treating model selection like a one-time vendor decision. It is not. It is an architecture decision.

The best AI systems will not be built around one model. They will be built around routing, evals, observability, and thoughtful escalation.

That is less exciting than saying “Model X is the best.” But it is much more useful.

Final thought

LLM pricing is weird because LLM work is weird. A “task” might be one short answer or a chain of 40 hidden steps. A cheap model might be perfect for 80% of the workflow and disastrous for the final 20%. A premium model might be too expensive for bulk processing but incredibly cheap compared to one bad executive recommendation.

So do not ask, “Which model should we use?”

Ask:

What parts of this workflow need intelligence, what parts need speed, what parts need context, and what parts need judgment?

Once you answer that, the pricing starts to make a lot more sense.

Sources and pricing references

Pricing and model details were checked in early May 2026. Always verify current prices before procurement or production rollout.

LLM Pricing Is Weird: A Practical Guide to Picking the Right Model for the Job · Matt Rowe