
Inside the AI Cost Crisis: Why Inference Is Draining Enterprise Budgets

By Nathan Eddy
AI inference costs are soaring. See how enterprises are rethinking models, infrastructure and budgeting as inference becomes the biggest driver of AI spend.

For most enterprises, inference now represents the majority of AI’s ongoing cost of ownership once systems reach production.

In practice, inference commonly accounts for 60%-90% of AI compute spend, because it scales continuously with user demand, token volume, latency requirements and availability SLAs. Training is episodic and predictable; inference is persistent and elastic, and that’s what drives budget pressure.
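
A rough back-of-envelope sketch makes that scaling concrete. The request volumes, token counts and prices below are illustrative assumptions, not figures from the article:

```python
# Illustrative estimate of how inference spend scales with usage.
# All numbers here are hypothetical assumptions, not benchmarks.

requests_per_day = 50_000          # user-facing + pipeline calls (assumed)
tokens_per_request = 2_500         # prompt + completion (assumed)
price_per_million_tokens = 3.00    # blended $/1M tokens (assumed)

daily_tokens = requests_per_day * tokens_per_request
daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
monthly_cost = daily_cost * 30

print(f"Daily inference cost:   ${daily_cost:,.2f}")
print(f"Monthly inference cost: ${monthly_cost:,.2f}")

# Unlike a one-off training run, this figure grows linearly with adoption:
# doubling requests (or tokens per request) doubles the bill.
```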

When Adoption Scales, Inference Costs Take Over

Danielle Ben-Gera, vice president of engineering at Crunchbase, explained that for many enterprises, inference is the largest share of AI total cost of ownership (TCO) because it scales directly with adoption and usage of user-facing features.

“When a smooth user experience depends on them, it’s hard to cap usage without impacting the product,” she said.

A key nuance, she added, is that inference spend is not driven only by user-facing features like chat. “A lot of value comes from behind-the-scenes inference. We are using models to extract, classify and enrich data, which then improves the information users ultimately see.”

As those use cases expand to more entities, more pipelines and more automation, inference naturally becomes the dominant cost center.

“Inference costs will likely trend upward over time because there are simply more and more high-ROI ways to apply it, with smarter models that can be useful for more applications,” Ben-Gera noted. For example, as organizations move toward deeper, more agentic systems (multi-step workflows that use multiple model calls and tools to reach an outcome), total spend can rise, even if per-token prices generally trend down.
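
A small sketch illustrates why agentic workflows push spend up even as per-token prices fall; the step counts and prices here are assumed purely for illustration:

```python
# Illustrative sketch of why multi-step agentic workflows raise total spend
# even when per-token prices drop. Step counts and token figures are assumptions.

price_per_million_tokens = 3.00   # assumed blended price

def workflow_cost(steps: int, tokens_per_step: int) -> float:
    """Cost of one workflow run that makes `steps` model calls."""
    return steps * tokens_per_step / 1_000_000 * price_per_million_tokens

single_call = workflow_cost(steps=1, tokens_per_step=2_000)
agentic_run = workflow_cost(steps=8, tokens_per_step=2_000)  # plan, tool calls, review...

print(f"Single-call task:  ${single_call:.4f} per outcome")
print(f"Agentic workflow:  ${agentic_run:.4f} per outcome")

# Even a 50% per-token price cut is outpaced by an 8x increase in calls per outcome.
```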

Related Article: Reimagining Traditional Workflows With AI Agents

Cutting Costs With Model Mix-and-Match Strategies

According to Derek Ashmore, VKTR contributor and AI enablement principal at Asperitas, many organizations are shifting parts of their AI workloads to smaller, task-specific models and on-prem or reserved GPU capacity to rein in inference costs. “This is most common for high-volume, latency-sensitive or predictable workloads where utilization is steady, and ROI is clear."

However, few enterprises are abandoning large foundation models. Instead, they’re adopting a tiered model strategy: smaller local models handle routine tasks, while frontier models are reserved for complex reasoning or low-volume, high-value use cases.
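
As a rough illustration of what such a tiered strategy can look like in practice, the snippet below maps task tiers to models; the model names, tier labels and prices are placeholders, not recommendations:

```python
# Minimal sketch of a tiered model policy: routine tasks go to small local
# models, frontier models are reserved for complex reasoning.
# Model names and prices are hypothetical placeholders.

MODEL_TIERS = {
    "routine":  {"model": "local-small-8b",  "max_cost_per_1m_tokens": 0.20},
    "standard": {"model": "hosted-mid-tier", "max_cost_per_1m_tokens": 1.50},
    "frontier": {"model": "frontier-large",  "max_cost_per_1m_tokens": 15.00},
}

def pick_model(task_tier: str) -> str:
    """Return the model assigned to this class of task."""
    return MODEL_TIERS[task_tier]["model"]
```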

“The cost crisis isn’t pushing enterprises away from AI, it’s forcing them to be far more deliberate about where and how inference runs,” Ashmore explained. 

Why Local LLMs Can Cost More Than Enterprises Expect

The real cost isn’t just the model but the end-to-end system around it, Ben-Gera cautioned. “With managed APIs, you can run data loading, unloading and orchestration on inexpensive instances and pay for inference based on usage."  

With a locally hosted LLM, organizations often end up running expensive GPU instances for the whole workflow, or they must build a more complex architecture to separate those concerns and keep utilization high.

“Once you account for infrastructure, reliability, monitoring and the engineering overhead to operate smaller self-hosted options safely, the savings are often less compelling,” she said.
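
The utilization math behind that caution can be sketched roughly as follows; the GPU price, throughput and API price are assumptions chosen only to show the shape of the trade-off:

```python
# Illustrative utilization math for self-hosted inference: a dedicated GPU
# only beats pay-per-use APIs if it stays busy. All numbers are assumptions.

gpu_hourly_cost = 4.00                          # $/hour for a GPU instance (assumed)
gpu_tokens_per_hour_at_full_load = 5_000_000    # throughput at 100% utilization (assumed)
api_price_per_million_tokens = 3.00             # managed API price (assumed)

def self_hosted_cost_per_million(utilization: float) -> float:
    """Effective $/1M tokens when the GPU is busy only `utilization` of the time."""
    effective_tokens = gpu_tokens_per_hour_at_full_load * utilization
    return gpu_hourly_cost / effective_tokens * 1_000_000

for u in (1.0, 0.5, 0.2, 0.05):
    print(f"{u:>4.0%} utilization -> ${self_hosted_cost_per_million(u):,.2f} per 1M tokens "
          f"(API: ${api_price_per_million_tokens:.2f})")

# At low utilization the 'cheap' local model costs several times the API price,
# before counting monitoring, reliability and engineering overhead.
```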

3 Proven Tactics to Cut Inference Spend

The biggest wins come from reducing the number of calls and the number of tokens per outcome, while keeping quality predictable, Ben-Gera noted.

She recommended three guidelines: 

  1. Centralized routing to the “cheapest sufficient” option, enabled by a flexible model layer. Invest in a flexible, centralized inference layer combined with quick model output evaluation tools, which enables swapping in newer and better models over time and staying deliberate about which model is used for which job. That makes it practical to route work to the lowest-cost approach that still meets the quality bar, rather than defaulting everything to the most expensive model (a minimal routing sketch follows this list).
  2. Centralized guardrails and prioritization. If inference goes through a central place, the organization can put consistent guardrails on spend based on budget and actively prioritize which tasks to run, depending on business needs. This ensures usage growth doesn’t automatically translate into uncontrolled cost growth.
  3. Efficiency tactics that cut calls and tokens. By “doing more per call,” money can be saved by combining related generative tasks into one prompt/response where safe, and by using batching and Batch APIs when available, especially for non-real-time enrichment workloads.
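
A minimal sketch of what such a centralized routing layer might look like appears below; the model names, prices, quality scores and budget figure are assumptions, not vendor data:

```python
# Minimal sketch of a centralized inference layer that routes each task to the
# "cheapest sufficient" model and enforces a spend guardrail.
# Model names, prices and quality scores are hypothetical assumptions.

from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    cost_per_1m_tokens: float   # assumed prices
    quality_score: float        # from your own offline evaluation suite

MODELS = [
    ModelOption("local-small-8b",  0.20, 0.72),
    ModelOption("hosted-mid-tier", 1.50, 0.85),
    ModelOption("frontier-large", 15.00, 0.95),
]

class InferenceRouter:
    def __init__(self, monthly_budget_usd: float):
        self.monthly_budget_usd = monthly_budget_usd
        self.spend_usd = 0.0

    def route(self, required_quality: float, est_tokens: int) -> ModelOption:
        """Pick the cheapest model that meets the quality bar, within budget."""
        for model in sorted(MODELS, key=lambda m: m.cost_per_1m_tokens):
            if model.quality_score >= required_quality:
                est_cost = est_tokens / 1_000_000 * model.cost_per_1m_tokens
                if self.spend_usd + est_cost > self.monthly_budget_usd:
                    raise RuntimeError("Budget guardrail hit: defer or deprioritize task")
                self.spend_usd += est_cost
                return model
        raise ValueError("No model meets the quality bar; escalate instead of guessing")

router = InferenceRouter(monthly_budget_usd=10_000)
print(router.route(required_quality=0.80, est_tokens=3_000).name)  # -> hosted-mid-tier
print(router.route(required_quality=0.70, est_tokens=3_000).name)  # -> local-small-8b
```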

Why AI Budgets Aren’t Exploding — Despite the Cost Crisis

Concerns about AI inference “blowing up budgets” are real, said Ashish Nadkarni, group VP and GM for IDC’s worldwide infrastructure research organization — but enterprise behavior is far more restrained than past technology cycles.

Organizations are not rushing to inflate spending simply because AI is strategically important, he claimed. Instead, most enterprises are taking a cautious, business-case-first approach shaped by lessons learned during earlier waves like big data and Hadoop. “Budgets are not just quadrupling because suddenly now AI is an existential threat to the company. Enterprises are taking a much more measured approach toward budgeting.”

According to Nadkarni, funding for inference-heavy workloads typically follows proof, not hype. Teams are expected to run pilots, validate use cases and demonstrate value using existing budgets before requesting incremental spend.

“People want to know what it is, how much it is going to help, where it is going to help and what the risks are,” he said. “Only if the business case is strong and the pilot has demonstrated value do they get additional allocations.”

Related Article: Cloud vs On-Prem AI: How to Choose the Right Enterprise Setup

AI Inference Requires a New Kind of Budget Discipline

Inference introduces a new budgeting challenge, said Nadkarni, because usage can spike unpredictably, particularly in seasonal or cyclical businesses. Planning for steady-state demand is no longer sufficient. To manage that risk, organizations should create centralized AI centers of excellence that evaluate inference costs, capacity planning and ROI across scenarios.


Nadkarni frames AI inference as an area where enterprises should expect uncertainty — and plan for it — rather than chase rapid adoption. Inference introduces unknowns into business and operational workflows, making flexibility and patience essential as organizations adapt.

“I think it’s an evolving area, and enterprises need to keep an open mind,” he said. “What has worked for them so far may not always work going forward, especially in the world of AI.”

Common Questions on AI Inference and Budget Control

Where do unexpected inference cost spikes usually come from?

Look for sudden growth in:

  • Multi-step agentic workflows that call multiple models
  • Data enrichment pipelines running more frequently than expected
  • User-facing features generating longer prompts or outputs
  • Shadow AI tools adopted by teams outside IT

Most cost spikes start as workflow design choices, not unexpected user demand.

How do you know when a cheaper model is good enough?

Use evaluation frameworks that measure output accuracy, task completion, hallucination rates and downstream impact. The real question isn’t “Is this model good?” It’s “Is this model good enough for the job at this price point?”

How can teams cut token usage in agentic workflows?

Map each step of the workflow and ensure every model call is necessary. Apply routing to cheaper models when possible, limit chain-of-thought verbosity and test whether intermediate steps can be combined. Many agentic pipelines can reduce token usage by 30-50% with no loss of quality.
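
As a rough illustration of that mapping step, the snippet below tallies tokens per workflow step; the step names and counts are hypothetical:

```python
# Minimal sketch of auditing token usage per workflow step before cutting
# models. Step names, call counts and token figures are hypothetical.

workflow_steps = [
    {"step": "plan",        "calls": 1, "avg_tokens": 1_200},
    {"step": "retrieve",    "calls": 3, "avg_tokens": 2_500},
    {"step": "draft",       "calls": 1, "avg_tokens": 3_000},
    {"step": "self-review", "calls": 2, "avg_tokens": 2_000},
]

total = sum(s["calls"] * s["avg_tokens"] for s in workflow_steps)
for s in workflow_steps:
    tokens = s["calls"] * s["avg_tokens"]
    print(f"{s['step']:<12} {tokens:>7,} tokens ({tokens / total:.0%} of run)")

# Steps that dominate the total are the first candidates for cheaper models,
# merged prompts or batching.
```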

What is the most common mistake enterprises make when cutting inference costs?

Cutting the model before understanding the workflow. Workflow redesign (reducing calls, batching jobs, routing intelligently) almost always delivers greater savings than simply switching models.

About the Author
Nathan Eddy

Nathan is a journalist and documentary filmmaker with over 20 years of experience covering business technology topics such as digital marketing, IT employment trends, and data management innovations. His articles have been featured in CIO magazine, InformationWeek, HealthTech, and numerous other renowned publications. Outside of journalism, Nathan is known for his architectural documentaries and advocacy for urban policy issues. Currently residing in Berlin, he continues to work on upcoming films while contemplating a move to Rome to escape the harsh northern winters and immerse himself in the world's finest art.
