Leadership's message hasn't changed in two years: get AI into production, fast. Enterprises are burning millions on foundation model API access and LLM fine-tuning programs. Back on the engineering floor, though, deployments are stalling. Costs are blowing past projections. And in every all-hands, the room goes quiet when someone asks why.
It's not the model. It's the pipes.
A model is only as good as the infrastructure that can actually move it. Most enterprise infrastructure was built for microservices — stateless, lightweight, horizontally scalable. AI workloads are none of those things. Try forcing 70B parameters through a deployment pipeline designed for a 200MB Spring Boot container and something breaks. Usually everything at once.
Four bottlenecks are killing enterprise AI deployments right now. Here's what they actually look like and what platform teams doing real work are doing about them.
Table of Contents
- 1. Your CI/CD Pipeline Was Never Built For This
- 2. Your Scheduler Has No Idea What a GPU Topology Is
- 3. Distributing Model Weights at Scale Will Saturate Your Network
- 4. Speed Requires Autonomous Governance
- The Actual Bottom Line
1. Your CI/CD Pipeline Was Never Built For This
In a standard microservices shop, a heavy image might hit 500MB. The pipeline eats it in seconds. Nobody files a ticket. But pushing LLM weights is a completely different game. You're suddenly forcing tens or even hundreds of gigabytes of stateful data down that exact same pipe. What used to take three minutes now takes four hours, or a whole weekend if you're unlucky. The worst part? The pipeline rarely throws a hard failure. It just silently grinds your release velocity into nothing and you don't notice until you're two sprints behind.
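To see why the slowdown is structural rather than a tuning problem, it helps to do the arithmetic once. This is a back-of-envelope sketch; the model size, effective throughput and hop count are illustrative assumptions, not measurements from any particular CI system.

```python
# Back-of-envelope: the same pipe, two very different artifacts.
# ASSUMPTIONS (illustrative, not measured): 1 Gbps effective throughput between
# CI runners, registry and nodes; three transfer hops per release.

def transfer_minutes(size_gb: float, throughput_gbps: float) -> float:
    """Minutes to move an artifact of size_gb at the given effective line rate."""
    return (size_gb * 8) / throughput_gbps / 60

THROUGHPUT_GBPS = 1.0   # assumed effective bandwidth, end to end
HOPS = 3                # assumed: build push, staging pull, production pull

for name, size_gb in [("500MB service image", 0.5), ("140GB of model weights", 140.0)]:
    per_hop = transfer_minutes(size_gb, THROUGHPUT_GBPS)
    print(f"{name}: ~{per_hop:.1f} min per hop, ~{per_hop * HOPS:.1f} min per release")

# 500MB service image: ~0.1 min per hop, ~0.2 min per release
# 140GB of model weights: ~18.7 min per hop, ~56.0 min per release
```

Layer retries, checksum verification and more than one environment on top of that and the gap between minutes and hours closes fast.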
Throwing more pipelines at it won't save you. What actually works is pulling infrastructure provisioning completely out of the model delivery path — treating them as two unrelated concerns, because they are. Teams that are actually shipping use Crossplane to manage cloud resources as native Kubernetes APIs, paired with ArgoCD for GitOps-driven synchronization. That gets the control plane out of the critical path.
Be precise about what this solves, though. It handles where and how resources get provisioned, but provisioning a GPU node is only half the battle. Before you even worry about transferring those massive model weights, Kubernetes still has to figure out how to assign the workload to the hardware. That brings us to the second bottleneck.
Related Article: Taming GPU Burn: Cut GenAI Costs Without Slowing Delivery
2. Your Scheduler Has No Idea What a GPU Topology Is
Buying H100s doesn't mean your jobs will run efficiently. The Kubernetes default scheduler was designed to place pods based on available CPU, memory and raw GPU count. It has zero awareness of NUMA nodes, NVLink domains or PCIe switch topologies. For web workloads, that's fine. For AI, it's a silent, expensive performance leak. I've watched teams burn weeks chasing what they thought was a model bug that turned out to be pod placement.
Here's what actually happens: the scheduler picks a node with open GPU capacity and hands the job off. The kubelet's Topology Manager then discovers the requested GPUs span two NUMA nodes. The pod either crashes outright or runs at a fraction of expected throughput. Your multi-million-dollar accelerators sit mostly idle, burning budget, while the dashboard cheerfully reports them as "utilized."
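Here's a deliberately simplified sketch of the difference in Python. It is not how any real scheduler is implemented; the node layout and scoring are made up to show what "topology-aware" means in practice: given a choice, prefer the GPU set that crosses the fewest NUMA and NVLink boundaries, not just any set that satisfies the count.

```python
from itertools import combinations
from typing import NamedTuple

class GPU(NamedTuple):
    gpu_id: int
    numa_node: int
    nvlink_domain: int
    free: bool

# Illustrative 8-GPU node: two NUMA nodes, two NVLink domains, two GPUs busy.
NODE_GPUS = [
    GPU(0, numa_node=0, nvlink_domain=0, free=True),
    GPU(1, numa_node=0, nvlink_domain=0, free=False),
    GPU(2, numa_node=0, nvlink_domain=0, free=False),
    GPU(3, numa_node=0, nvlink_domain=0, free=True),
    GPU(4, numa_node=1, nvlink_domain=1, free=True),
    GPU(5, numa_node=1, nvlink_domain=1, free=True),
    GPU(6, numa_node=1, nvlink_domain=1, free=True),
    GPU(7, numa_node=1, nvlink_domain=1, free=True),
]

def count_only_placement(gpus, want):
    """Roughly what a count-based scheduler does: any N free GPUs satisfy the request."""
    free = [g for g in gpus if g.free]
    return free[:want] if len(free) >= want else None

def topology_aware_placement(gpus, want):
    """Prefer the GPU set that crosses the fewest NUMA and NVLink boundaries."""
    free = [g for g in gpus if g.free]
    best, best_score = None, None
    for combo in combinations(free, want):
        score = (len({g.numa_node for g in combo}), len({g.nvlink_domain for g in combo}))
        if best_score is None or score < best_score:
            best, best_score = combo, score
    return best

for label, placement in [("count-only", count_only_placement(NODE_GPUS, 3)),
                         ("topo-aware", topology_aware_placement(NODE_GPUS, 3))]:
    print(label, [g.gpu_id for g in placement], "NUMA nodes:", {g.numa_node for g in placement})

# count-only [0, 3, 4] NUMA nodes: {0, 1}   <- spans both NUMA nodes
# topo-aware [4, 5, 6] NUMA nodes: {1}      <- stays inside one
```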
Stop using the default scheduler for AI workloads. Volcano, a CNCF batch scheduler, and NVIDIA's Run:ai both understand GPU topology natively: NVLink meshes, PCIe hierarchies, multi-rack awareness. They support gang scheduling, fair-share queuing and NUMA-aware placement. If your platform team isn't running one of these for AI jobs, you're leaving a substantial cut of your GPU performance on the floor and paying full rate for the privilege.
3. Distributing Model Weights at Scale Will Saturate Your Network
This one catches people off guard. It caught me off guard the first time I saw it in production.
You've finished fine-tuning a 70B parameter model. You need to push it to 250 GPU nodes. If every node independently pulls those weights from a central object store or registry, you've just triggered a thundering herd. The origin absorbs 250 simultaneous requests. Egress costs spike. A rollout that should finish in minutes stretches for hours, with no clean rollback path if something goes sideways mid-flight.
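It's worth writing that arithmetic down once, because the numbers are what make the case internally. A minimal sketch; the model size, node count and egress price are placeholder assumptions, not figures from any specific deployment.

```python
# Illustrative fan-out vs. P2P origin traffic for one model rollout.
# ASSUMPTIONS: 140GB of weights, 250 nodes, ~$0.09/GB origin egress, and a P2P
# layer that needs roughly one full copy (plus 10% overhead) from the origin.

NODES = 250
MODEL_GB = 140
EGRESS_PER_GB = 0.09   # placeholder price; varies by provider, region and path

naive_gb = NODES * MODEL_GB      # every node pulls independently from the origin
p2p_gb = MODEL_GB * 1.1          # peers serve each other after the first copy

print(f"naive fan-out : {naive_gb:>8,.0f} GB from origin  (~${naive_gb * EGRESS_PER_GB:,.0f} egress)")
print(f"p2p           : {p2p_gb:>8,.0f} GB from origin  (~${p2p_gb * EGRESS_PER_GB:,.0f} egress)")
print(f"origin traffic reduced by {100 * (1 - p2p_gb / naive_gb):.1f}%")

# naive fan-out :   35,000 GB from origin  (~$3,150 egress)
# p2p           :      154 GB from origin  (~$14 egress)
# origin traffic reduced by 99.6%
```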
Peer-to-peer distribution is the fix. Instead of every node hammering the origin, nodes pull pieces of the model from each other. In my recent architectural contributions to the CNCF Graduated project Dragonfly, we implemented a P2P AI model distribution pipeline that natively intercepts model hub endpoints. By decentralizing pull requests across the cluster, we achieved a 99.5% reduction in origin traffic, a number that materially changes how you budget for cloud AI at scale. The integration surface is practical: KServe, Triton Inference Server, vLLM and direct support for Hugging Face and ModelScope endpoints, which is where most teams actually pull models from. Not OCI registries.
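For Hugging Face pulls specifically, the client-side change can be as small as repointing the hub endpoint at the local peer. A minimal sketch, assuming a Dragonfly peer proxy (dfdaemon) is already running on the node and reachable at http://127.0.0.1:65001; that address, and whether your deployment intercepts traffic through HF_ENDPOINT or a transparent proxy, depends on how dfdaemon is configured, so treat the specifics here as placeholders.

```python
import os

# ASSUMPTION: a local Dragonfly peer fronts the Hugging Face hub at this address.
# huggingface_hub reads HF_ENDPOINT at import time, so set it before importing.
os.environ["HF_ENDPOINT"] = "http://127.0.0.1:65001"

from huggingface_hub import snapshot_download

# The pull now resolves through the local peer, which fetches the pieces it is
# missing from other peers in the cluster instead of every node hitting the hub.
weights_dir = snapshot_download(repo_id="meta-llama/Llama-2-70b-hf")  # repo is illustrative
print("weights cached at:", weights_dir)
```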
That distinction matters more than it seems. Model weights are not container images, and treating them as if they were produces architectural decisions that look sound on a whiteboard and fall apart the moment real traffic hits. Your container registry handles your application runtime. Your model distribution layer is a separate engineering concern; scope it and fund it as one. Full stop.
4. Speed Requires Autonomous Governance
Fix the plumbing and you hit the final wall: governance. When you can distribute massive models at wire-speed across GPU clusters, human operators can no longer manually track the blast radius of a bad deployment. The surface area is too large. The pace is too fast. No on-call rotation was designed for this.
In my books "Agentic AIOps" and "The AI-Native Tech Organization," I laid out why enterprises need to move from reactive monitoring to autonomous Site Reliability Engineering. The orchestration layer needs to run itself. AI agents query cluster state using standards like the Model Context Protocol (MCP), catch drift before it cascades and execute remediation without paging anyone at 2am.
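Stripped to a skeleton, that split between human policy and agent action looks something like the sketch below. The policy fields, findings and namespaces are invented for illustration; this is not an MCP client or a production controller, just the shape of the control loop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    namespace: str
    pod: str
    issue: str   # e.g. "config drift", "crash loop"

@dataclass(frozen=True)
class Policy:
    """Human-set guardrails; the agent never acts outside them."""
    auto_remediate_namespaces: tuple = ("model-serving",)
    max_actions_per_run: int = 5

def run_agent(findings: list, policy: Policy) -> None:
    """Auto-remediate only inside the approved blast radius; escalate the rest."""
    in_scope = [f for f in findings if f.namespace in policy.auto_remediate_namespaces]
    allowed = set(in_scope[: policy.max_actions_per_run])
    for f in findings:
        if f in allowed:
            print(f"auto-remediating {f.namespace}/{f.pod}: {f.issue}")
        else:
            print(f"escalating to humans: {f.namespace}/{f.pod}: {f.issue}")

# Stand-in for what the agent would read from live cluster state (e.g. over MCP).
findings = [
    Finding("model-serving", "vllm-0", "config drift"),
    Finding("model-serving", "vllm-3", "config drift"),
    Finding("payments", "api-7", "crash loop"),
]
run_agent(findings, Policy())

# auto-remediating model-serving/vllm-0: config drift
# auto-remediating model-serving/vllm-3: config drift
# escalating to humans: payments/api-7: crash loop
```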
Humans set policy. Agents handle the blast radius. Building fast pipes is step one; building the governance layer to manage them is how you stay alive in production long-term.
Related Article: The Chips Cold War: How GPUs Became the World’s Most Valuable Political Resource
The Actual Bottom Line
Stop scaling AI ambitions on infrastructure that was never designed for them. The teams making real progress right now don't necessarily have the best models. They're the ones who refused to touch production until the infrastructure plumbing was solved — and honestly, that discipline is rarer than it should be.
Topology-aware scheduling isn't a stretch goal. Neither is decoupling your control and data planes, or running P2P model distribution. These are solved problems with production-grade tooling. The only question is whether your platform team has prioritized them yet. If they haven't, your AI rollout is running on borrowed time.
