News Analysis

The Fake Startup That Exposed the Real Limits of Autonomous Workers

By Lance Haun
The Carnegie Mellon study confirmed what many suspected: despite promises of world-beating results, agentic AI isn’t ready to run the ship.

A group of researchers gave today’s top AI models the chance to run a company. The AI agents lied, got lost, rewrote reality and collapsed under the weight of basic office tasks.

This is what’s supposed to take over our jobs? 

Carnegie Mellon's fake software startup, first reported by Business Insider, filled every role with AI agents built with the latest models from OpenAI, Google, Anthropic and Amazon. The goal: find out what happens when machines are expected to do actual jobs, without human backup. Sort of a "Lord of the Flies" meets Skynet experiment.

The answer? They break, sometimes in weird ways.

In the simulated organization, aptly called The Agent Company, each model had real business tasks. Analyze spreadsheets. Write performance reviews. Pick an office. Anthropic’s Claude was the best of the bunch and still failed three-quarters of the time. Gemini, ChatGPT and Nova barely functioned. Amazon’s Nova in particular delivered an impressively bad 1.7% success rate.

Even the most basic tasks cost $6 and took dozens of steps. One agent stalled at closing a pop-up window. Another couldn’t find the right colleague, so it renamed a different employee to match, got the answer it was looking for and carried on.

These weren’t strange edge cases that would trip up even the most experienced employees. These were ordinary business operations, things that real people do on a regular basis. The models simply couldn’t handle them. 

AI Agent Hype vs. Reality

Product marketing calls AI agents the future of work. You’ve probably seen some of them too. Microsoft Copilot. Salesforce Agentforce. Autonomous developers building full applications. The promises are big, loud and everywhere. 

But Carnegie Mellon ran the receipts. Yes, companies like Honeywell and Lumen are seeing real returns, but within constrained systems. Agents can summarize, assist, compile and sort with clear instructions and tasks.

No doubt, that is real value. But none of it proves they can reason through a broader business problem or act without a map or human guidance.

In fact, the illusion of autonomy collapses in the absence of a well-defined structure. AI agents don’t know what to do next, so they just try something. And most of the time, it doesn’t work.

Throwing people into the mix isn’t the answer, either. People need training to supervise AI agents, and if we’ve learned anything from the lack of leadership training most managers get, that training will be in short supply — at least initially.  

If an employee you supervise makes up a co-worker to throw under the bus when they make a mistake, you can reprimand or fire them. When an AI agent does it, what’s the right response? Do you take them out of the flow, breaking the work of other agents? Do you try to course correct and monitor without shutting things down? Do you need to re-orchestrate your entire process? 

AI is driven by probabilistic responses rather than deterministic outcomes, which makes testing fundamentally different from traditional software development: you verify behavior statistically, over many runs, rather than asserting a single expected output. Oversight changes accordingly. Specialized, trained supervision is absolutely necessary.
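One way to picture that difference is an acceptance test that scores an agent over many runs instead of asserting one exact answer. This is a minimal sketch with a hypothetical `flaky_agent` standing in for a real model call; the names and the 90% success rate are illustrative assumptions, not anything from the study.

```python
import random

# Hypothetical stand-in for an agent call: answers correctly 90% of the
# time. A real test harness would call your agent or model API here.
def flaky_agent(prompt, rng):
    return "4" if rng.random() < 0.9 else "5"

def success_rate(task, expected, runs=200, seed=0):
    """Score the agent over many seeded runs instead of one exact output."""
    rng = random.Random(seed)
    hits = sum(flaky_agent(task, rng) == expected for _ in range(runs))
    return hits / runs

# A deterministic-style test (assert output == "4") would fail
# intermittently. A threshold on the measured rate captures what
# "working" means for a probabilistic system.
rate = success_rate("What is 2 + 2?", "4")
assert rate >= 0.8, f"agent below acceptance threshold: {rate:.0%}"
```

The seed makes the run reproducible, which is the closest a probabilistic test gets to a traditional one.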

How Should Leaders Approach an AI Agent Rollout?

Many organizations are jumping into agentic AI without looking. Carnegie Mellon’s research should give them pause about handing over too much, too fast. 

First, your agentic journey should start small and stay grounded. Use agents on boring, rule-bound tasks that you can easily monitor for consistency and quality. Data entry. FAQ triage. Workflow routing. Prove they can follow instructions precisely before you trust them with decisions.

For instance, Jaja Finance's chat assistant "Airi" cut response times by 90%. Microsoft Copilot Studio builds agents to guide onboarding and deflect IT tickets, exactly the kind of work that benefits from speed and structure. Notably, both use cases include human backstops and escalations.
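The backstop pattern those deployments share can be sketched in a few lines. Everything here is hypothetical and illustrative: the agent answers only questions that match a known rule, and anything uncertain escalates to a person rather than letting the agent improvise.

```python
# Hypothetical FAQ-triage agent with a human backstop. Rule topics and
# answers are invented for illustration.
FAQ_RULES = {
    "reset password": "Send yourself a reset link from the login page.",
    "vpn access": "Request VPN access through the IT service desk form.",
}

def triage(question, escalation_queue):
    """Answer known rule matches; route everything else to a human."""
    words = set(question.lower().replace("?", "").split())
    for topic, answer in FAQ_RULES.items():
        if set(topic.split()) <= words:  # every topic word appears
            return answer
    escalation_queue.append(question)    # the human backstop
    return "I've routed your question to the IT team."

queue = []
print(triage("How do I reset my password?", queue))    # matched rule
print(triage("Why is the build server down?", queue))  # escalated
```

The key design choice is the default: when the agent is unsure, it hands off instead of guessing, which is exactly what the Carnegie Mellon agents failed to do.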

Second, consistent patterns matter with agentic AI. You'll find it succeeds best in tight lanes. Once nuance or uncertainty enters, performance nosedives (or, at the least, becomes highly variable). While specialized AI agents can handle that variability, they're best used in the narrow use cases they're designed for. Use AI to eliminate friction, not complexity; tackling complexity comes later. And even if that ambition never manifests the way you hope, the groundwork won't be wasted.

Set real limits on any agent experiments. Know — or try to predict — how failure shows up. Assign human owners who are briefed on what to look for and how to resolve issues. And make sure someone in the room still understands the process better than the machine.
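"Real limits" can be made concrete in code. This is a sketch under assumed names: a hard step budget, a cost cap and a named human owner who is flagged the moment either limit is breached, instead of an agent looping until someone complains.

```python
# Hypothetical guardrails for an agent experiment: all limits, names and
# costs are illustrative assumptions.
class AgentLimits:
    def __init__(self, max_steps, max_cost_usd, owner):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.owner = owner          # the accountable human
        self.steps = 0
        self.cost = 0.0

    def record(self, step_cost):
        """Call once per agent step; halt the run when a limit is breached."""
        self.steps += 1
        self.cost += step_cost
        if self.steps > self.max_steps or self.cost > self.max_cost_usd:
            raise RuntimeError(
                f"Agent halted after {self.steps} steps "
                f"(${self.cost:.2f}); escalating to {self.owner}"
            )

limits = AgentLimits(max_steps=25, max_cost_usd=5.0, owner="ops@example.com")
try:
    for _ in range(100):            # simulated runaway agent loop
        limits.record(step_cost=0.30)
except RuntimeError as err:
    print(err)                      # the cost cap trips first, at step 17
```

The point isn't the specific numbers; it's that failure has a defined shape and a named owner before the experiment starts.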

The rollout of AI agents should trigger even deeper scrutiny. Ask questions like:

  • Are you automating a burden or trying to hand off responsibility?
  • Why was this task handled by humans before? What changes with automation?
  • What’s your plan when, not if, the agent quietly fails?
  • Who’s accountable when, not if, it fails loudly?
  • Will this agent be actively supervised, or is it simply running until someone complains?

If you can’t answer these, you’re not deploying AI. You’re inviting chaos.

Stay in Control, Keep Pushing, Know the Line

The Carnegie Mellon study's bottom line bears repeating: agentic AI isn’t ready to run the ship. When left to operate unsupervised, even the most advanced models misfire on simple logic, basic interactions and foundational judgment. That's not a permanent ceiling, but it is a real limitation of today's designs.

Still, this isn’t a reason to pull back. It is a reason to focus. Agents provide value in narrow lanes, with clear oversight, while solving targeted problems. The ambition to do more isn’t the problem; the assumption that we’re already there is.

Deploy with purpose. Test relentlessly. Watch the edge cases. And above all, don’t let novelty distract you from responsibility.

Agentic AI is a tool with potential. But only if you stay in control and are clear on where it fits and where it doesn’t.


About the Author
Lance Haun

Lance Haun is a leadership and technology columnist for Reworked. He has spent nearly 20 years researching and writing about HR, work and technology.

Main image: Theodore Poncet | Unsplash