Business interest in generative AI remains strong, with 88% of organizations closely monitoring its evolution. Nearly half (49%) plan to adopt genAI early, while another 37% see it as part of their future strategy. Yet, despite this enthusiasm, concerns around data security, regulatory compliance, accuracy, ethics and unintended consequences could stall adoption.
To ensure genAI delivers real business value, organizations must take a strategic approach — focusing on responsible AI governance, rigorous data management and transparency in AI decision-making. Success requires a balance between innovation and risk mitigation, embedding AI ethics, bias detection and compliance into deployment strategies. Only by tackling these challenges can genAI evolve from an experimental tool to a true competitive advantage.
To explore solutions, I interviewed Jonah Midanik, General Partner at Forum Ventures, and Kevin Wu, Founder & CEO at Pegasi AI, for their expert guidance.
The Risks of Wayward LLMs
What are the business risks of inaccurate and unpredictable LLM outputs?
Midanik said, “The risks of inaccurate and unpredictable LLM outputs are substantial, both operationally and financially. We’re seeing this play out across industries in real time. One of the biggest issues is flawed decision-making. If an LLM provides incorrect information, businesses act on bad data. That’s a direct risk to operations, strategy and even compliance.
"We see this frequently — companies deploy AI tools, get unreliable outputs and then must spend time verifying and correcting them. That inefficiency alone can kill productivity. In many cases, the time spent checking an LLM’s work negates the time savings AI was supposed to deliver. That’s one of the reasons why so many AI pilots don’t make it into full-scale production.”
Worse yet, Midanik added, “There’s also a real financial cost. If companies make business decisions based on false information, the consequences can be significant. A well-known example is Air Canada, where a chatbot hallucinated a policy that didn’t exist, leading to legal action and financial penalties. Air Canada had to pay out on the hallucinated information. Compliance risks like this are still a major blind spot in AI adoption.
"Customer trust is another issue. LLM-powered chatbots are often the frontline for customer interactions, and when they give incorrect information about products or policies, customers don’t blame the AI — they blame the company. That erodes credibility and damages the brand. Trust is one of the most valuable assets a company has, and AI that can’t be relied on undermines it.”
Plus, Midanik said, “There’s security. LLMs can inadvertently cause data leakage, especially when trained on internal emails or sensitive company information. We’ve already seen cases where LLM APIs have been exploited, including in the financial sector. When that happens, it’s not just a technical vulnerability, it’s a serious business risk with regulatory and reputational fallout.”
The bottom line, according to Midanik? “AI mistakes aren’t just theoretical — they’re expensive. Whether it’s decision-making errors, productivity losses, compliance failures, customer trust issues or security vulnerabilities, businesses need to be acutely aware of these risks before integrating LLMs into critical workflows.”
Related Article: Enterprise Security 2.0: How AI Is Changing the Game
Why LLMs Get It Wrong
What are the key reasons LLMs generate inaccurate or unpredictable outputs?
"Think of a large language model like a global game of 'telephone' being played by management consultants at hyperspeed," said Wu. "Picture the original message 'improve operational efficiency' getting passed through a chain of MBB firms. By the time it reaches the end, it's morphed into 'leverage blockchain-enabled synergies to disrupt the metaverse value chain.' But here's the twist: imagine this game being played simultaneously by thousands of consultants, each armed with their own frameworks, buzzwords and 2x2 matrices. This is similar to how an LLM processes information through billions of neural connections, each adding their own McKinsey-inspired spin to the output. Even with the same initial prompt, you'll get different variations of 'synergy-driven paradigm shifts' each time.
"This nondeterministic behavior isn't a bug — it's a fundamental feature stemming from how these systems are built. The evidence is clear in the benchmark data: since GPT-4's release in March 2023, every new OpenAI model has shown higher hallucination rates on text-based question answering. On OpenAI’s SimpleQA benchmark, newer models like o3-mini achieve only 13.4% accuracy even on the simplest questions.”
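Wu's "telephone game" can be made concrete with a toy simulation. The sketch below is purely illustrative (a real LLM samples next tokens from a learned probability distribution rather than swapping buzzwords), but it shows why the same prompt can drift into different outputs on every run:

```python
import random

# Toy illustration of the "telephone game" analogy: at each hop the message is
# rewritten by sampling from a set of alternatives, so the same starting prompt
# drifts differently every run. Real LLMs sample next tokens from a learned
# distribution; this sketch only demonstrates the nondeterminism.

REWRITES = {
    "improve": ["leverage", "optimize", "disrupt"],
    "operational": ["blockchain-enabled", "synergy-driven", "AI-first"],
    "efficiency": ["value chains", "paradigm shifts", "ecosystems"],
}

def one_hop(message: str) -> str:
    """Rewrite each word with some probability, like one 'consultant' in the chain."""
    words = []
    for word in message.split():
        options = REWRITES.get(word, [word])
        # 60% chance of keeping the word, otherwise sample a replacement
        words.append(word if random.random() < 0.6 else random.choice(options))
    return " ".join(words)

def play_telephone(message: str, hops: int = 5) -> str:
    for _ in range(hops):
        message = one_hop(message)  # each hop's output is the next hop's input
    return message

if __name__ == "__main__":
    prompt = "improve operational efficiency"
    for run in range(3):
        print(f"run {run + 1}: {play_telephone(prompt)}")
    # Same prompt, different outputs each run: sampling, not retrieval.
```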
According to Wu, “several key factors drive this inherent unreliability:
1. Statistical Pattern Matching
These systems don't truly understand language, facts or business realities. They're making sophisticated probability calculations based on training patterns, like a consultant who's memorized every HBR case study but never operated a business.
2. Training Data Limitations
No dataset, no matter how vast, can capture every possible real-world scenario. The training data itself often contains contradictions and biases that get baked into the model's responses.
3. Architectural Constraints
- Models can only see a limited window of context at once
- Each generated token becomes context for the next, causing errors to compound
- There's no reliable mechanism for fact-checking against ground truth
The implications are significant: these aren't temporary limitations we can engineer away — they're fundamental characteristics of how language models work. Until we develop radically different approaches to AI, unreliable outputs will remain part of the package.”
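Two of those architectural constraints, the bounded context window and the way each output token feeds the next step, can be shown mechanically with a toy generator (this is an illustration of the mechanics, not a real model):

```python
# Toy illustration of two constraints from the list above: the model only "sees"
# the last N tokens of context, and each generated token is appended to that
# context, so early facts can fall out of view and early errors feed later steps.

CONTEXT_WINDOW = 8  # tokens the toy model can attend to at once

def generate(prompt_tokens, steps, next_token_fn):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        visible = tokens[-CONTEXT_WINDOW:]     # anything older is invisible
        tokens.append(next_token_fn(visible))  # the output becomes future context
    return tokens

if __name__ == "__main__":
    # Trivial stand-in "model" that echoes the oldest visible token, just to
    # show how the original prompt eventually slides out of the window.
    out = generate(["refund", "policy", "is", "30", "days"], steps=6,
                   next_token_fn=lambda visible: visible[0])
    print(out)
```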
Strategies for GenAI Accuracy and Reliability
What technology capabilities are required to ensure LLMs produce accurate, context-aware outputs aligned with business needs?
Wu said, “In high-stakes workflows, trusted AI demands more than powerful LLMs. By harnessing these models and augmenting them with real-time third-party AI systems for fact-checking and correction — with complete audit trails — organizations ensure outputs that are secure, accurate and context-aware. Prioritizing rigorous alignment with core objectives, customized guardrails and iterative testing and evaluations not only drives smarter decisions but also secures a lasting competitive edge.”
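The pattern Wu describes (generate an answer, verify it with an independent system, correct it if needed, and log every step) can be sketched in a few lines. The generate, verify and correct callables below are placeholders, not any particular vendor's API:

```python
import json
import time
from typing import Callable

# Minimal sketch of a verify-and-correct wrapper with an audit trail.
# The three callables are assumptions standing in for real model and
# verification services.

def answer_with_audit(
    question: str,
    generate: Callable[[str], str],      # primary LLM call (placeholder)
    verify: Callable[[str, str], bool],  # independent fact check (placeholder)
    correct: Callable[[str, str], str],  # correction model or rules (placeholder)
    audit_log_path: str = "audit_trail.jsonl",
) -> str:
    draft = generate(question)
    ok = verify(question, draft)
    final = draft if ok else correct(question, draft)

    # Append an audit record so reviewers can trace how the answer was produced.
    record = {
        "timestamp": time.time(),
        "question": question,
        "draft": draft,
        "verified": ok,
        "final": final,
    }
    with open(audit_log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return final
```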
How can organizations ensure LLMs provide accurate, context-aware outputs while maintaining strong privacy, security and risk management safeguards?
Wu argued that “organizations can ensure LLMs deliver accurate, context-aware outputs while upholding privacy, security and risk management by adopting these best practices:
- Robust Data & Fine-Tuning: Establish governed data pipelines and fine-tune models on domain-specific data to capture relevant context.
- Alignment & Guardrails: Align outputs with core business objectives through customized prompt engineering and ethical guidelines.
- Autocorrection Systems: Leverage compound AI systems to review and automatically correct outputs, filtering unwanted responses.
- Continuous Testing & Monitoring: Implement iterative testing, real-time fact-checking and audit trails to ensure ongoing accuracy and compliance.
- Privacy & Risk Management: Secure data with encryption, adhere to regulatory standards and conduct regular risk assessments and audits.”
Wu added that what is needed is “a multi-layered approach [that] helps drive smarter decisions while safeguarding sensitive information and managing risks.”
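As a simplified illustration of the guardrail and privacy items on that list, here is a minimal output filter. The PII patterns and blocked topics are illustrative placeholders; a production guardrail layer would be far more extensive:

```python
import re

# Minimal guardrail sketch: redact obvious PII and refuse answers that touch
# blocked topics before anything reaches the user. Patterns and topics here are
# illustrative only, not a complete policy.

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
BLOCKED_TOPICS = ("medical advice", "legal advice")

def apply_guardrails(llm_output: str) -> str:
    lowered = llm_output.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "This request needs review by a human specialist."
    for label, pattern in PII_PATTERNS.items():
        llm_output = pattern.sub(f"[REDACTED {label.upper()}]", llm_output)
    return llm_output

print(apply_guardrails("Contact jane.doe@example.com about the refund policy."))
# -> "Contact [REDACTED EMAIL] about the refund policy."
```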
Related Article: AI in Analytics: 3 Key Tips to Keep Your Workflow up to Date
Where Agentic AI & Other Tech Fit Into the Equation
How does improving LLM accuracy and predictability enable the transition to agentic AI?
Midanik said, “The shift from standalone LLM outputs to agentic AI — where AI systems autonomously perform multi-step tasks — depends entirely on accuracy and predictability. It’s not just important; it’s a necessary condition. Agentic AI isn’t about generating a single response — it’s about chaining responses together, where each output informs the next step.
"The problem is that errors compound. A mistake that might seem minor in one isolated response can cascade through the chain, distorting every subsequent decision. What starts as a single incorrect output turns into an entire workflow of flawed actions, pushing the system further and further from the correct outcome. The cost of these mistakes doesn’t just add up — it multiplies.
“For agentic AI to function in real-world scenarios, every step in the chain needs to be correct, predictable and verifiable. Otherwise, businesses will be left with AI agents that generate mountains of incorrect work, requiring extensive human intervention to fix — defeating the entire purpose of automation. The stakes are even higher when these agents interact with critical business processes, financial systems or compliance-heavy industries. This is why accuracy isn’t just a quality metric for LLMs — it’s the foundation of whether agentic AI can actually deliver on its promise. Without predictable and checked outputs at every step, autonomous AI won’t scale beyond controlled demos and research labs.”
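Midanik's point about multiplying costs is easy to quantify with back-of-the-envelope math. Assuming, purely for illustration, that each step in an agentic chain is 95% reliable and that each step depends on the one before it:

```python
# Back-of-the-envelope illustration of error compounding in a chained workflow.
# The 95% per-step accuracy is an assumed number, not a measured benchmark.

per_step_accuracy = 0.95
for steps in (1, 5, 10, 20):
    workflow_accuracy = per_step_accuracy ** steps
    print(f"{steps:>2} chained steps -> {workflow_accuracy:.0%} chance of an error-free run")
# 1 step ~95%, 5 steps ~77%, 10 steps ~60%, 20 steps ~36%
```

Even a seemingly strong per-step accuracy leaves a 10-step workflow error-free only about 60% of the time, which is why accuracy reads here as a necessary condition rather than a quality metric.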
Where does AI alignment fit within the genAI technology stack?
According to Wu, “AI alignment must be integrated from day one and maintained throughout the GenAI lifecycle. Key practices include:
- Robust Testing & Purple Teaming: Conduct rigorous pre-deployment evaluations to identify vulnerabilities and misalignments. Build and train customized guardrails tailored to your enterprise workflow.
- Sophisticated Automated Evaluations: Continuously assess model performance and alignment through advanced automated evaluations, ensuring the AI remains secure, context-aware and aligned with evolving objectives.
- Post-Deployment Autocorrection: Utilize real-time monitoring and automated autocorrection systems to adjust outputs and maintain trustworthiness.
This comprehensive approach ensures that AI systems not only start aligned but also remain so across their entire lifecycle.”
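Wu's "sophisticated automated evaluations" can start as something as simple as a scheduled golden-set check. The sketch below assumes a hand-built golden set and a placeholder model_call; real evaluation suites are larger and scored by more than substring matching:

```python
from typing import Callable

# Minimal automated-evaluation sketch: rerun a small golden set against the model
# on a schedule and flag drift. model_call is a placeholder for a real API, and
# the golden set would come from domain experts rather than being hard-coded.

GOLDEN_SET = [
    {"question": "What is our refund window?", "expected": "30 days"},
    {"question": "Which regions do we ship to?", "expected": "US and Canada"},
]

def run_eval(model_call: Callable[[str], str], threshold: float = 0.9) -> bool:
    passed = sum(
        1 for case in GOLDEN_SET
        if case["expected"].lower() in model_call(case["question"]).lower()
    )
    score = passed / len(GOLDEN_SET)
    print(f"alignment eval: {passed}/{len(GOLDEN_SET)} passed ({score:.0%})")
    return score >= threshold  # gate deployment or trigger alerts on failure
```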
What architectural considerations should CIOs and CDOs prioritize when deploying genAI in the enterprise?
Wu argued that “CIOs and CDOs should not rely solely on model providers like OpenAI or Anthropic. Just as in cybersecurity, enterprises need independent, continuous validations to ensure that any LLM aligns with their unique business objectives and workflows. Key architectural considerations include:
- Continuous AI Purple Teaming: Implement ongoing red teaming (attacking the system), blue teaming (defending the system) and AI purple teaming — collaborative efforts that help build customized guardrails and evaluations tailored to specific enterprise workflows.
- Third-Party Validations With Agents as Judges: Engage independent third parties for continuous validation of LLM outputs, ensuring adherence to compliance standards and business values across various models.
- Autocorrection Layer: Integrate an autocorrection system that automates expensive SME evaluations during testing and improves accuracy post-deployment by dynamically reviewing and correcting outputs in real time, complete with a comprehensive audit trail.
Combining these strategies creates a scalable, reliable and governable AI ecosystem that minimizes risks and bolsters trust in AI-driven decision-making.”
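The "agents as judges" idea is essentially a second, independent model grading the first one's answer before it is released. A minimal sketch, with both model calls left as placeholders for whatever providers an enterprise actually uses and an illustrative scoring rubric:

```python
from typing import Callable

# Sketch of an LLM-as-judge validation step: an independent judge model scores
# each answer; low-scoring answers are routed to correction or human review.

JUDGE_PROMPT = (
    "Rate from 0 to 10 how well the ANSWER addresses the QUESTION, is factually "
    "consistent with the CONTEXT, and respects company policy. Reply with a number only.\n"
    "QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
)

def validated_answer(
    question: str,
    context: str,
    answer_model: Callable[[str], str],  # primary model (placeholder)
    judge_model: Callable[[str], str],   # independent judge model (placeholder)
    min_score: float = 7.0,
) -> tuple[str, bool]:
    answer = answer_model(question)
    raw = judge_model(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    try:
        score = float(raw.strip())
    except ValueError:
        score = 0.0  # an unparseable judgment is treated as a failure
    return answer, score >= min_score  # False -> route to autocorrection or human review
```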
Turning AI Challenges Into Business Strengths
GenAI’s potential is undeniable, but capturing real business value requires a strategic and responsible approach. Here are four key takeaways:
- Governance & Security Are Imperative: “Our data show that organizations most often cite security and privacy concerns potentially impacting their adoption and implementation of generative AI, with compliance a distant second,” said Brian Lett, research director at Dresner Advisory Services. “People view these security and privacy concerns as top of mind almost two times more often than they perceive quality and accuracy of responses as their biggest concern.” AI systems must align with business objectives while maintaining strong privacy, security and compliance safeguards. Continuous testing, audit trails and regulatory adherence should be embedded into AI workflows.
- Accuracy Is Non-Negotiable: Unchecked AI outputs can lead to flawed decision-making, financial losses, compliance failures and eroded customer trust. Organizations must invest in fact-checking, validation and autocorrection layers to mitigate risks. “Quality and accuracy of responses are important and remain a very high priority. They’re just not organizations’ top priority,” Lett added. “Many of these organizations likely fear higher-potential fallout resulting from security-related issues much more than having to deal with a hallucination or inaccurate response.”
- Alignment Drives Trust & Scalability: Ensuring AI outputs remain context-aware and mission-aligned requires robust training, real-time monitoring and independent validation. Without alignment, agentic AI will struggle to scale beyond controlled environments.
- AI Infrastructure Must Be Enterprise-Ready: Relying solely on model providers isn’t enough. Enterprises need custom AI governance strategies, third-party validations and scalable architectures to balance innovation with risk management.
By addressing these challenges head-on, businesses can move beyond AI experimentation and unlock true competitive advantage.