The outage came suddenly, as though someone pulled the plug on cloud services.
Amazon Web Services (AWS) powers countless apps, websites and software — everything from Snapchat to Roblox. The outage happened back in October, but how it impacted one business still serves as a significant lesson on AI's indispensability.
Table of Contents
- A Single Outage, Systemwide Failure
- The Billion-Dollar Domino Effect
- How an AI Outage Impacts Business Operations
- How to Mitigate an AI Disaster
A Single Outage, Systemwide Failure
Michael Pedrotti, co-founder of GhostCap, noticed chatter on the company's internal comms channels about the AI services it uses for marketing. Pedrotti uses AI to generate keywords, create content and track the success of the company's SEO work. He knew something was amiss when he started investigating the malfunctions one Monday afternoon.
“Immediately after the outage, all of these critical functions stopped functioning, demonstrating how dependent the fundamental digital marketing operations are on a single central cloud architecture,” he said.
These AI services didn’t just slow down or become erratic; they stopped running entirely, leading to what Pedrotti called “hard errors” that revealed a major AWS outage.
The Billion-Dollar Domino Effect
Pedrotti was not alone in noticing the outage. It affected countless businesses that use Amazon's cloud infrastructure to host their apps and services. According to data collected by Ookla, the outage led to reports from 17 million users that AWS was not functioning. Two of the most prominent AI chatbots, Claude and Perplexity, also experienced related outages.
The financial impact was also vast, extending well beyond user productivity and IT operations. CNN reported the AWS outage could cost hundreds of billions of dollars.
For companies that rely on AWS for AI services, the outage pointed to a serious underlying problem: a single point of failure can disrupt business operations. Experts say that while it's tempting to rely on the cloud for AI services, the impact on business, IT operations and end-user productivity is far too great to see AI as merely experimental or non-critical.
Alternatives to hosting AI services in the cloud, such as running a local large language model on internal servers and creating failsafe backups, will become increasingly important as more and more companies depend on AI services. It’s a wake-up call to evaluate IT operations and cloud dependencies, since the outage (which was mostly on the East Coast) was so extensive.
How an AI Outage Impacts Business Operations
The first thing to know about the AWS outage and how it impacted AI services is that it was far more than just a disruption with chatbots like Claude or ChatGPT.
While those bots are high profile and often an integral part of corporate knowledge worker productivity, the outage went much deeper than that. AWS is the cloud provider for hundreds of companies, including Netflix, United Airlines, Lyft and the educational platform Canvas. At each of those companies, AI is intertwined in complex ways.
Invisible AI Pipelines, Exposed
Sonu Kapoor, an IT and development consultant for companies like Citigroup, Visa, Cisco and Sony Music Publishing, said the deep tendrils of AI have expanded everywhere to the point where it's impossible to know where all of the dependencies even exist.
“[The AI outage] cascaded into enterprise workflows that quietly rely on the same backbone. Sony’s East Coast engineering teams lost access to Amazon Q, their AI assistant integrated into VS Code for documentation lookups and coding suggestions. Those few hours of downtime may not make headlines, but they stall code reviews, disrupt pipelines and force developers to revert to manual processes — an expensive slowdown at enterprise scale.”
The Weak Link No One Watches
Kapoor pointed to another example at Sony to help explain how far-reaching AI services are in the modern enterprise and how they can impact authentication and identity services.
“The outage also affected internal authentication flows,” he said. “Several AI-assisted developer tools and dashboards that depended on AWS-hosted identity services couldn’t authenticate. Even though the underlying codebases weren’t down, access bottlenecks made the tools effectively unusable. It was a reminder that AI reliability isn’t just about model uptime; it’s also about the invisible layers of identity, routing and permissions that support them.”
From Back Office to Frontline
While those are all more technical examples, an AI outage can also have ramifications far beyond web development and the security authentication that Kapoor mentioned.
Dr. John Bates, author and CEO of SER Group, said an AI outage like the one that AWS caused can “ripple across an organization’s entire back office, business operations and customer interactions, highlighting the importance of robust monitoring, redundancy and contingency planning. Revenue-driven systems in ecommerce, fintech or other industries may be directly affected if AI-driven recommendations, pricing or fraud detection fail.”
Andrew Sharp, research lead at Info-Tech Research Group, echoed this sentiment, saying AI is now intertwined in all of IT and the enterprise in ways that make it impossible to separate the two.
“An outage broadly indicates how embedded and integrated our digital systems for business have become. Many organizations don't know which AI systems are critical to their operations or business directly, or the impact on the organization of trying to operate without that functionality. Risk management is important but rarely urgent, and when there is a major outage, it becomes clear who wasn't prepared.”
The Silver Lining
AI innovation is happening so fast that some companies have not kept up in making sure the services are accessible at all times, the way they would for a website. Instead, it’s too easy to deploy the services, make them part of a web development or security process and then watch helplessly when an outage grinds operations to a halt.
Fortunately, the experts say the AWS outage has a silver lining — it can serve as an impetus to evaluate all operations and determine exactly how to keep AI running during downtime.
How to Mitigate an AI Disaster
AI services are vital to many companies these days, both from an end-user standpoint (e.g., running ChatGPT to stay productive during the day) and for many backend IT services.
As a fairly recent innovation, AI chatbots such as ChatGPT, Claude and Perplexity are useful for knowledge workers in particular. In fact, OpenAI recently claimed more than 700 million people use ChatGPT every week.
Yet, the experts warned that uptime for those chatbots is a small part of the overall AI infrastructure landscape. A good place to start is evaluating how reliant a firm is on AI services and how far-reaching these services are for all business operations.
Build for Failure, Not Perfection
Kapoor said AI services are often viewed as an afterthought — something tacked onto existing IT operations because they are new and innovative, but not viewed through the same lens as a critical business function. Instead, they are viewed as experimental. But as AI becomes integrated into IT services more and more, companies should reevaluate their AI operations.
“Companies need to design for graceful degradation and build fallback logic so core features continue to function when AI endpoints or authentication services fail,” he explained. “They also need to implement local or cached inference [the step in which a trained model, such as a large language model, processes an input to produce a result]. For essential features, run smaller models locally or store last-known outputs to avoid complete downtime.”
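One way to picture the fallback logic Kapoor describes is the minimal Python sketch below. It is an illustration under stated assumptions, not a description of any company's implementation: the names `FakeEndpoint` and `DegradingClient` are hypothetical, and the "endpoint" is a stand-in for whatever cloud-hosted AI service a team actually calls. The client tries the live service first, stores each successful answer, and serves the last-known output (or a fixed default) when the service is unreachable.

```python
from typing import Callable, Dict, Tuple


class FakeEndpoint:
    """Stand-in for a cloud-hosted AI endpoint (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.up = True  # flip to False to simulate an outage

    def __call__(self, prompt: str) -> str:
        if not self.up:
            raise ConnectionError("endpoint unreachable")
        return f"live answer for: {prompt}"


class DegradingClient:
    """Calls a remote AI service and degrades gracefully when it fails."""

    def __init__(self, remote_call: Callable[[str], str], default_answer: str) -> None:
        self._remote_call = remote_call
        self._default_answer = default_answer
        self._last_known: Dict[str, str] = {}  # prompt -> last successful answer

    def ask(self, prompt: str) -> Tuple[str, str]:
        """Return (answer, source), where source is 'live', 'cache' or 'default'."""
        try:
            answer = self._remote_call(prompt)
        except (ConnectionError, TimeoutError):
            # Service is down: serve the last-known output if one exists.
            if prompt in self._last_known:
                return self._last_known[prompt], "cache"
            # Nothing cached: return a safe default so the feature still responds.
            return self._default_answer, "default"
        # Service is up: remember the answer for future outages.
        self._last_known[prompt] = answer
        return answer, "live"
```

Returning the source alongside the answer lets the calling feature tell users when it is showing stale or placeholder results, which is arguably the point of degrading gracefully rather than failing with a hard error.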
Don’t Bet Everything on the Cloud
Pedrotti said his company does not rely on AI services entirely for its business, so the outage mostly impacted its marketing and SEO work. It's important, he added, to make sure every mission-critical function has a failover option.
“I am reminded of the necessity for businesses to be using a hybrid model to manage their dependency on technology,” he said. “On the one hand, we use cloud providers for scalable capacity and global availability for our gaming guide content and distribution but intentionally use bare-metal servers for our main web hosting and critical backend infrastructure.”
Diversify Your AI Stack
Dr. Bates also recommended diversifying AI services across multiple cloud providers instead of relying on only one. The goal is to treat AI as a mission-critical service, similar to accounting services or a website that needs to be continually operational.
“Distributing workloads across multiple regions or platforms such as AWS, Azure, Google Cloud Platform, as well as Oracle and Tencent, is a good move. Implementing automatic failover so that another system can take over immediately and maintaining regular, accessible backups to ensure business continuity during outages also makes sense.”
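The automatic failover Dr. Bates mentions can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the provider names, the `call_with_failover` function and the two stand-in provider functions are all hypothetical, and a real deployment would call each vendor's own client library and add timeouts, health checks and retries.

```python
from typing import Callable, List, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has failed."""


def call_with_failover(providers: List[Provider], prompt: str) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, answer) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def provider_down(prompt: str) -> str:
    """Hypothetical provider that simulates a regional outage."""
    raise ConnectionError("region unavailable")


def provider_up(prompt: str) -> str:
    """Hypothetical provider that is healthy."""
    return f"ok: {prompt}"
```

With this sketch, `call_with_failover([("primary", provider_down), ("secondary", provider_up)], "hi")` skips the failed primary and returns the secondary's answer. If every provider fails, the caller gets a single error listing each failure, which can then trigger a cached or default response.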
In the end, now is the time to take a close look at how your business is impacted by AI and how critical these services can be. The outage caused major disruption, but it can also help companies build a better foundation for AI, one that can withstand a worst-case scenario.