Several AI suppliers across the globe are building supercomputers to give themselves and their customers the compute power needed for AI projects now and into the future.
Let's take a look at the top enterprises involved in this endeavor.
OpenAI (With Microsoft)
In collaboration with Microsoft, OpenAI is planning to build a $100 billion data center and supercomputer for creating state-of-the-art AI far more capable than today’s technology. According to published reports, the supercomputer, currently named “Stargate,” could be running by 2028, with plans to expand it over the following couple of years. The supercomputer would require as much as five gigawatts of power.
No location for the data center and supercomputer has been chosen. Another, smaller supercomputer is reportedly planned for Wisconsin and would start operations as early as 2026.
Microsoft (Separate From OpenAI)
Microsoft is also deploying supercomputers outside of its partnership with OpenAI. At the Microsoft Build event earlier this year, Azure CTO Mark Russinovich said Microsoft was bringing online the equivalent of five Eagle supercomputers every month.
Eagle is one of the world’s most powerful supercomputers, at 561.2 petaflops.
"We secured that place [in the Top500] with 14,400 networked Nvidia H100 GPUs and 561 petaflops of compute, which at the time represented just a fraction of the ultimate scale of that supercomputer," Russinovich said at the event. "Our AI system is now orders of magnitude bigger and changing every day and every hour.”
Google
Google’s future supercomputer plans are based on the performance of Willow, the company’s latest quantum chip, said Hartmut Neven, VP of engineering at Google and leader of the Quantum AI Lab, in a December blog post.
Willow has state-of-the-art performance across a number of metrics, enabling two major achievements:
- “Willow can reduce errors exponentially as we scale up using more qubits," according to Neven (a sketch of this scaling follows below). "This cracks a key challenge in quantum error correction that the field has pursued for almost 30 years.”
- And, said Neven, “Willow performed a standard benchmark computation in under five minutes that would take one of today’s fastest supercomputers 10 septillion (that is, 10²⁵) years — a number that vastly exceeds the age of the universe.”
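Neven’s first claim describes the exponential suppression of logical errors as a quantum error-correcting code grows. A minimal sketch of that scaling, assuming an illustrative suppression factor of 2 per step (in line with the roughly 2x reduction Google reported as Willow’s code distance grew from 3 to 5 to 7) and a made-up starting error rate:

```python
# Illustrative model of exponential error suppression in a surface code:
# each increase of the code distance by 2 divides the logical error rate
# by a suppression factor Lambda > 1. Lambda = 2 mirrors the ~2x
# suppression Google reported for Willow; the base rate is made up.

LAMBDA = 2.0        # suppression factor per distance-2 step (assumed)
BASE_ERROR = 1e-2   # assumed logical error rate at distance 3

for step, distance in enumerate([3, 5, 7, 9, 11]):
    error_rate = BASE_ERROR / (LAMBDA ** step)
    print(f"distance {distance:2d}: logical error rate ~ {error_rate:.1e}")

# The error rate shrinks geometrically while the qubit count grows only
# polynomially (~d^2), so adding qubits makes the computation *more*
# reliable -- the "below threshold" behavior Neven describes.
```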
Meta
In the spring, Meta announced two 24k GPU clusters built on top of Grand Teton, OpenRack and PyTorch.
By the end of 2024, the company planned to have 350,000 NVIDIA H100 GPUs as part of a portfolio with compute power equivalent to nearly 600,000 H100s.
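For a rough sense of scale, the sketch below converts that H100-equivalent count into theoretical peak training compute, assuming Nvidia’s published dense BF16 figure of roughly 989 teraflops per H100; the result is illustrative and not a number Meta has published.

```python
# Rough theoretical peak for a fleet equivalent to ~600,000 H100s.
# Assumes ~989 dense BF16 TFLOPS per H100 (Nvidia's spec-sheet peak);
# sustained training throughput is far lower in practice.

H100_EQUIVALENTS = 600_000
BF16_TFLOPS_PER_GPU = 989

total_exaflops = H100_EQUIVALENTS * BF16_TFLOPS_PER_GPU / 1_000_000
print(f"Theoretical peak: ~{total_exaflops:.0f} BF16 exaflops")  # ~593
```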
The two newly announced clusters support the company’s current and next-generation AI models, including Llama 3, the successor to Meta’s publicly released Llama 2, as well as AI research and development across GenAI and other areas.
Amazon (With Anthropic)
This past November, Amazon announced it was deepening its collaboration with Anthropic, which will use AWS Trainium and Inferentia chips to train and deploy its future foundation models.
Amazon and Anthropic also previously announced a strategic collaboration, which included Anthropic naming Amazon Web Services (AWS) its primary cloud provider and Amazon making a $4 billion investment in Anthropic.
This next phase is expected to further enhance the performance, security and privacy Amazon Bedrock provides for customers running Claude models. Anthropic's Claude 3.5 Haiku and upgraded Claude 3.5 Sonnet in Amazon Bedrock will be the AI company's most intelligent models to date.
Nvidia
In mid-December, Nvidia announced a new compact generative AI supercomputer: the Jetson Orin Nano Super Developer Kit, priced at $249. Compared with its predecessor, it delivers as much as a 1.7x increase in generative AI inference performance, a 70% increase in compute performance to 67 INT8 TOPS and a 50% increase in memory bandwidth to 102GB/s.
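Those percentages can be checked against the original Jetson Orin Nano’s published specs of roughly 40 INT8 TOPS and 68GB/s of memory bandwidth; the predecessor figures in the sketch below are assumptions drawn from those spec sheets, not from this announcement.

```python
# Sanity check of Nvidia's quoted uplifts for the Jetson Orin Nano Super.
# Predecessor figures (40 INT8 TOPS, 68 GB/s) are assumed from Nvidia's
# published Jetson Orin Nano specs.

old_tops, new_tops = 40, 67   # INT8 TOPS
old_bw, new_bw = 68, 102      # memory bandwidth, GB/s

print(f"Compute uplift:   {new_tops / old_tops - 1:.0%}")  # ~68%, rounded to "70%"
print(f"Bandwidth uplift: {new_bw / old_bw - 1:.0%}")      # 50%
```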
The developer kit consists of a Jetson Orin Nano 8GB system-on-module (SoM) and a reference carrier board, providing an ideal platform for prototyping edge AI applications.
The SoM features an NVIDIA Ampere architecture GPU with tensor cores and a 6-core Arm CPU. It can support up to four cameras, offering higher resolution and frame rates than previous versions.
IBM
IBM’s next-generation quantum processor will be deployed at the RIKEN Center for Computational Science in Kobe, Japan, and will be the only instance of a quantum computer co-located with the supercomputer Fugaku, according to IBM.
RIKEN has dedicated use of an IBM Quantum System Two for its project. In addition, IBM will work to develop the software stack dedicated to generating and executing integrated quantum-classical workflows in a heterogeneous quantum-HPC hybrid computing environment. These new capabilities are geared toward delivering improvements in algorithm quality and execution times.
The IBM Quantum System Two deployed at RIKEN and integrated with Fugaku will mark the introduction of IBM’s next-generation quantum computing architecture, combining expandable cryogenic infrastructure, modular quantum control electronics and advanced system software.
“As the first quantum system that will directly connect with the Fugaku classical supercomputer, IBM's agreement with RIKEN marks a monumental milestone in the journey towards a future defined by quantum-centric supercomputing,” Jay Gambetta, who leads the team at IBM working to build a quantum computer, said in a blog post.
xAI
Elon Musk's xAI is ramping up its Memphis, Tennessee-based supercomputer to house at least one million graphics processing units. Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models, with chatbots offered as a feature for X Premium subscribers.
According to Nvidia, whose chips are integral to the operation of Colossus, the system has experienced zero application latency degradation or packet loss due to flow collisions and has maintained 95% data throughput, enabled by Spectrum-X congestion control. This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.
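To put those throughput percentages in concrete terms, the sketch below translates them into effective bandwidth on a single link. The 400Gb/s port speed is an assumption (a typical Spectrum-X line rate), not part of Nvidia’s statement.

```python
# Effective per-link bandwidth at the quoted throughput levels.
# The 400 Gb/s line rate is an assumed, typical Spectrum-X port speed.

LINE_RATE_GBPS = 400

for label, efficiency in [("Spectrum-X", 0.95), ("standard Ethernet", 0.60)]:
    print(f"{label:17s}: {LINE_RATE_GBPS * efficiency:.0f} Gb/s effective")

# 380 Gb/s vs. 240 Gb/s per link -- at cluster scale the gap compounds
# across thousands of links carrying AI training traffic.
```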
"In Memphis, we're pioneering development in the heartland of America," Brent Mayo, an xAI engineer, said in a statement. "We're not just leading from the front; we're accelerating progress at an unprecedented pace while ensuring the stability of the grid utilizing megapack technology."
Cerebras Systems
In March, the company introduced the Wafer Scale Engine 3, designed to provide twice the performance of the Cerebras WSE-2 at the same power draw and price. Purpose-built for training the industry’s largest AI models, the 5nm-based, 4-trillion-transistor WSE-3 powers the Cerebras CS-3 AI supercomputer.
The WSE-3 packs its 4 trillion transistors and 900,000 AI cores onto TSMC’s 5nm process, delivering 125 petaflops of peak AI performance alongside 44GB of on-chip SRAM, and CS-3 clusters can scale up to 2,048 systems. The CS-3 is designed to train next-generation frontier models: 24-trillion-parameter models can be stored in a single logical memory space without partitioning or refactoring.
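The 24-trillion-parameter claim is easiest to appreciate as a memory-footprint calculation. The sketch below assumes 2 bytes per parameter (FP16/BF16 weights), which is an assumption about precision rather than a Cerebras specification.

```python
# Memory footprint of a 24-trillion-parameter model held in one logical
# address space, assuming 2 bytes (FP16/BF16) per weight.

PARAMS = 24e12
BYTES_PER_PARAM = 2  # assumed FP16/BF16 storage

terabytes = PARAMS * BYTES_PER_PARAM / 1e12
print(f"Weights alone: ~{terabytes:.0f} TB")  # ~48 TB
# Far beyond any single accelerator's memory, which is why Cerebras
# presents external memory as one logical space so models need no
# partitioning or refactoring.
```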
The CS-3 is built for both enterprise and hyperscale needs. Compact four-system configurations can fine-tune 70B-parameter models in a day, while at full scale, using 2,048 systems, Llama 70B can be trained from scratch in a single day, according to the company.
The University at Albany
The university’s new $16.5 million supercomputer, unveiled in October, is the most powerful in the State University of New York (SUNY) system.
The supercomputer’s systems — slotted in server cabinets in the University’s Data Center — contain 192 NVIDIA Tensor Core GPUs, networked with about three miles of fiber-optic cabling. The GPUs’ accelerated computing power is designed to quickly handle high-intensity AI tasks that require massive numbers of parallel calculations over large datasets.
“We’ve always had a high-performance computing cluster, but this is dramatically different,” said Brian Heaton, UAlbany’s chief information officer. “AI calls for intense mathematical computations. Your average computer is not able to keep up the pace of those calculations in order to make it productive in a reasonable time frame. A researcher developing AI can’t wait weeks and weeks for the results to come back. The more horsepower we can give our researchers and our faculty, the faster response times they get and the more productive they’ll be in the AI space.”