Managing Cloud Spend in the GenAI Age

When IBM announced its acquisition of HashiCorp, a cloud management software provider and IBM’s second-largest acquisition ever, it put the spotlight on the growing need for IT managers everywhere to automate the use of multiple cloud computing platforms.

As enterprises continue to shift to the cloud for access to critical applications, data and development of generative AI-based solutions, they’re using many different cloud platforms, each with complex subscriptions and service-level agreements (SLAs). Automated cloud management solutions, such as those offered by HashiCorp and other vendors, fill a rising need, but the fact that they’re needed in the first place points to the growing challenges and high costs of managing hybrid and multi-cloud environments and data centers. Managing cloud spend is the number-one challenge organizations face, with more than a quarter of respondents reporting that they spend over $12 million a year on the cloud, according to a survey conducted by Flexera.

The Cloud and AI

Adding to growing cloud costs is the explosion of GenAI tools. The memory, compute power, graphics processing units (GPUs) and storage required to train AI models cost more than standard cloud usage, but they give AI solutions the fast, scalable delivery vehicles they need to be effective.

Cloud providers offer specialized AI services, such as managed machine learning (ML) platforms, pre-trained models and other tools, which come at a high cost. In other cases, enterprises may require bespoke AI solutions tailored to their specific needs, which involve the expertise of solution providers, with additional configuration and support costs. Those types of services, however, allow organizations to leverage expert-built infrastructure and tools, accelerating development and time to market.

Below are some of the AI requirements that can add significantly to cloud workloads:

Massive computational power: Training AI models, especially those involving neural networks, means processing large volumes of data with massive parallelism to handle complex mathematical operations. That requires high-performance GPUs or specialized integrated circuits such as Tensor Processing Units (TPUs), which are extremely expensive to run.
Robust data storage: AI models require access to extensive data sets for training and validation. That data must be stored and managed, often across distributed systems.
Continuous training: AI development is iterative. To train models for continuous learning, multiple versions of data sets and models need to be kept available, increasing storage needs.
Network bandwidth: The sheer volume of data used to train AI models, combined with development teams that may be distributed across the world, requires high-throughput network infrastructure and fast, seamless collaboration capabilities.

Best Practices to Manage Rising Cloud Costs

While cloud costs can be a major source of concern and contention for enterprises, there are ways to manage them. A first step is to create a dedicated cloud cost management team comprising finance, IT, line-of-business and other leaders. The team’s primary responsibility should be to review and approve cloud spending, identify cost-saving opportunities through tighter SLAs and spot overlapping subscriptions.

Managing cloud spend also requires real-time visibility and oversight. Cloud cost management tools can provide usage analytics in real time, as well as alerts when budget thresholds are being reached, so enterprises can identify cost anomalies and predict future needs.
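To make the idea concrete, the three capabilities mentioned above (threshold alerts, anomaly detection and forecasting) can be sketched in a few lines of Python. This is only an illustration, not how any particular tool works: the function names, the 50/80/100 percent thresholds, the seven-day window and the simple straight-line forecast are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def budget_alerts(spend_to_date, monthly_budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget thresholds (as fractions) that spend has reached.

    The default thresholds are hypothetical; a real team would set its own.
    """
    used = spend_to_date / monthly_budget
    return [t for t in thresholds if used >= t]


def find_anomalies(daily_costs, window=7, z_limit=3.0):
    """Return indexes of days whose cost is far above the trailing average.

    A day is flagged when it sits more than z_limit standard deviations
    above the mean of the previous `window` days. Days that follow a
    perfectly flat history (zero deviation) are not flagged.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_costs[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


def forecast_month_end(spend_to_date, day_of_month, days_in_month):
    """Project month-end spend by assuming the current daily rate continues."""
    return spend_to_date / day_of_month * days_in_month


if __name__ == "__main__":
    # Made-up figures for illustration only.
    print(budget_alerts(8_500, 10_000))
    print(find_anomalies([100, 102, 98, 101, 99, 100, 103, 400]))
    print(forecast_month_end(8_500, 20, 30))
```

With these made-up figures, spend has crossed the 50 and 80 percent thresholds, the final day's spike is flagged, and the projection lands above the $10,000 budget, which is the kind of early warning a cost team would want to act on.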

Another key issue to consider as companies move to multi-cloud environments is the cost of moving data. Transferring data from one cloud vendor, such as Amazon Web Services (AWS), Microsoft Azure or Google Cloud, to another can get expensive, which is one reason many companies lock in with a single vendor. Cloud vendors may also charge a fee every time a company retrieves data. While each charge can be small, for large enterprises it can add up.
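A quick back-of-the-envelope calculation shows how small per-unit charges accumulate. The rates and volumes below are hypothetical placeholders, not any vendor's actual pricing; substitute the figures from your own contracts.

```python
def monthly_data_charges(gb_transferred, transfer_rate_per_gb,
                         retrievals, fee_per_retrieval):
    """Estimate monthly data charges from transfer volume and retrieval count.

    All inputs are assumptions supplied by the caller; nothing here reflects
    a real provider's price list.
    """
    transfer_cost = gb_transferred * transfer_rate_per_gb
    retrieval_cost = retrievals * fee_per_retrieval
    return transfer_cost + retrieval_cost


if __name__ == "__main__":
    # Hypothetical: 150 TB moved out per month at $0.05 per GB,
    # plus 2 million retrieval requests at $0.0004 each.
    total = monthly_data_charges(150_000, 0.05, 2_000_000, 0.0004)
    print(f"Estimated monthly data charges: ${total:,.2f}")
```

Under those assumed numbers the bill comes to roughly $8,300 a month, or about $100,000 a year, from fees that each look negligible on their own.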

Fees for moving data between cloud vendors are starting to ease, however. Most major cloud providers have long charged for that transfer, but more recently they have begun waiving those charges. Earlier this year, Google announced it had stopped charging Google Cloud customers a fee to migrate their data to another cloud provider or on-premises data center.

Finally, while AI is a key consumer of cloud resources, it can also be used to better manage them. Machine learning solutions can identify where overspending is occurring in multi-cloud environments and help determine optimum infrastructure configurations.

The cloud is becoming the platform of choice for organizations everywhere, partly because of its ability to reduce the costs associated with resource-intensive on-site data centers. Yet as it becomes more pervasive across the organization, incremental costs add up, and they are now being compounded by the cloud-guzzling needs of AI. Managing those costs through a strategic approach to cloud consumption will enable enterprises to realize the huge potential of AI and the advantages of the cloud without breaking the bank.
