Current statistics about the state of AI show that most organizations have moved from the “interested in” phase to deployment mode.
In the last two months alone, McKinsey’s Global Survey on AI found that 65% of reporting organizations regularly use generative AI (GenAI), while Microsoft and LinkedIn’s 2024 Work Trend Index found that 75% of employees surveyed use GenAI in their daily work.
With the rise of adoption and AI successes, there is also inevitably a rise in failures. A quick search on the term “AI failures” yielded a number of interesting lists, along with an entire subreddit devoted to “unintentionally hilarious and terrible” AI fails.
Quality encompasses many things, and it’s clear that organizations releasing AI tools need to be testing for all of them to ensure that these tools work as intended.
The Risks of Rushing AI to Market
The reality is that many organizations are under pressure to release software faster — a pressure that is rising as companies race to incorporate AI. This rush can lead to trading off comprehensive testing for speed, which in turn risks quality and user experience (UX) issues. That’s if your new product or feature even makes it off the ground, as nearly one-third of software projects fail due to inadequate testing. Balancing speed, quality and cost is an infamously tough balancing act.
AI takes these issues and supercharges them. In addition to identifying defects, errors and bugs, along with flaws in user experience, QA teams also need to look at the risks that GenAI can introduce: inaccuracy, bias, safety and security.
My last blog talked about how consumers use generative AI and explores the types of user experience (UX) flaws and harmful outputs they discovered. Often, these problems happen because of poor quality or insufficient data used to train the algorithm, things that could have been addressed earlier.
Testing GenAI is also very different from testing traditional software, because its outcomes are not deterministic. In the past, QA teams could have a pretty good idea before they started testing on the sorts of issues they are likely to encounter, which made writing test cases easier. Generative AI is trickier because its responses are always different; it may pass tests sometimes and fail them others. This adds greater complexity to testing.
Related Article: How AI Bias Creates Dependency and Inequality
The Importance of User Experience for the Bottom Line
What many organizations fail to consider when they rush to market is that a poor GenAI user experience can directly affect the bottom line.
There are a number of fascinating UX stats, but two emphasize this point very well. One report found that 77% of consumers stopped using or deleted an app due to poor performance within the previous 12 months. Another found that users abandon their shopping carts at an average rate of around 70%, with many citing bad UX — like poor site design, complicated checkout processes and more — as the reason.
It’s plain to see: Flawless digital experiences are not a nice to have for the end user, they are the bare minimum.
But because generative AI is so new, there’s very little guidance on how to build a great application with a good user experience. Both Google and IBM, among others, have begun documenting general guidelines and design principles for these applications that include how to design for the end user experience and incorporate user feedback. This feedback can be fed back to QA to ensure that the applications are working as intended for both the organization and the user.
New Strategies for AI Testing
QA teams are now starting to update their approaches on how to effectively test AI (both traditional and GenAI-infused applications). For example, designing tests around adherence to Retrieval-Augmented Generation (RAG) and System Prompt settings, defining risk tolerance and incorporating adversarial (red team) testing, defining quality measurement criteria such as the level of accuracy and tone of responses, directed prompt testing, benchmarking different models and versions and establishing monitoring approaches in production.
These approaches allow teams to simulate attacks and probe for weaknesses, and, most importantly, fix these issues before they are released to end users. In addition, there is growing recognition for the need to have natural usage testing as a key part of the testing plan. This involves having a team of real world users stress test the application under multiple real world conditions.
And by real-world conditions, we mean the whole real world, not just within North America. Even though most GenAI applications are being developed for a global market, testing is done predominantly in English. QA and test teams must make sure that the voice elements of generative AI account for different accents, ages and speech impairments on top of regional dialects across multiple languages.
When we talk about AI testing, we are not just addressing companies developing LLMs. Over the coming years, we will see more companies from across industries developing AI agents. These are another kettle of fish because companies must ensure that their AI assistants have been properly trained and tested to integrate well with their specific back-end company systems and data.
Related Article: The Impact of Generative AI on E-Commerce Search
The Future of QA: Integrating Quality From the Start
As more organizations roll out generative AI applications, quality assurance will continue to evolve and become more integrated into the development process. Ideally, digital quality will focus less on finding defects and more on creating systems and processes at all stages of development that prevent them from occurring in the first place.
Balancing speed and quality will help ensure that the applications built not only meet market demands, but demonstrate to users the promise of AI. It will also ensure that the proper checks and balances are in place to recognize when AI is being used for malicious purposes.
Think of the recent example of the student who built a nuclear reactor in his room with GenAI. Red teaming, in this instance, can help make the AI application recognize when people are using it for scientific research versus building weapons.
QA’s future will be less about verifying technical problems and more about assuring that the quality of interactions are where they should be.
Learn how you can join our contributor community.