AI Testing Under Pressure as Companies Race to Stay Ahead

When OpenAI’s ChatGPT was released toward the end of 2022, it quickly became the fastest-adopted technology product ever, reaching 100 million monthly active users just two months after launch.

Machine learning (ML) and AI had been around for years prior to this, but as I noted in my first blog for VKTR, ChatGPT helped move them "from the realm of sci-fi and tech into public consciousness.” It’s been a hyped-up and crazy technology cycle that got even wilder at the end of January when Chinese AI startup DeepSeek rolled out the newest version of its app, which quickly surpassed ChatGPT as the highest-rated free app on the App Store in the United States.

There were numerous articles written about DeepSeek’s impact on the global AI market, and then seemingly just minutes later, Alibaba claimed that its new AI model can perform better than anything existing on the market. In February, AI players like Baidu and others began to offer their services for free in response to DeepSeek’s popularity.

How all of this plays out in terms of market dominance is anyone’s guess, but the news gives us an opportunity to talk (again) (and more) about the risks inherent in AI, how this will only increase the pressure on companies to compete and release faster and why, as AI reasoning gets better and new applications and use cases explode, testing becomes even more critical.

Generative AI Is Risky Business

Due to its speed, range and general unknowns, generative AI compounds existing risks in applications and introduces new ones. These range from biased, toxic, inaccurate and inconsistent responses to misuse from bad actors, legal and security risks and regulatory compliance with local, regional and national requirements — and more may emerge. Given the complexity and probabilistic nature of the systems powering genAI, and countless use cases, you’re not sure what kind of responses you'll get in many cases unless you define them very, very clearly.

For organizations looking to optimize their genAI training and testing, it’s important to start with a risk framework. This needs to be specific to your organization, and regularly reviewed and updated. You should:

Define the risks that are of highest concern and how you plan to assess them.
Next, decide where these testing procedures fit into your software development life cycle (SDLC).
Finally, make sure that you are conducting ongoing testing to uncover not just problems, but feedback on how the system is performing in the real world.

There is always going to be some level of risk with generative AI. I’ve talked in past blogs about the importance of red teaming, a technique designed to identify points of failure that can be tough to surface through automated tests alone. Developers can then use the resulting information to retrain models or develop “guardrails” — that is, rules to mitigate risk. This systematic, adversarial approach is employed by human testers and reduces issues in AI models and solutions by focusing on common problems related to security, safety, accuracy, functionality and performance.

Red-Teaming Best Practices

Microsoft had one of the first red teams in the industry and regularly uses the technique as part of its genAI product development. It recently released a white paper outlining key findings from red teaming over 100 AI products.

Its three key takeaways from the experience mirror what my own company has discovered in our AI testing and training engagements. These include:

“Generative AI systems amplify existing security risks and introduce new ones.”
“Defense in depth is key for keeping AI systems safe.”
“Humans are at the center of improving and securing AI.”

Microsoft cites subject-matter expertise, cultural competence and emotional intelligence as key reasons humans are important to keeping AI safe and secure. I agree.

The most effective testing for any application is to leverage real-world users to explore the “mights” — how might a person in X country, with XYZ devices, interact with this? How might a user respond to this answer?

Learning Opportunities

Webinar

Dec

Rebrand. Migrate. Optimize. How to Do It All (Without Slowing Down)

Cresta leveled up site speed, design flexibility and marketer sanity (in record time). Find out how.

Webinar

Dec

[EIS Webinar] Beyond the Pilot: Why Most GenAI Projects Fail to Scale and How to Become One of the Success Stories

Move from experimental projects to integrated solutions that drive strategic decision-making.

Webinar

On demand

From Manual to Magical: How AI Transforms CX Teams

Learn how to replace manual support processes with automation that actually delivers.

Watch Now

Webinar

On demand

How to Build a Solid Knowledge Foundation for AI Success

See how leading brands keep their AI honest, compliant and actually helpful.

Watch Now

Webinar

On demand

Fix the Content Bottleneck: Build a Better WebOps Strategy

Content stalled? Dev overloaded? You’re not the only one. Learn how streamlined WebOps bridges the publishing gap.

Watch Now

Webinar

On demand

Beyond Storage: Smarter Content, Bigger Impact with DAM + AI

Discover how the DAM + AI duo makes content smarter, stronger and more accessible.

Watch Now

Webinar

Dec

Rebrand. Migrate. Optimize. How to Do It All (Without Slowing Down)

Cresta leveled up site speed, design flexibility and marketer sanity (in record time). Find out how.

Webinar

Dec

[EIS Webinar] Beyond the Pilot: Why Most GenAI Projects Fail to Scale and How to Become One of the Success Stories

Move from experimental projects to integrated solutions that drive strategic decision-making.

Webinar

On demand

From Manual to Magical: How AI Transforms CX Teams

Learn how to replace manual support processes with automation that actually delivers.

Watch Now

In addition to red teaming, keeping humans in the loop can also help ensure that high-quality training data gathered from various cultural and linguistic backgrounds is part of your AI model. Subject-matter experts can also fine-tune large language models for specific tasks and evaluate quality metrics like accuracy, coherence and tone in a more effective way than automated testing.

Testing the Unknown

The number of gen AI apps and features released into the market in just the past WEEK demonstrates how quickly this industry is moving, and how existing pressures to release quickly are unlikely to let up.

In the face of many unknowns and expanding risks, red teaming, quality assurance and testing with experts and end users are just some of the techniques organizations can use to source the right data, optimize models and ensure that AI features are safe, reliable and being used as intended.

fa-solid fa-hand-paper Learn how you can join our contributor community.