Enterprises Are Building a New AI QA Playbook

In my last VKTR article, I discussed how agentic AI brings both autonomy AND risk, and that we’ll need new quality assurance strategies and testing to provide a strong, scalable, trustworthy foundation.

Fortunately, the industry isn't standing still. I see progress and investment in a number of new or enhanced strategies.

LLMs Trained to Increase Relevancy and Accuracy — While Reducing Bias
More Organizations Understand Rigorous Security Testing Is Table Stakes
Expertise Makes Large Language Models Better
Roller Coaster or Ferris Wheel?

LLMs Trained to Increase Relevancy and Accuracy — While Reducing Bias

This is done by keeping humans in the loop, both from a data sourcing AND testing perspective. You may have noticed that I really enjoy writing about this topic.

The notion that we can “set it and forget it” with AI agents isn’t realistic for many reasons. While a properly trained agent is truly intelligent and adaptable, it doesn’t have the human capacity for judgement, which is critical in the unexpected or imperfect situations that arise all the time in the real world. Hybrid evaluations that combine human-in-the-loop and automated assessments give organizations a comprehensive approach to testing that can be adapted to their own unique business cases.
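
As a rough illustration of what a hybrid evaluation might look like in practice, here is a minimal Python sketch. It assumes a hypothetical automated grader that has already scored each response from 0 to 1; the names (`triage`, `EvalResult`) and the threshold values are illustrative assumptions, not taken from any particular tool. Clear passes and clear failures are decided automatically, and everything in the uncertain middle band is routed to a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    response_id: str
    auto_score: float   # score from a hypothetical automated grader, 0.0 to 1.0
    verdict: str        # "pass", "fail" or "human_review"


def triage(scores: dict[str, float], low: float = 0.4, high: float = 0.8) -> list[EvalResult]:
    """Split automated scores into auto-decided cases and cases for human review.

    The thresholds are assumptions; a real team would tune them per use case.
    """
    results = []
    for response_id, score in scores.items():
        if score >= high:
            verdict = "pass"            # confident enough to accept automatically
        elif score <= low:
            verdict = "fail"            # confident enough to reject automatically
        else:
            verdict = "human_review"    # ambiguous: send to a person
        results.append(EvalResult(response_id, score, verdict))
    return results


if __name__ == "__main__":
    sample = {"resp-1": 0.93, "resp-2": 0.55, "resp-3": 0.12}
    for result in triage(sample):
        print(result.response_id, result.verdict)
```

Widening the gap between the two thresholds sends more cases to people, which is one simple way to adapt the balance to a given business case.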

More Organizations Understand Rigorous Security Testing Is Table Stakes

If there’s any topic that I like writing about more than “humans in the loop,” it would have to be red team testing. I referenced this concept in my very first VKTR article in 2024, and in three other posts as well.

Red team testing, its legacy in cybersecurity and its effectiveness in securing AI systems are not new concepts. However, red teaming is not yet a requirement of AI development for most internal teams, which are undoubtedly balancing aggressive launch deadlines with limited expertise and resources. A 2025 report found only 33% of organizations leverage this QA best practice.

But cutting corners on quality can have severe consequences down the road. We’re seeing our clients (some of the world’s largest brands) embed red team expertise and execution into the SDLC, understanding that it is the most efficient way to guard against post-launch headaches and very real costs. Sophisticated safety evaluations and red teaming are becoming an essential part of the AI development cycle for all companies.

Expertise Makes Large Language Models Better

I wrote previously about how subject-matter experts can help fine-tune AI for specific use cases and evaluate accuracy, tone and coherence. (Note: ANOTHER human-in-the-loop plug.) We’re seeing many organizations improve their AI data with the help of experts and generalists alike.

In a recent example, a financial software company approached my company to evaluate and test its model for safety, accuracy and potential harms. The company required financial experts to test the model against various criteria and provide feedback. As a result, we identified dozens of CFOs and ran moderated studies with them. They were willing to provide critical feedback, insights and issues from their own data generated by the agent being tested.

Roller Coaster or Ferris Wheel?

AI strategies will continue to evolve, and more changes are on the horizon. Are we boarding another roller coaster, or are we in for a gentler ride?

Businesses that want to ride out these twists and turns should start with the strongest foundation (a diverse human dataset), embed rigorous testing throughout the SDLC and use the best experts (on-hand or within reach) to validate and optimize their findings.

If there’s one thing that is true about AI, it’s that its users are ultimately human and we should all strive as leaders to minimize AI’s risks (inaccuracy, bias, toxicity) and maximize AI’s incredible value (greater efficiency, higher productivity).
