In my last VKTR article, I discussed how agentic AI brings both autonomy AND risk, and that we’ll need new quality assurance strategies and testing to provide a strong, scalable, trustworthy foundation.
Fortunately, the industry isn't standing still. I see progress and investment in a number of new or enhanced strategies.
Table of Contents
- LLMs Trained to Increase Relevancy and Accuracy — While Reducing Bias
- More Organizations Understand Rigorous Security Testing Is Table Stakes
- Expertise Makes Large Language Models Better
- Roller Coaster or Ferris Wheel?
LLMs Trained to Increase Relevancy and Accuracy — While Reducing Bias
The way to get there is by keeping humans in the loop, from both a data-sourcing AND a testing perspective. You may have noticed that I really enjoy writing about this topic.
The notion that we can “set it and forget it” with AI agents isn’t realistic, for many reasons. While a properly trained agent is truly intelligent and adaptable, it doesn’t have the human capacity for judgement, which is critical in the unexpected or imperfect situations that arise all the time in the real world. Hybrid evaluations that combine human-in-the-loop and automated assessments give organizations a comprehensive approach to testing that can be adapted to their own unique business cases.
More Organizations Understand Rigorous Security Testing Is Table Stakes
If there’s any topic that I like writing about more than “humans in the loop,” it would have to be red team testing. I referenced this concept in my very first VKTR article in 2024, and in three other posts as well.
Red team testing is not a new concept, and neither are its legacy in cybersecurity or its effectiveness in securing AI systems. However, red teaming is not yet a requirement of AI development for most internal teams, which are undoubtedly balancing aggressive launch deadlines against limited expertise and resources. A 2025 report found only 33% of organizations leverage this QA best practice.
But cutting corners on quality can have severe consequences down the road. We’re seeing our clients (some of the world’s largest brands) embed red team expertise and execution into the SDLC, understanding that it is the most efficient way to guard against post-launch headaches and very real costs. Sophisticated safety evaluations and red teaming are starting to become an essential part of the AI development cycle for all companies.
Expertise Makes Large Language Models Better
I wrote previously about how subject-matter experts can help fine-tune AI for specific use cases and evaluate accuracy, tone and coherence. (Note: ANOTHER human-in-the-loop plug.) We’re seeing many organizations improve their AI data with the help of experts and generalists alike.
In a recent example, a financial software company approached my company to evaluate and test its model for safety, accuracy and potential harms. The company required financial experts to test the model against various criteria and provide feedback. As a result, we identified dozens of CFOs and ran moderated studies with them, in which they provided critical feedback, insights and issues based on their own data generated by the agent being tested.
Roller Coaster or Ferris Wheel?
AI strategies will continue to evolve, and more changes are on the horizon. Are we boarding another roller coaster, or are we in for a gentler ride?
Businesses that want to ride out these twists and turns should start with the strongest foundation (a diverse human dataset), embed rigorous testing throughout the SDLC and use the best experts (on-hand or within reach) to validate and optimize their findings.
If there’s one thing that is true about AI, it’s that its users are ultimately human. As leaders, we should all strive to minimize AI’s risks (inaccuracy, bias, toxicity) and maximize AI’s incredible value (greater efficiency, higher productivity).