How to Evaluate AI Foundation Models? Don't!

A lot of writing and work goes into evaluating models, comparing them and picking the "best" one. I am going to argue that, for most businesses, this is not the best use of resources if the goal is to deliver mind-boggling value.

What matters instead is picking use cases that are both deliverable today and high value to the business. It’s the 10x augmentation that will make a major difference to the business and prove AI is the powerhouse it is expected to be.

Pick a Model But Do Not Lock On It
Think Big, Yet Deliverable
High-Value Deliverable Use Cases
How to Deliver? Get Cycles, Iterate Fast
Think Go Live: Architecture Matters
The Way Forward

Pick a Model But Do Not Lock On It

But before we start on use cases, you still have to pick a model …

High-quality models (Claude 3.5, GPT-4, Gemini 1.5) have largely converged today and offer similar capabilities for most use cases. They also regularly leapfrog one another, depending on their release cycle, training period, etc. The convergence is expected to continue, as the research community is focusing on the same problems and the models are built on similar architectures and similar training data.

A good way to go about selection is to base it on hard facts and constraints:

How easy is it to access, given your company’s policies, data privacy rules, etc.? The cloud provider you already use most is probably a good place to start. Running on AWS? Let’s go with Claude. On GCP? Let’s start with Gemini. Or if on Azure, then GPT-4 it is.
Are the languages you need supported by the model? This can be an important driver, as not all models are trained on the same set of languages.
What is the expected input context you need? Models have various capabilities in how much they can take in for a single task. Based on the amount of data you will need to deliver on your business case, this can be a driver for choice.

And last but not least, make sure you do not lock yourself into a model: the choice of model must be a run-time decision. To benefit from the rapidly evolving model landscape, design your project so you can easily switch and upgrade, moving your processing onto higher-quality models as they come online without disrupting your service.
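As a rough illustration of what "run-time decision" can mean in practice, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `LLM_MODEL` environment variable, the `resolve_model` helper and the two stub backends stand in for whatever configuration system and provider SDKs you actually use.

```python
import os
from typing import Callable, Dict, Optional

# A backend is any callable that turns a prompt into text.
ModelFn = Callable[[str], str]


# Hypothetical stand-ins for real provider SDK calls.
def claude_backend(prompt: str) -> str:
    return f"[claude] {prompt}"


def gemini_backend(prompt: str) -> str:
    return f"[gemini] {prompt}"


BACKENDS: Dict[str, ModelFn] = {
    "claude": claude_backend,
    "gemini": gemini_backend,
}


def resolve_model(name: Optional[str] = None, default: str = "claude") -> ModelFn:
    """Pick a backend at run time: explicit name, then LLM_MODEL, then default."""
    key = name or os.environ.get("LLM_MODEL") or default
    if key not in BACKENDS:
        raise KeyError(f"Unknown model {key!r}; known: {sorted(BACKENDS)}")
    return BACKENDS[key]


def summarize(text: str, model_name: Optional[str] = None) -> str:
    # Business logic only sees ModelFn, never a provider-specific client.
    model = resolve_model(model_name)
    return model(f"Summarize: {text}")
```

With this shape, moving to a newer model is a configuration change (a new entry in `BACKENDS` and a different `LLM_MODEL` value) rather than a rewrite of the code that calls it.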

Think Big, Yet Deliverable

The more important task, as you start or advance your AI journey, is to select the right use cases to deliver value to the business and convince it of the value of AI. Too many AI initiatives fail because the use case was either not deliverable, hence producing mediocre results, or not ambitious enough, hence delivering low value to the business.

As with any project, it is important to understand the business domain, the nature of the processes, how teams work and interact, and how information is pulled into the processes.

There are typical characteristics of what constitutes a good use case:

Compression vs. Expansion

LLMs are great compression machines for language and meaning. They can take in a large amount of content and compress it into a small output based on your instructions (constraints). They typically work best with clear constraints rather than, say, a request for a general summary. Therefore, use cases where you can feed in a lot of context and give clear, structured instructions will perform very well.

Structured Output

Counterintuitively, we’ve observed that LLMs deliver much better results when the output is more formal and structured, be that a good JSON data structure or a formal document such as a review document or a scorecard. Use cases involving structured output tend to do better than free-form content.
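One practical upside of structured output is that it can be checked mechanically before anything downstream relies on it. The sketch below assumes a hypothetical review scorecard with three fields (`compliant`, `score`, `issues`); the field names and the 0 to 10 range are invented for illustration, and `SAMPLE` stands in for a real model response.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Scorecard:
    compliant: bool
    score: int          # 0 to 10 (range assumed for this example)
    issues: List[str]


def parse_scorecard(raw: str) -> Scorecard:
    """Parse and validate model output; raise ValueError on anything off-schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"compliant", "score", "issues"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["compliant"], bool):
        raise ValueError("compliant must be true or false")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("score must be an integer from 0 to 10")
    issues = data["issues"]
    if not isinstance(issues, list) or not all(isinstance(i, str) for i in issues):
        raise ValueError("issues must be a list of strings")
    return Scorecard(data["compliant"], score, issues)


# Stand-in for a model response.
SAMPLE = '{"compliant": false, "score": 4, "issues": ["Missing termination clause"]}'
card = parse_scorecard(SAMPLE)
print(card.compliant, card.score, card.issues)
```

A response that fails validation can be retried or flagged for a human, which is much harder to do with free-form prose.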

Short-Form Content

LLMs have a small maximum output in a single run, typically eight to 16 pages of standard English text (4,000 to 8,000 tokens, with a more recent one going to 16,000). But regardless of the maximum output, we are observing that the longer the content, the lower the quality of the output. This circles back to the first characteristic: compression vs. expansion. There are techniques to deliver long-form content, but they require more sophisticated approaches and systems. They are totally suitable down the road, but maybe not early in your journey.

A Simple Framework to Evaluate

To help businesses evaluate potential AI use cases, here's a framework of questions to consider:

Alignment with business objectives

Does this use case directly support our core business goals?
How much would efficiency or opportunity increase if we improved it?

Measurable impact

Can we clearly define success metrics for this use case?
How will we measure the ROI of this LLM implementation?

Data availability and quality

Do we have the necessary content to feed the model?
Is it easily available?

Technical feasibility

Is it a compression task or an expansion task?
How does it score on the characteristics above (structured output, short-form content)?

Scalability

Can this solution be scaled across the organization if successful?
How will it integrate with our existing systems and processes?

Stakeholder buy-in

Who are the key stakeholders for this use case?
How can we ensure their support and engagement throughout the process?
What do they need to see to benefit from the project?

By systematically working through these questions, you can evaluate potential AI use cases more effectively. This framework helps ensure you're not just chasing the latest AI trend, but focusing on implementations that will deliver real value to your business.
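If you want to compare several candidates side by side, the six dimensions above can be turned into a simple scorecard. The sketch below is one possible way to do it, not a prescribed method: the 1 to 5 scale, the equal weighting and the two example use cases with their scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# The six dimensions of the framework above.
CRITERIA = (
    "alignment",
    "measurable_impact",
    "data_availability",
    "technical_feasibility",
    "scalability",
    "stakeholder_buy_in",
)


@dataclass
class UseCase:
    name: str
    scores: Dict[str, int]  # 1 (weak) to 5 (strong), one entry per criterion

    def overall(self) -> float:
        missing = [c for c in CRITERIA if c not in self.scores]
        if missing:
            raise ValueError(f"{self.name}: missing scores for {missing}")
        # Equal weights are an assumption; adjust to your priorities.
        return mean(self.scores[c] for c in CRITERIA)


def rank(use_cases: List[UseCase]) -> List[UseCase]:
    """Highest overall score first."""
    return sorted(use_cases, key=lambda u: u.overall(), reverse=True)


# Made-up scores for two hypothetical candidates.
candidates = [
    UseCase("free-form blog generation", dict(zip(CRITERIA, (2, 2, 3, 2, 3, 3)))),
    UseCase("contract review", dict(zip(CRITERIA, (5, 4, 4, 5, 4, 3)))),
]

for uc in rank(candidates):
    print(f"{uc.name}: {uc.overall():.2f}")
```

The numbers matter less than the conversation they force: a use case that scores low on data availability or stakeholder buy-in is worth questioning before any prompt is written.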

High-Value Deliverable Use Cases

Based on these characteristics, we can broadly define several no-brainer use cases that can deliver value to any business. Here are some you can readily explore.

Information Extraction

Take unstructured content and transform it into structured data to enable insights. All businesses have unstructured content that isn’t leveraged, and if it were, it could massively improve the business. Think of all the files stored on SharePoint, Google Drive, Box or locked up within corporate software systems. Examples include licensing contracts, maintenance reports, HR interviews, performance reports, field visits, write-ups, support tickets, customer reviews, postmortems, etc.

By automatically turning this information-rich content into structured data, you can start generating insights for the business at a depth that has never been available, often unlocking new opportunities for growth or efficiency.

Content Review

Another wide field of application is content review: take unstructured content in, apply a guideline or formalized knowledge, decide whether the content is compliant, and flag issues and areas of improvement. It’s a wide category of use cases that is present in nearly all businesses and is a core part of key business processes: contract review, licensing approval, overage billing approval, documentation review based on product specs/code, application review, code review, completeness verification, etc.

There are thousands of different business-specific use cases that are about content review. The key is identifying the tasks that are highly similar, where there are clear and documented guidelines on how to review the content and where the output is deterministic based on input.

Content Repurposing

This category is similar to content generation, but it relies on existing information. It is not pure content creation (like this article), but content generation based on a large input context, reference data or even unstructured content (meeting notes, design specs, campaign briefs, Slack conversations, etc.). Product release assets are good examples of this use case, such as product documentation, how-tos and introduction guides.

How to Deliver? Get Cycles, Iterate Fast

A major contributor to success is being in a position to iterate fast on the project. Avoid spending time on low-level details of the LLM, and instead get to a point where iterating on the prompts, data model and input context is easy and quick. The easier it is, the more iteration cycles you get; the more cycles you get, the more options you test and the faster you converge on the best outcome. Too many projects today are bogged down by technical details, because the LLM stack is still rapidly evolving. But there are solutions and vendors to help with that.

A big part of the success and pace to value will lie in how many iterations your team is able to run to deliver what matters to the business.
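To make "easy and quick" concrete, here is a minimal sketch of an iteration loop: a handful of labeled examples, several prompt variants, and a pass rate per variant. The `stub_model` function is a deliberately simplistic placeholder for a real model call, and the templates, examples and exact-match check are assumptions chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

ModelFn = Callable[[str], str]
Example = Tuple[str, str]  # (input text, expected label)


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call (hypothetical behavior)."""
    label = "complaint" if "refund" in prompt.lower() else "praise"
    # Mimic a common failure mode: verbose answers unless told to be terse.
    return label if "one word" in prompt.lower() else f"I think this is {label}."


def pass_rate(template: str, examples: List[Example], model: ModelFn) -> float:
    """Share of examples where the model output matches the expected label."""
    hits = 0
    for text, expected in examples:
        output = model(template.format(text=text))
        if output.strip().lower() == expected.lower():
            hits += 1
    return hits / len(examples)


def compare(templates: Dict[str, str], examples: List[Example],
            model: ModelFn) -> Dict[str, float]:
    """Run every prompt variant against the same examples."""
    return {name: pass_rate(t, examples, model) for name, t in templates.items()}


TEMPLATES = {
    "v1": "Classify this review: {text}",
    "v2": "Classify this review as complaint or praise. "
          "Answer with one word.\n{text}",
}
EXAMPLES = [
    ("I want a refund now", "complaint"),
    ("Great product, love it", "praise"),
]

results = compare(TEMPLATES, EXAMPLES, stub_model)
print(results)
```

Once a loop like this exists, trying a new prompt, a different input context or a different model is a matter of adding one entry and rerunning, which is what makes many cycles affordable.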

Think Go Live: Architecture Matters

And the last major point to consider is how to go live. Too many AI initiatives are stopped before they can go live, mired in endless scripts, experiments and unscalable models. AI or not, data privacy matters, IT security matters and data flows matter. Have a plan to go live from day one and leverage solutions that enable buy-in from your IT security team.

Whatever the approach, make sure you have a plan to go live, so that after convincing the business of the value, you can deliver this value in production.

The Way Forward

In conclusion, the key to unlocking LLM potential for your business isn't about chasing the latest and greatest model or knowing all the quirks they come with. It's about identifying high-value, deliverable use cases that can demonstrate GenAI's power to transform your operations. By focusing on practical applications, maintaining flexibility in your model selection and prioritizing rapid iteration, you can sidestep the pitfalls of endless evaluation and comparison.

Remember, the true measure of AI's success isn't found in benchmark scores or model rankings. It's in the tangible benefits it brings to your business — the insights uncovered, the processes streamlined and the value delivered to your customers. So stop obsessing over model evaluations and start asking the real questions:

What are our most important business bottlenecks today?
Where are people spending their time on repetitive cognitive tasks?
What would we need to accelerate those processes?

By shifting your focus from model comparisons to use case implementation, you'll not only accelerate your AI journey, but also position your organization to reap the rewards of this transformative technology. The future of AI in business belongs to those who can identify and solve real-world problems, not those who endlessly debate model specifications.

It's time to move beyond the hype and start delivering results. Your competitive edge in the AI era doesn't depend on having the "best" model — it depends on how effectively you can leverage AI to solve your unique business challenges. So what are you waiting for? The next big breakthrough for your business might be just one well-chosen use case away.
