Define and measure quality in AI-powered systems

[Image: stress on the fabric of society]

Companies around the world are excited about the possibilities of generative AI. Most are currently exploring how software powered by large language models (LLMs) could support their business.

I recently wrote a blog post about quality and AI-powered applications. In addition to productivity tools, generative AI can significantly affect how business applications serve customers. Companies seek to get ahead of the curve with personalized offerings, advanced data analysis, and other new capabilities.

We at Hidden Trail believe quality is an area that will make or break the business capabilities of AI-powered products. Incidents happen all the time. The possible negative impacts on a business range from brand-damaging catastrophes to costly maintenance when an AI component's functionality changes without proper notice from the service provider.

Due to the nature of generative AI software, and the fact that we don’t yet fully understand it, companies need to build proper guardrails, test frameworks, and measurement systems to understand and control their AI-powered software components. And it’s not just companies; quite honestly, the fabric of our whole society is at stake here.

This blog post is about practical methods for measuring the outputs of generative AI components in software applications. You can use this information to consider how to ensure quality when introducing intelligent software components for your business case.

 

Observations of current trends

Right now, we see a whole lot of product managers getting requests from management: “We want a wow product with AI, please deliver it ASAP”. The drivers come from stockholders instead of users. Most of these products would probably have benefited more from investment in user experience and service design than from hastily introducing a large language model into their architecture. Not to mention the risk posed by a black-box component in the application! The cutthroat timelines leave no room for building proper business cases, let alone test strategies and risk management plans.

In essence, we are looking at thousands of proof-of-concept and exploration projects in AI-powered application development. Companies are asking “Could this be a cool thing?” instead of “What does it take to make this product ready for end-user consumption?”. Questions about quality, measurement, and the ability to scale are already knocking on companies' doors after the PoC phase.

A whole lot of people will start measuring AI-powered applications very soon. At first, it will be about balancing cost against performance: how quickly can we serve a growing user base with minimal spending on API tokens? While comparing monthly AI-as-a-service spending with the performance of your AI-powered application is easy, several other factors are in play in the financial considerations of the intelligent application business.

After the initial discussion about cost versus performance, we believe people will start to ask: What does quality mean in our business case? What are the benefits and risks of generative AI, and how do we control them? How do we steer the initiatives that implement our AI strategy and capabilities?

 

Model selection & prompt engineering

When it comes to measuring systems with AI-powered components, you need to separate model evaluation from prompt evaluation. The former is about the overall performance of the large language model and version you choose to use; the latter is about prompt effectiveness and design, based on what we can observe in the quality of the LLM output.

When planning agent architectures in generative AI (a whole other area to explore), or simply choosing which LLM is right for your case, model evaluation is a central issue. This is an area in which one should consider consulting machine learning and model experts. Choosing the right model for the right job is central to hitting the mark on cost, performance, and maintainability. It is important to note that re-evaluating and updating models is inevitable: new models are released every month, and a company leveraging these capabilities must have a strategy to handle this.

This blog post leans more towards measuring prompt effectiveness. Quality and business analysis go hand in hand in this area. The insights we get from exploring and building test libraries for our chosen models help us design the best possible input methods for our application, as well as the format and content of the output we feed back to users. Engineering prompts is easier than model training and fine-tuning, and it can be done by people who are not necessarily machine learning or artificial intelligence engineers.
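To make that concrete, here is a minimal sketch of what prompt-effectiveness measurement can look like in practice: the same test inputs run through two prompt templates and scored with whatever metric fits your case. Everything here (the templates, call_llm, score_output) is a hypothetical placeholder, not a reference to any particular tool or vendor API.

```python
# Minimal sketch: compare two prompt templates over the same test inputs.
# call_llm and score_output are stand-ins you would replace with your own
# LLM call and your chosen evaluation metric(s).

TEST_INPUTS = [
    "Summarize our refund policy for a customer.",
    "Explain which delivery options we offer.",
]

PROMPT_TEMPLATES = {
    "terse": "Answer briefly and factually: {question}",
    "friendly": "You are a helpful assistant. Answer warmly and accurately: {question}",
}

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whichever LLM service your application uses."""
    return f"(model output for: {prompt})"

def score_output(output: str) -> float:
    """Stand-in for a real metric (relevance, toxicity, clarity, ...)."""
    return min(len(output) / 100, 1.0)  # placeholder heuristic only

def compare_templates() -> dict[str, float]:
    """Average the metric score of each template over the shared test inputs."""
    results = {}
    for name, template in PROMPT_TEMPLATES.items():
        scores = [
            score_output(call_llm(template.format(question=question)))
            for question in TEST_INPUTS
        ]
        results[name] = sum(scores) / len(scores)
    return results

if __name__ == "__main__":
    print(compare_templates())
```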

 

Measurement focus

At Hidden Trail, we are interested in several aspects of quality in application development. In this context, I’ll list a few high-level questions that are central to any measurement strategy.

- How do we ensure that our applications impact end users' lives positively over time?
- How do we satisfy business requirements with good-quality software?
- How do we make technical and practical decisions in development with the help of timely, reliable data?

Also, consider these key points to enhance your measurement landscape.

- Multifaceted approach: In addition to accuracy and performance, consider relevance, consistency, and ethical measurements.
- Business alignment: Tie performance to clear business KPIs like conversion and retention rates.
- Measurement techniques: Actively develop human assessment methods in addition to automated evaluation.
- Negative and edge cases: Build libraries of bad output and use them to gain insight into benchmarking and risk management capabilities.

 


Evaluation metrics

You can choose from the existing evaluation metrics provided by the numerous tools available. Eventually, companies will likely create custom metrics as they gain experience with the tools and as their business cases mature. Here are some common examples of measurements:

- Conciseness
- Coherence
- Clarity
- Toxicity
- Bias
- Hallucination
- Summarization
- Safety for children
- Relevance
- Precision
- Originality
- Engagingness
- Medical language checker
- Financial regulation checker

Which measurements you should use depends entirely on the business case and on the characteristics of your industry and the application under development. If you are planning a friendly advice bot for a teen magazine, you face a completely different set of risks than with a summarization assistant for academic papers. In a case like this, the way you configure toxicity management for your LLM's output makes a world of difference.
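As a concrete illustration, here is a small sketch of how the same metrics could be weighted differently per business case. The profile names, metric keys, and threshold values are purely illustrative assumptions, not defaults from any particular evaluation tool.

```python
# Hypothetical evaluation profiles: the same metrics, tuned per business case.
# Metric names and threshold values are illustrative assumptions only.

EVALUATION_PROFILES = {
    # Friendly advice bot for a teen magazine: very strict on toxicity and
    # child safety, more relaxed about conciseness.
    "teen_advice_bot": {
        "toxicity_max": 0.05,
        "child_safety_min": 0.99,
        "relevance_min": 0.70,
        "conciseness_min": 0.40,
    },
    # Summarization assistant for academic papers: strict on hallucination and
    # relevance; toxicity is rarely the dominant risk.
    "academic_summarizer": {
        "toxicity_max": 0.30,
        "hallucination_max": 0.10,
        "relevance_min": 0.90,
        "conciseness_min": 0.80,
    },
}

def passes_profile(scores: dict[str, float], profile: dict[str, float]) -> bool:
    """Check metric scores against a profile: *_max keys are upper bounds, *_min lower bounds."""
    for key, threshold in profile.items():
        metric, bound = key.rsplit("_", 1)
        if bound == "max" and scores.get(metric, 0.0) > threshold:
            return False
        if bound == "min" and scores.get(metric, 1.0) < threshold:
            return False
    return True
```

The point is not the specific numbers but that each application carries its own risk profile, which your thresholds should reflect.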

 

Measuring in practice

Once you have a fairly good idea of what your business case requires and which evaluation measurements support the case, you should start exploring the capabilities of the available tools. Through exploration and testing, you will figure out the best way to gain insight from the AI-generated outputs in your application. Some of the solutions we’ve been exploring are listed below:

- DeepEval
- OpenAI evals
- Root Signals
- Langsmith
- Promptfoo
- Haystack

Most of these tools are easy to set up and use. You can, for example, create question-and-answer pairs that you know to be true and pairs that you know to be false. Using these as benchmarks, you can compare them against your AI-powered system's answers to the same questions, as sketched below. By observing the results and tuning the measurement sensitivity thresholds, you can start building a framework suitable for your needs. There is naturally a lot more to this process, but this simple setup is something you can start with.
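Here is a minimal sketch of that benchmark idea, assuming you compare system answers against reference answers with a simple text-similarity score. ask_system, the example pairs, and the threshold are hypothetical placeholders; in practice you would plug in your own application call and a metric from your chosen evaluation tool.

```python
# Minimal sketch: benchmark question-answer pairs against system output.
# ask_system and the threshold are placeholders; replace the crude text
# similarity with a metric from your chosen evaluation tool.
from difflib import SequenceMatcher

BENCHMARK = [
    # (question, reference answer, should the system agree with the reference?)
    ("What year was the company founded?", "The company was founded in 2015.", True),
    ("Do we offer a lifetime warranty?", "Yes, every product has a lifetime warranty.", False),
]

SIMILARITY_THRESHOLD = 0.8  # a sensitivity setting you tune through exploration

def ask_system(question: str) -> str:
    """Placeholder for a call into your AI-powered application."""
    raise NotImplementedError

def run_benchmark() -> list[dict]:
    """Return one result row per benchmark pair, flagging unexpected (dis)agreements."""
    results = []
    for question, reference, should_agree in BENCHMARK:
        answer = ask_system(question)
        similarity = SequenceMatcher(None, answer.lower(), reference.lower()).ratio()
        agrees = similarity >= SIMILARITY_THRESHOLD
        results.append({
            "question": question,
            "similarity": round(similarity, 2),
            "passed": agrees == should_agree,  # catches drift in either direction
        })
    return results
```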

The tools do share a similarity: You can’t really see into the reasoning of evaluation algorithms. The way to learn how well evaluation tools work for you is by testing them out and getting comfortable with using them. As you do this, you will eventually find a development and testing stack that fits your needs.

 

What next?

To conclude, the road to production readiness for AI-powered applications has many twists and turns. For quality-oriented professionals, this is a feast of learning and exploration opportunities. LLM outputs are non-deterministic, meaning they can change from one run to another, even if the prompt is exactly the same. Changing a model or its version can have drastic impacts on your user-facing output. We must explore novel ways of testing these components in our business applications and guard our companies against significant risk.

While there is risk and uncertainty, the nature of these AI-powered tools is also their strength: Humans like to feel that they are understood and that the systems they interact with feel organic. As we go forward in the evolution of generative AI, we must learn how to harness these positive aspects and control the negative ones.

 