Why AI Testing Is Crucial To Success



Scott Clark is the Cofounder and CEO of Distributional, which helps make AI safe, secure and reliable through adaptive testing.

Generative AI is one of the most exciting opportunities today, and 2025 is shaping up to be a pivotal year, with 67% of early adopter organizations planning to increase their investments. Executives are eager to see returns, but many GenAI projects are dying on the vine. According to Gartner, nearly 1 in 3 of these GenAI projects will be abandoned after proof of concept by the end of 2025.

What’s halting progress? While it is easy to blame performance, the real problem is larger and more complex: a lack of confidence in how the AI will behave over time.

Closing this confidence gap will require a new approach: proactively and adaptively testing these applications to ensure they behave as desired. An adaptive, behavioral approach to AI testing gives enterprises full confidence in their production AI applications while accelerating their pace of innovation.

The AI Confidence Gap

Overfocusing on performance is something I can relate to. My first company, SigOpt, was focused on helping some of the most sophisticated organizations in the world optimize their complex traditional AI and ML models using Bayesian optimization so that they could squeeze the most performance possible out of them.

After Intel acquired SigOpt in 2020, I led the AI and High Performance Computing team for Intel’s Supercomputing Group. It was there that I realized that despite SigOpt’s success, I was focusing on the wrong problem. Worse, people were starting to make the same mistake I did with these more powerful GenAI systems.

People don’t stay up at night wishing they could overfit an eval function by another half a percent. They’re worried their model will go off the rails and cause harm to their business by not behaving as desired.

The real pain with enterprise AI isn’t performance—it’s the AI confidence gap.

No matter how many manual spot checks or performance benchmarks are run, organizations simply don’t have confidence that their AI applications will actually behave as desired in production. That lack of confidence keeps them from pursuing higher-value AI use cases.

Testing has provided confidence in software behavior for decades, so why does it break down with AI?

Why Traditional Testing Falls Short In The Age Of AI

In traditional testing, code is static and engineers look for bugs by asserting that a specific input always returns a specific output. But AI applications are non-deterministic: the same input can return many different outputs. For testing AI, this means teams can’t rely solely on fixed datasets tied to specific outputs (also known as golden datasets); instead, they need to analyze the distribution of the app’s outputs and behaviors.
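To make this concrete, here is a minimal sketch of what a distributional test might look like in Python. The `generate()` function is a hypothetical wrapper around the application under test; instead of asserting one exact output, the test runs the same input many times and asserts properties of the resulting distribution.

```python
# A minimal sketch, not a prescribed implementation. `generate()` is a
# hypothetical wrapper around the non-deterministic AI application under test.
import statistics

def generate(prompt: str) -> str:
    """Placeholder: call the AI application and return its response."""
    raise NotImplementedError

def test_summary_length_behavior():
    prompt = "Summarize our refund policy for a customer."
    outputs = [generate(prompt) for _ in range(50)]  # same input, many outputs
    lengths = [len(o.split()) for o in outputs]

    # Assert properties of the distribution of outputs, not one exact string.
    assert statistics.median(lengths) <= 120          # typical answer stays concise
    overly_long = sum(1 for n in lengths if n > 250) / len(lengths)
    assert overly_long <= 0.05                        # long-winded answers stay rare
```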

Plus, AI applications are constantly shifting, with changes in usage, updates to prompts, new underlying models or shifting dependencies on upstream APIs, pipelines and services. Proper AI testing needs to adaptively keep up with these changes and inherent model non-stationarity.
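One way to keep up with that non-stationarity is to compare a recent window of production behavior against a baseline window. The sketch below is a hedged illustration using a two-sample Kolmogorov-Smirnov test from SciPy on a single behavioral property (response length); the inputs and the significance threshold are assumptions, not a prescribed implementation.

```python
# A hedged sketch: flag drift by comparing a recent window of a behavioral
# property against a baseline window with SciPy's two-sample KS test.
# `baseline_lengths` and `recent_lengths` would come from production logs.
from scipy.stats import ks_2samp

def behavior_has_shifted(baseline_lengths, recent_lengths, alpha=0.01):
    """Return True when the two samples are unlikely to share one distribution."""
    _, p_value = ks_2samp(baseline_lengths, recent_lengths)
    return p_value < alpha
```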

AI applications are only getting more complex, especially with agents. Undesired behavior can propagate throughout the interconnected systems and be challenging to trace back to the original source. AI testing needs to look at the entire application, including intermediate data, not just performance metrics on input/output pairs.
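As a rough illustration, an agentic pipeline can be instrumented so that each intermediate step emits testable properties alongside the final answer. The retrieve, plan and respond callables below are hypothetical placeholders for whatever stages an application actually uses.

```python
# A rough sketch: record behavioral properties at every intermediate step of a
# multi-step pipeline, not just on the final input/output pair. The retrieve,
# plan and respond callables are hypothetical stages supplied by the app.
from typing import Callable, List

def run_with_trace(query: str,
                   retrieve: Callable[[str], List[str]],
                   plan: Callable[[str, List[str]], List[str]],
                   respond: Callable[[str, List[str], List[str]], str]) -> dict:
    trace = {}

    docs = retrieve(query)
    trace["retrieval.num_docs"] = len(docs)

    steps = plan(query, docs)
    trace["plan.num_steps"] = len(steps)

    answer = respond(query, docs, steps)
    trace["answer.num_words"] = len(answer.split())

    # Each trace property is a testable quantity whose distribution can be
    # monitored, which makes a shift easier to trace back to its source step.
    return {"answer": answer, "trace": trace}
```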

Traditional testing isn’t capable of addressing these characteristics of AI applications. This is why I often hear from customers that AI is impossible to test, so they push it live and hope for the best.

The Limitations Of Evals And Observability

If traditional testing doesn’t work, what are teams doing instead?

During the development process, teams often rely on vibe checks to get an app to “good enough” performance. Vibe checks are inherently subjective and don’t scale. They also mask behavioral issues rather than guide teams to the understanding required to quantify and resolve the underlying problems.

As teams mature, they may define thresholds on performance using evals. But performance metrics alone will never capture the full picture and will miss more subtle shifts in behavior. When there is a performance drop, teams don’t have enough information to understand what is causing the change and resolve the issues. These solutions are too incomplete, limited and static for AI systems.
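A toy example with made-up numbers shows why a threshold on an average eval score can hide a behavioral shift: both samples below clear a “mean score of at least 0.80” gate, yet the second contains a new cluster of poor answers that a distributional view would surface immediately.

```python
# A toy illustration with made-up numbers: a threshold on the average score
# passes in both cases, but the second sample hides a new cluster of failures.
import numpy as np

rng = np.random.default_rng(0)
before = rng.normal(loc=0.85, scale=0.03, size=1000).clip(0, 1)
after = np.concatenate([
    rng.normal(loc=0.90, scale=0.02, size=850),
    rng.normal(loc=0.55, scale=0.05, size=150),   # new cluster of bad answers
]).clip(0, 1)

print(before.mean() >= 0.80, after.mean() >= 0.80)      # True, True: gate passes both times
print((before < 0.70).mean(), (after < 0.70).mean())    # ~0.0 vs ~0.15: behavior has shifted
```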

A More Adaptive Approach To AI Testing

Instead, AI testing needs to focus on the entirety of app behavior, not just performance. By taking into account the distributions of all behavioral properties, users get a more complete definition of desired behavior. By identifying macro behavioral changes and tying them to the testable quantitative properties causing those changes, they can refine this definition over time. Ultimately, this leads to an adaptive testing methodology that shows where and how behavior shifts so teams can catch and resolve underlying issues.
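A minimal sketch of this idea, under the assumption that outputs can be reduced to a handful of quantitative behavioral properties, might compare each property’s distribution against a baseline and report which ones shifted. The specific properties and threshold here are illustrative only.

```python
# A minimal sketch, assuming outputs can be reduced to quantitative behavioral
# properties. Property names and the significance threshold are illustrative.
from scipy.stats import ks_2samp

def extract_properties(output: str) -> dict:
    """Turn one output into a few testable behavioral properties."""
    text = output.lower()
    return {
        "num_words": len(output.split()),
        "apologizes": float("sorry" in text),
        "refuses": float("i can't" in text or "i cannot" in text),
    }

def shifted_properties(baseline_outputs, recent_outputs, alpha=0.01):
    """Return the names of properties whose distributions have changed."""
    baseline = [extract_properties(o) for o in baseline_outputs]
    recent = [extract_properties(o) for o in recent_outputs]
    flagged = []
    for name in baseline[0]:
        _, p_value = ks_2samp([p[name] for p in baseline],
                              [p[name] for p in recent])
        if p_value < alpha:
            flagged.append(name)   # this property explains part of the shift
    return flagged
```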

With AI, model and production usage will always change. Unlike traditional software, teams are not climbing a static peak toward test coverage. Instead, they need to adaptively surf a wave of behavioral test depth as business needs and usage change over time to stay confident.

Bridging The AI Confidence Gap With Adaptive AI Testing

Now is a fun time to be in AI. The opportunity feels limitless, especially in the enterprise.

But there is a rub. These AI systems can be powerfully good or powerfully bad. As I mentioned earlier, nearly 1 in 3 GenAI projects are abandoned after proof of concept, and fewer still make it into production. In production, these systems can achieve world-class performance one day and harmfully poor behavior the next, exposing companies to financial, reputational or, increasingly, regulatory risk.

An adaptive approach to AI testing helps you define and ensure desired behavior continuously, giving you the confidence to productionalize more and fully realize AI’s potential.

How can you get started? First, collect usage data so you can be ready to test even before production. Next, implement an adaptive testing solution to understand and quantify the desired behavior. And test these applications for behavioral shifts in production, not just during development.
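For that first step, even a very simple logging layer is enough to start building a behavioral baseline before anything is gated on it. The sketch below is an assumption-laden example that appends each interaction, plus one derived property, to a JSONL file; in practice this would go to whatever data store a team already uses.

```python
# An assumption-laden sketch of step one: log every interaction, plus a derived
# behavioral property, so a baseline exists before anything is gated on it.
# The JSONL path and record fields are illustrative, not a required schema.
import json
import time

LOG_PATH = "usage_log.jsonl"

def log_interaction(prompt: str, output: str) -> None:
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "num_words": len(output.split()),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
```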

I’m excited for 2025 to be the year of productionalizing and getting real business value out of AI through AI testing.


Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.



