Tooling5 min read17 Jun 2026

Evals Are the New Unit Tests (And Most Teams Are Doing Them Wrong)

You can't ship a serious AI product on vibes. Evals are becoming the core engineering discipline of applied AI — and the moat nobody copies.

Written by northstar editorial·Updated 17 Jun 2026

Ask a team how they know their AI feature is good, and you'll learn everything about how seriously they're building. The unserious teams say "it feels good in testing." The serious ones show you a dashboard.

That dashboard is the eval suite, and in 2026 it has quietly become the central engineering discipline of applied AI. The teams that have it ship with confidence. The teams that don't are flying a plane with the instruments taped over.

Why "it works in the demo" stopped being enough

Traditional software is deterministic. Same input, same output, every time. A unit test that passes today passes tomorrow unless someone changes the code. That stability is why "it works on my machine" is a punchline — the bar is so much higher.

AI systems are not deterministic. The same prompt can produce different outputs. A prompt tweak that fixes one case silently breaks three others. A model update from the lab — which you didn't ask for and can't decline — shifts behavior across your entire product overnight. The "code" you depend on changes underneath you.

In that environment, manual spot-checking is malpractice. You cannot eyeball your way to confidence on a system with thousands of possible behaviors that mutate every time you touch a prompt. You need a way to measure quality that's repeatable, fast, and ruthless. That's an eval.

What an eval actually is

An eval is three things: a set of cases, a definition of good, and a scorer.

The cases are real inputs your system has to handle — ideally drawn from production, ideally including the inputs that have already burned you. The definition of good is your encoded judgment about what a correct or acceptable output looks like. The scorer is whatever assigns a number: an exact match, a rule, a regex, a similarity score, or another model acting as a judge.

Run all your cases through your current system, score them, and you have a number. Change a prompt, rerun, and you can see — instantly, quantitatively — whether you helped or hurt. That loop is eval-driven development, and it's as much of a step change for AI as test-driven development was for software.

The mistakes almost everyone makes first

Building a giant eval set before you have failures. Teams agonize over assembling a thousand pristine cases before shipping anything. Wrong order. Start with twenty cases that cover your known failure modes, ship, and grow the set every time production surprises you. Your eval suite should be a scar tissue archive, not a synthetic benchmark.

Measuring the wrong thing. A lot of evals score fluency, helpfulness, or some fuzzy quality score because those are easy to ask a judge model about. But your users don't care about fluency — they care about whether the answer was correct, whether the agent took the right action, whether the form got filled out right. Eval the outcome that matters to the user, even when it's harder to score.

Newsletter

Like this decoded read? Get the next one in your inbox.

One sharp take on the latest in AI for PMs, founders and operators. Every few days. Free.

Free forever. Unsubscribe anytime. No spam.

Trusting the judge blindly. LLM-as-a-judge is the workhorse of modern evals — you use a model to score outputs at scale. It's genuinely useful. It's also biased in known ways: it tends to prefer longer answers, it favors outputs that match its own style, and it can be inconsistent on edge cases. If you never validate your judge against human labels, you're measuring the judge's quirks, not your product's quality.

Where evals become a moat

Here's the part that doesn't get said enough. Anyone can call the same frontier model you call. The model is not your advantage. But your eval suite — the accumulated, domain-specific encoding of what "good" means in your particular problem space — is genuinely hard to copy.

A competitor can clone your prompts in an afternoon. They cannot clone two years of eval cases that capture every weird edge case your real customers threw at you, the judgment about what good looks like in your domain, and the regression history that lets you upgrade models fearlessly. That accumulated definition of quality is a compounding asset, and it's invisible from the outside, which is why it's defensible.

The teams treating evals as a chore are missing that they're building one of the few moats available in a world where the model is a commodity.

The model-upgrade test

Want to know if a team's evals are real? Ask what happens when a new frontier model ships.

The teams without evals greet a new model with dread. They have no way to know if switching will help or break things, so they either avoid upgrading (falling behind) or upgrade on faith (and discover the regressions in production, from angry users).

The teams with good evals greet a new model with a button press. Swap the model, rerun the suite, read the diff. Better on 80% of cases, worse on 12%, investigate those — done in an hour. The eval suite turns the terrifying, recurring event of a model update into a routine, measured decision. That difference compounds over years.

The takeaway

Evals are to applied AI what unit tests are to software, except more important, because the ground shifts under you constantly and your judgment about quality is the only fixed point.

Don't build a thousand-case benchmark before you ship. Build twenty cases from your real failures, wire up a scorer, and make "rerun the evals" the reflex before any change goes out. Validate your judge. Eval the outcome users care about. And recognize that the suite you're building isn't overhead — it's the encoded definition of quality that your competitors can't see and can't copy. In a world where everyone has the same model, the team that knows precisely what good means wins.

Frequently asked questions

What is an eval in AI product development? An eval is a repeatable test that scores your AI system's output against a definition of 'good' on a fixed set of cases. It's the AI equivalent of a unit test, except the output is non-deterministic and 'correct' is often a judgment call, so the eval encodes that judgment so you can measure it consistently across changes.

Do I need evals if I'm just using an off-the-shelf model? Yes. The moment you write a prompt, choose a model, or add retrieval, you've made decisions that affect quality — and any of them can silently regress when you tweak a prompt or the lab updates the model. Without evals you have no idea if your last change helped or quietly broke things.

Is LLM-as-a-judge reliable for scoring outputs? Reliable enough to be useful, not reliable enough to trust blindly. A judge model is great for scaling subjective scoring, but you must validate it against human labels on a sample, watch for known biases like favoring longer answers, and keep humans in the loop for high-stakes evals.

How many eval cases do I need to start? Far fewer than people think. Twenty to fifty well-chosen cases that cover your real failure modes beat a thousand random ones. Start with the cases that have actually burned you in production, and grow the set every time something breaks.

Want more sharp takes on AI tooling? Browse the AI Decoded archive.

Frequently asked

What is an eval in AI product development?

An eval is a repeatable test that scores your AI system's output against a definition of 'good' on a fixed set of cases. It's the AI equivalent of a unit test, except the output is non-deterministic and 'correct' is often a judgment call, so the eval encodes that judgment so you can measure it consistently across changes.

Do I need evals if I'm just using an off-the-shelf model?

Yes. The moment you write a prompt, choose a model, or add retrieval, you've made decisions that affect quality — and any of them can silently regress when you tweak a prompt or the lab updates the model. Without evals you have no idea if your last change helped or quietly broke things.

Is LLM-as-a-judge reliable for scoring outputs?

Reliable enough to be useful, not reliable enough to trust blindly. A judge model is great for scaling subjective scoring, but you must validate it against human labels on a sample, watch for known biases like favoring longer answers, and keep humans in the loop for high-stakes evals.

How many eval cases do I need to start?

Far fewer than people think. Twenty to fifty well-chosen cases that cover your real failure modes beat a thousand random ones. Start with the cases that have actually burned you in production, and grow the set every time something breaks.