LLM evaluations will continue to be underserved. Here's what we can do about it.
Why some productized agent systems will fail and others won't. My two cents for today.
What even is an LLM evaluation?
From OpenAI’s evaluation documentation:
Evaluations (often called evals) test model outputs to ensure they meet style and content criteria that you specify. Writing evals to understand how your LLM applications are performing against your expectations, especially when upgrading or trying new models, is an essential component to building reliable applications.
This definition is a good technical starting point, but it can readily be expanded. Beyond content and style requirements, you might also want to evaluate things such as:
Task Success and Functional Correctness
Many LLM applications are goal-oriented. Whether it’s writing code, summarizing a conversation, generating a response to a customer query, or proposing a supply chain decision, your evaluation criteria should include whether the output actually does the job (a small functional-correctness sketch follows these questions):
Is the output factually accurate?
Is it actionable, complete, and relevant to the input?
Would a human user consider this output useful?
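Below is a minimal sketch of what a functional-correctness check can look like for a code-generation task. Everything in it is illustrative: the expected function name `solve`, the tiny test cases, and the `generated` string all stand in for your own task and model output.

```python
# Hedged sketch: check whether generated code actually solves the task.
def passes_functional_tests(generated_source: str, test_cases: list[tuple]) -> bool:
    """Exec the generated code and run it against known input/output pairs."""
    namespace: dict = {}
    try:
        # Never exec untrusted model output outside a sandbox in production.
        exec(generated_source, namespace)
    except Exception:
        return False  # code that doesn't even parse or import fails the eval
    fn = namespace.get("solve")  # assumption: the prompt asked for a function named `solve`
    if fn is None:
        return False
    for args, expected in test_cases:
        try:
            if fn(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Example: the model was asked to write `solve(a, b)` that returns the sum.
generated = "def solve(a, b):\n    return a + b\n"
print(passes_functional_tests(generated, [((1, 2), 3), ((0, 0), 0)]))  # True
```

The same pattern generalizes beyond code: define what "did the job" means up front, then encode it as a check you can run on every output.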
Guardrails
Task success is only half the picture: the output also has to stay inside your constraints, with no leaked system prompts, no personal data, and no content your policy prohibits. A small rule-based sketch of this kind of check follows.
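This is a hedged sketch of what a rule-based guardrail check might look like; the PII patterns and the blocklist below are placeholders, and real systems typically pair rules like these with model-based safety classifiers.

```python
import re

# Placeholder patterns; tune for your own policy.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]
BLOCKLIST = ["internal use only", "system prompt:"]

def violates_guardrails(output: str) -> list[str]:
    """Return human-readable violations; an empty list means the output passes."""
    violations = []
    for pattern in PII_PATTERNS:
        if pattern.search(output):
            violations.append(f"possible PII match: {pattern.pattern}")
    lowered = output.lower()
    for phrase in BLOCKLIST:
        if phrase in lowered:
            violations.append(f"blocked phrase: {phrase}")
    return violations

print(violates_guardrails("Contact me at jane@example.com"))  # flags the email
```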
In practice, LLM evals can serve as:
Regression tests, to ensure you haven’t broken existing behaviors when tweaking prompts or upgrading models (a pytest-style sketch follows this list).
Model selection tools, helping compare how different LLMs perform under real use cases.
Feedback collectors, especially when instrumented in your app to gather user reactions and real-world errors.
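As a concrete example of the regression-test use, here is a rough pytest-style sketch. `generate()` is a hypothetical wrapper around your own agent or prompt chain, and the cases and expected substrings are placeholders for a real eval set.

```python
import pytest

def generate(prompt: str) -> str:
    # Hypothetical stub: replace with a call to your model, chain, or agent.
    raise NotImplementedError("call your model or agent here")

# A frozen mini eval set: prompts paired with substrings the answer must contain.
REGRESSION_CASES = [
    ("What is the capital of France?", "Paris"),
    ("Summarize: the meeting is moved to 3pm Friday.", "3pm"),
]

@pytest.mark.parametrize("prompt,expected_substring", REGRESSION_CASES)
def test_no_regression(prompt, expected_substring):
    output = generate(prompt)
    assert expected_substring.lower() in output.lower()
```

Run it on every prompt tweak or model upgrade and track the pass rate over time; the moment it dips, you know exactly which behavior broke.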
Most people don't even deploy agents with evals (and that's a problem)
Let’s be blunt: a shocking number of LLM agents get shipped without any real evaluation framework at all. Working in San Francisco, I see a ship-fast mentality across many startups in the agent space. As these companies scale up and try to sell to enterprises, a rude awakening is waiting for them: people want confidence that what you are selling them actually works. What a concept.
Teams invest weeks tuning prompts, wiring up tools, designing clever planners, and orchestrating multi-step workflows—but then… they hit "deploy" and rely entirely on vibes.
No automated checks. No regression tests. No system for knowing whether today’s outputs are better or worse than last week’s. If something breaks, they notice because a user complains—or worse, doesn’t.
This isn’t just a technical gap—it’s an organizational blind spot.
LLM behavior is probabilistic, not deterministic. Without evals, you have no baseline and no ability to measure drift.
Agentic systems introduce combinatorial complexity. More steps, more failure modes, more surface area. You need structure to manage it.
Tooling is evolving fast, and models change under your feet. If you’re not evaluating, you’re not even in the loop when the ground shifts.
And still, many teams treat evals like a “nice-to-have” or an afterthought. Why? Because they think of them as overhead. But the irony is: not evaluating is way more expensive in the long run. You pay for it in bug hunts, brittle UX, and eroded customer trust.
The best teams don’t just ship evals—they design for them from the start. They define success before writing prompts. They build feedback capture into the app experience. And they track eval metrics like product KPIs.
Because at some point, it stops being about the model’s capability and starts being about your system’s reliability.
Resources covering units of evaluation measurement
When you implement any evaluation, you have a goal in mind, but you also need a way to measure progress toward it. Here are some resources, with a few hedged code sketches along the way, since I won’t go into too much detail about measurement here :)
Automated Tests
String Matching / Regex: LangChain Output Parsers
Token Usage: OpenAI Token Counter (tiktoken)
Model-Based Scoring (BLEU, ROUGE, etc.): Hugging Face Metrics
Semantic Scoring: BERTScore
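To make the items above concrete, here is a hedged sketch that touches each one: a regex check, token counting with tiktoken, and ROUGE plus BERTScore through Hugging Face's evaluate library. The prediction and reference strings are placeholders, and while these metric names and calls are standard, check the current library docs before relying on exact signatures.

```python
import re
import tiktoken
import evaluate  # pip install evaluate rouge_score bert_score tiktoken

prediction = "The meeting was moved to 3pm on Friday."
reference = "The meeting has been rescheduled to Friday at 3pm."

# 1) String matching / regex: does the output mention a time at all?
has_time = bool(re.search(r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b", prediction, re.IGNORECASE))

# 2) Token usage: how long is the output, for cost and latency budgets?
encoding = tiktoken.get_encoding("cl100k_base")
token_count = len(encoding.encode(prediction))

# 3) N-gram overlap scoring (ROUGE); BLEU works the same way via evaluate.load("bleu").
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=[prediction], references=[reference])

# 4) Semantic similarity scoring (BERTScore).
bertscore = evaluate.load("bertscore")
bert_scores = bertscore.compute(predictions=[prediction], references=[reference], lang="en")

print(has_time, token_count, rouge_scores["rougeL"], bert_scores["f1"][0])
```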
Human-Labeled Goldsets
Eval Structure + Setup: OpenAI Evals GitHub
Rubric Design Examples: Humanloop Guide
Enterprise Eval Templates: Scale AI Eval Page
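The usual shape of a human-labeled goldset is a JSONL file pairing an input with the human-approved ("ideal") answer, which is roughly the format OpenAI Evals expects. The file name, field names, and crude containment grading below are all illustrative.

```python
import json

# goldset.jsonl, one human-labeled example per line, e.g.:
#   {"input": "What is our refund window?", "ideal": "30 days"}
#   {"input": "Do you ship to Canada?", "ideal": "yes"}

def load_goldset(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def grade_against_goldset(generate, goldset: list[dict]) -> float:
    """Crude exact-containment grading; swap in a rubric or model grader as needed."""
    hits = 0
    for example in goldset:
        output = generate(example["input"])
        if example["ideal"].lower() in output.lower():
            hits += 1
    return hits / len(goldset)

# accuracy = grade_against_goldset(my_agent, load_goldset("goldset.jsonl"))
```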
Model-Graded Evals (LLM-as-Judge)
Judge Prompt Examples: Anthropic Constitutional AI
Eval Tools: TruLens
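A model-graded eval is, at its core, just a second model call with a judge prompt. Here is a minimal sketch using the OpenAI Python SDK; the model name, rubric wording, and 1-to-5 scale are assumptions you would tune for your own task.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (useless) to 5 (fully correct and helpful).
Reply with only the number."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model for a 1-5 score; the default model name is a placeholder."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

# score = judge("What is our refund window?", "You can return items within 30 days.")
```

Judges drift too, so spot-check a sample of their scores against human labels before trusting them as a metric.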
Live Feedback (Production)
Observability in LangChain: LangChain Observability
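For live feedback, the important part is simply wiring a thumbs-up/down (or free-text) signal from your UI back to a store you can mine for new eval cases. The sketch below appends events to a local JSONL file as a stand-in; in practice you would send them to your tracing or observability backend.

```python
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("feedback.jsonl")  # placeholder sink; swap for your tracing backend

def record_feedback(request_id: str, prompt: str, output: str, score: int, comment: str = "") -> None:
    """Append one user-feedback event; score is e.g. +1 (thumbs up) or -1 (thumbs down)."""
    event = {
        "ts": time.time(),
        "request_id": request_id,
        "prompt": prompt,
        "output": output,
        "score": score,
        "comment": comment,
    }
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

# Negative-scored events are natural candidates for your next goldset revision.
```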
Going through these resources should give you a solid primer on what exists today, or serve as a refresher if you already know the landscape.

