LLMs are stochastic processes, so we can't rely on standard software-testing techniques, which assume deterministic output, to make sure they work. We can instead set up evaluations: run the product through a series of prompts, then have an LLM grade each output against criteria we define. Here is how you might set up such a prompt.
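The sketch below uses the OpenAI Python client as an illustration; the model name, the grading criteria, and the PASS/FAIL answer format are all placeholders to swap for your own.

```python
# A minimal LLM-as-judge prompt, sketched with the OpenAI Python client.
# The model name and criteria are placeholders, not a recommendation.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the output of an AI assistant.

Criteria:
1. Does the response answer the user's question?
2. Is the tone appropriate for a support context?
3. Does it avoid claims not supported by the provided context?

User question: {question}
Assistant response: {response}

Answer with a single word, PASS or FAIL, followed by a one-sentence reason."""

def judge(question: str, response: str) -> str:
    """Ask the judge model to grade one response against the criteria."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever judge model you trust
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}
        ],
        temperature=0,  # keep the judge as deterministic as the API allows
    )
    return completion.choices[0].message.content
```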
Once upon a time, we’d call this a classifier, but basically we’re just using an LLM to do the same job. If you set up some kind of test harness and can capture the prompts and results, you can run experiments and see what works better at producing the outcome you are looking for.
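The harness doesn't need to be fancy. Here's a minimal sketch, with made-up prompt variants and test questions, that runs each variant over the same cases and writes the prompts, outputs, and verdicts to a CSV:

```python
# A minimal test harness sketch: run each prompt variant over a set of test
# cases, capture the outputs, and record the judge's verdict for each one.
# The variant names and test questions here are invented for illustration.
import csv

PROMPT_VARIANTS = {
    "terse": "Answer the customer's question in one sentence:\n{question}",
    "friendly": "You are a friendly support agent. Answer the customer's question:\n{question}",
}

TEST_QUESTIONS = [
    "How do I reset my password?",
    "Can I get a refund after 30 days?",
]

def run_experiment(generate, judge, out_path="results.csv"):
    """generate(prompt) returns the model output; judge(question, response) returns a verdict."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["variant", "question", "response", "verdict"])
        for name, template in PROMPT_VARIANTS.items():
            for question in TEST_QUESTIONS:
                response = generate(template.format(question=question))
                verdict = judge(question, response)
                writer.writerow([name, question, response, verdict])
```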
Aman Khan did a nice run-through of this on Maven using Arize, for anyone who wants to learn more.
You can see what a dashboard might look like:
Import your actuals as a CSV or some other kind of file and load them into the tool. Isolate the prompt variants you are comparing in your experiment, and then see what works better!
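If you'd rather stay in code, here's a rough sketch that tallies pass rates per prompt variant from the CSV the harness above writes (the column names are assumptions carried over from that sketch):

```python
# Tally the share of PASS verdicts for each prompt variant in results.csv.
import csv
from collections import Counter

def pass_rates(path="results.csv"):
    """Return {variant: fraction of cases the judge marked PASS}."""
    totals, passes = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["variant"]] += 1
            if row["verdict"].strip().upper().startswith("PASS"):
                passes[row["variant"]] += 1
    return {variant: passes[variant] / totals[variant] for variant in totals}

if __name__ == "__main__":
    # Print variants from best to worst pass rate.
    for variant, rate in sorted(pass_rates().items(), key=lambda kv: -kv[1]):
        print(f"{variant}: {rate:.0%} pass rate")
```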
I don’t know how general-purpose LLMs are, or how many of the basic model types they can subsume (classifiers, regressions, cluster analyses, etc.). My guess is that it’s less about the model type and more about whether the task consists of analyzing text associatively, which plays to LLMs’ strengths. I’d still run regressions the old-fashioned way.