You can't ever dream up everything in the first place. You don't really know what you want until you see it with these LLMs, so you've got to be flexible; you have to look at your data.
Hamel Husain & Shreya Shankar
AI Evals Instructors, Maven Course Creators
25 quotes across 1 episode
Why AI evals are the hottest new skill for product builders
You're asking the judge to do one thing, evaluate one failure mode, so the scope of the problem is very small and the output of this LLM judge is pass or fail. So it is a very tightly scoped thing that LLM judges are capable of doing very reliably.
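A minimal sketch of what that tightly scoped judge can look like in code, assuming the OpenAI Python client; the failure mode, prompt wording, and model name are illustrative choices, not from the episode:

```python
# A minimal sketch of a tightly scoped LLM judge: one failure mode, binary output.
# The failure mode, prompt wording, and model name below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking one thing only: did the assistant reveal
internal tool names to the user? Read the trace and answer PASS if it did not,
FAIL if it did. Answer with exactly one word: PASS or FAIL."""

def judge_trace(trace_text: str) -> str:
    """Return 'PASS' or 'FAIL' for a single, narrowly scoped failure mode."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": trace_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```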
Usually, I'll spend three to four days really working with whoever to do initial rounds of error analysis. This is a one-time cost. Once I've figured out how to integrate that into unit tests, or I have a script that automatically runs it on samples, I would say maybe 30 minutes a week after that.
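One way that recurring "30 minutes a week" step could look, sketched under the assumption that `judge_trace` from the previous block exists and that `load_recent_traces` is your own hypothetical helper for pulling application logs:

```python
# A hedged sketch of the weekly check: sample recent traces, run the scoped
# judge over them, and fail loudly if the pass rate slips. Could run on a
# schedule or inside a test suite.
import random

def weekly_eval(sample_size: int = 50, min_pass_rate: float = 0.90) -> None:
    traces = load_recent_traces()                  # assumption: your own helper over app logs
    sample = random.sample(traces, min(sample_size, len(traces)))
    results = [judge_trace(t) for t in sample]     # the scoped judge from the sketch above
    pass_rate = results.count("PASS") / len(results)
    print(f"pass rate on {len(results)} sampled traces: {pass_rate:.0%}")
    assert pass_rate >= min_pass_rate, "regression on this failure mode, inspect the failing traces"
```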
The goal is not to do evals perfectly, it's to actionably improve your product.
You don't want to skip this step. The reason I'm kind of spending so much time on this is this is where people get lost. They go straight into evals like, 'Let me just write some tests,' and that is where things go off the rails.
Before you release your LLM as a judge, you want to make sure it's aligned to the human. A lot of people stop there and they say, 'Okay, I have my judge prompt. We're done.' Don't do that, because that's the fastest way that you can have evals that don't match what's going on, and when people lose trust in your evals, they lose trust in you.
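A hedged sketch of that alignment step: run the judge over traces a human has already labeled and surface every disagreement for review before trusting it. The field names (`trace`, `human_label`) are assumptions for illustration:

```python
# Before releasing the judge, compare it against human-labeled traces and
# inspect disagreements; iterate on the judge prompt until they are rare.
def alignment_report(labeled_traces: list[dict]) -> float:
    """labeled_traces: [{'trace': str, 'human_label': 'PASS' | 'FAIL'}, ...]"""
    agreements = 0
    for item in labeled_traces:
        judge_label = judge_trace(item["trace"])
        if judge_label == item["human_label"]:
            agreements += 1
        else:
            print("disagreement:", item["human_label"], "(human) vs", judge_label, "(judge)")
    return agreements / len(labeled_traces)
```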
I think a lot of people prematurely do A-B tests, because they've never done any error analysis in the first place. If you're going to do A-B tests and they're powered by actual error analysis as we've shown today, then that's great, go do it. But if you're just going to do them based on what you hypothetically think is important, then I would encourage people to go and rethink that.
To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in.
Everyone that does this immediately gets addicted to it. When you're building an AI application, you just learn a lot.
Put your product hat on and ask: is this really good? That's where the fun part is. You're looking at data, you're annotating things. Actually, I was just looking at a client's data yesterday, the same exact process. It's a lot of fun, actually.
For me, between four and seven. It's not that many, because a lot of the failure modes can be fixed by just fixing your prompt. You shouldn't do an eval like this for everything, just the pesky ones.
We recommend doing at least 100 of these. Keep looking at traces until you feel like you're not learning anything new.
You should start with some kind of data analysis to ground what you should even test, and that's a little bit different than software engineering, where you have a lot more expectations of how the system is going to work. With LLMs, there's a lot more surface area. It's very stochastic, so you kind of have a different flavor here.
Just write down the first thing that you see that's wrong, the most upstream error. Don't worry about all the errors, just capture the first thing that you see that's wrong, and stop, and move on.
Keep looking at traces until you feel like you're not learning anything new. There's actually a term in data analysis and qualitative analysis called theoretical saturation.
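One possible shape for this open-coding loop, purely illustrative: `load_traces` and the `id`/`text` fields are assumptions, and the "still learning?" prompt is a rough stand-in for judging theoretical saturation yourself:

```python
# A hedged sketch of open coding during error analysis: look at one trace at a
# time, write down only the first (most upstream) thing that's wrong, and stop
# once new traces stop teaching you anything.
import csv

def annotate_traces(output_path: str = "error_notes.csv") -> None:
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trace_id", "first_upstream_error"])
        for trace in load_traces():               # assumption: your own trace source
            print(trace["text"])                  # assumption: trace fields named 'text' / 'id'
            note = input("First thing wrong (blank = looks fine): ").strip()
            writer.writerow([trace["id"], note])
            if input("Still learning new things? (y/n): ").lower() == "n":
                break                             # rough proxy for theoretical saturation
```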
The top one is, 'We live in the age of AI. Can't the AI just eval it?' But it doesn't work.
You're never going to know what the failure modes are going to be upfront, and you're always going to uncover new vibes that you think your product should have. You don't really know what you want until you see it with these LLMs.
What we usually find when we try to ask an LLM to do this error analysis is that it just says the trace looks good, because it doesn't have the context needed to understand whether something might be a bad product smell or not.
People are making dashboards on this, and I think that's incredible. I think the products that are doing this, they have a very sharp sense of how well their application is performing, and people don't talk about it, because this is their moat.
People are not going to go and share all of these things, and that makes sense. If you are an email-writing assistant, and you're doing this and you're doing it well, you don't want somebody else to go and build an email-writing assistant and put you out of business.
Most people don't have that skill right now. People who work at Anthropic are very, very highly skilled. They've been trained in data analysis or software engineering or AI, and whatnot. You can get there, anyone can get there, of course, by learning the concepts, but most people don't have that skill right now.
Basic counting is the most powerful analytical technique in data science because it's so simple and it's kind of undervalued in many cases.
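As a concrete instance of "basic counting," tallying the open-coded notes from the earlier annotation sketch is often enough to show which failure modes dominate. The file and column names follow the assumed format above:

```python
# Count how often each open-coded failure note appears; the most frequent
# failure modes are usually where to spend prompt or eval effort first.
import csv
from collections import Counter

with open("error_notes.csv") as f:
    reader = csv.DictReader(f)
    notes = [row["first_upstream_error"] for row in reader if row["first_upstream_error"]]

counts = Counter(notes)
for failure_mode, n in counts.most_common(10):
    print(f"{n:4d}  {failure_mode}")
```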
This is the same data science as before, and I think that's what's causing the confusion: "Hey, we need data science thinking." My take is that it's helpful to have that thinking in AI products just like it is in any product.
A lot of people go straight to this agreement. They say, 'Okay, my judge agrees with the human some percentage of the time.' That sounds appealing, but it's a very dangerous metric to use, because a lot of the time errors only happen in the long tail, and they don't happen that frequently.
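A tiny worked example of why raw agreement misleads when failures live in the long tail: a judge that never flags anything still agrees with the human 95% of the time, so per-class rates are what to look at. The label split here is made up for illustration:

```python
# Why raw agreement misleads: with rare failures, a judge that always says PASS
# still "agrees" with the human most of the time. Per-class rates expose this.
human = ["PASS"] * 95 + ["FAIL"] * 5
judge = ["PASS"] * 100                      # a useless judge that never flags anything

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
tpr = sum(h == j == "FAIL" for h, j in zip(human, judge)) / human.count("FAIL")
tnr = sum(h == j == "PASS" for h, j in zip(human, judge)) / human.count("PASS")

print(f"raw agreement: {agreement:.0%}")                   # 95%, which looks great
print(f"catch rate on real failures (TPR): {tpr:.0%}")     # 0%, which is the truth
print(f"agreement on passing traces (TNR): {tnr:.0%}")     # 100%
```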