AI Evals For Engineers & PMs: A Retrospective

Evals
LLMs
The initial cohort of the AI Evals for Engineers & PMs course is now in the books. Below are some of my thoughts on the course, including why I’m recommending it to you and some lessons learned along the way.
Author

Wayde Gilliam

Published

June 15, 2025

Let’s Begin

Me when Hamel first told me he was putting together this course …

Actually, this has been “me” for any lecture/course/workshop/etc… that Hamel is involved with ever since he put on the wonderful Mastering LLMs For Developers & Data Scientists with Dan Becker last year (also strongly recommended for folks that are exploring evals-informed finetuning).

My #1 criterion for evaluating any course is this: Is it fun?

But what does that mean, you ask? Well, let’s make this retrospective a bit more interesting by answering that question in the context of lessons learned from the course itself.

What do I mean by “fun”?

Making your intentions clear when talking to LLMs and building evals is a recurring theme throughout the course. When it comes to “fun”, that means we need an agreed-upon definition, or “rubric” if you will, before we can really use it as an eval for determining what makes a course worth your time.

It could mean fun like “I was always laughing at their jokes and cracking up whenever someone posted a ‘look at your data’ meme”. It could mean “It was fun in the sense that I was entertained watching Hamel and Shreya in action”. It could mean a lot of things.

The question here is: If I’m building an evaluation rubric for “fun”, how could I formulate it in a way that would allow human and LLM judges to look at any course and determine if it would pass the “Wayde-Gilliam-Is-Fun-Eval”?

I will assume the role of “benevolent dictator” and answer that question by following the recommendations stated in chapter 4 of the course reader for building out your evals.

The “Fun” Eval

Chapter 4 encourages folks to “create a first version of the rubric [for determining what passes for a given eval].” This “includes a working definition of the evaluation criterion … and a few illustrative Pass/Fail examples”.

Here’s my initial “fun” rubric

A “fun” course is one where participants learn how to apply valuable skills in their own work through informative lectures, detailed reading material, collaboration with other students and experts in the field of study, and the ability to put the lessons learned into action.

If a given course includes most of the elements below, it meets the criteria and should be marked as “Pass”:

  • The course is taught by subject matter experts who are also good communicators to the stated course audience
  • The course includes interactive lectures by other subject matter experts related to this domain of study
  • The course includes a lively forum for students and experts to share insights, ask questions, and build community
  • The course includes a realistic project that drives home what is taught by allowing students to put it into practice

If the course description is missing most of these items, mark it as a “Fail”.

I think this is a pretty good definition if I do say so myself. Good enough that, in my opinion as the BD (benevolent dictator), I could give it to an LLM Judge along with any course description as context and get a very accurate pass/fail.
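To make that concrete, here is a minimal sketch of what handing the rubric to an LLM judge could look like. It assumes the OpenAI Python SDK and the gpt-4o model, and the judge prompt wording is my own illustration rather than anything prescribed by the course reader:

```python
# Minimal LLM-as-judge sketch for the "fun" rubric.
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY
# set in the environment; the prompt wording is my own illustration.
from openai import OpenAI

client = OpenAI()

FUN_RUBRIC = """\
A "fun" course is one where participants learn how to apply valuable skills
in their own work through informative lectures, detailed reading material,
collaboration with other students and experts in the field of study, and the
ability to put the lessons learned into action.

Mark the course as PASS if it includes most of these elements:
- Taught by subject matter experts who communicate well to the stated audience
- Interactive lectures by other subject matter experts in the domain
- A lively forum for students and experts to share insights and build community
- A realistic project that lets students put the material into practice

Otherwise mark it as FAIL.
"""

def judge_course(course_description: str) -> str:
    """Ask an LLM judge to apply the rubric to a course description."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model should work here
        messages=[
            {"role": "system", "content": FUN_RUBRIC},
            {
                "role": "user",
                "content": (
                    "Here is the course description:\n\n"
                    f"{course_description}\n\n"
                    "Answer with PASS or FAIL, followed by a one-sentence reason."
                ),
            },
        ],
    )
    return response.choices[0].message.content

# Example usage (course description file name is hypothetical):
# print(judge_course(open("course_description.txt").read()))
```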

Applying it to Shreya’s and Hamel’s “AI Evals” course, it’s an easy PASS! If you’re still not convinced, check out the many testimonials coming from other students here!

Lessons Learned

There are a lot of good blog posts coming out of this course from students, covering a myriad of really valuable insights. I want to touch upon three that I’ve found really important in the context of my actual work and trying to put some of these ideas into practice.

Lesson #1: If you aren’t doing evals, you are going to fail

Evals are hard. There is a cost to adopting an eval-centered or eval-directed approach when building your AI-powered applications. But if you go forward without one, you either don’t really care about the quality of your product and/or you’re going to fail hard at some point.

For example, my previous employer’s IT group deployed a fairly general chat application to a university. There were folks asking if perhaps they were creating a solution to a problem that didn’t exist. It even led me to asking how they were doing their evals … getting ghosted … asking again … and getting ghosted again. Talking to end users, of which I was one, the product pales in comparison to what you can do with ChatGPT or even discover via a Google search, and as such, more and more folks are just abandoning it.

If your organization is building AI into its offerings, ask them the same question … ask them how they are doing evals. If they ghost you or say it’s something they are thinking about, then you have your answer.

Lesson #2: If you don’t have domain experts involved, you are going to fail

Building good evals requires time from domain experts to ensure your prompts, eval rubrics, human/LLM judges, etc… perform in the way intended.

If you want to tell me that you don’t really care about the quality of your AI-powered application without telling me, just tell me the domain experts don’t have time to be a part of this process and that the developer should do it himself. FYI, this was the moment at my previous employment when I realized my supervisor didn’t really care about AI except as a bullet point in a PowerPoint presentation to sell clients on our services.

Lesson #3: You need to know how to write English well, or you are going to fail

You need to be able to articulate your intent very clearly to both LLMs and humans so they can accurately perform the tasks you ask of them. Clarity and detailed instructions in your prompts and evals are critical for success in this business.

Here are a few recommendations for improving your prompting acumen:

  1. Read

Reading improves your vocabulary and offers you a way to learn how to write by mimicking styles you find easy to follow yourself. Whether it’s good sci-fi, an article on talking to Claude, or a course reader on building evals, reading is such an easy way to improve your ability to communicate intent to LLMs and humans alike.

  2. Read the Vendor/Model best practices

Every major vendor publishes prompting best practices that tell you how their models prefer being talked to. Some of these guides are model-specific. Either way, knowing whether your model likes markdown or XML and how to properly instruct it can go a long way toward building quality AI-powered systems that produce the outcomes you want (see the small sketch after this list).

  3. Build

Start building some apps. Whether it’s for work or pleasure, start building out some AI-powered web applications to get practice communicating your intentions to both LLMs and your fellow human folk. One of the things I really liked about this course is that the real-world course project allows exactly that for students willing to put some time into it.
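To make the markdown-vs-XML point from recommendation #2 concrete, here is a tiny sketch of the same instruction written both ways. The task, field names, and tags are made up for illustration, so check your vendor’s own guide for what your model actually prefers:

```python
# Two ways to structure the same instruction, depending on the vendor's
# prompting guidance. The task and tag names below are hypothetical, chosen
# only to illustrate the difference in structure.

# Markdown-style sections (a common style in OpenAI's prompting examples)
markdown_prompt = """\
# Task
Summarize the support ticket below in two sentences.

# Ticket
{ticket_text}

# Output format
Return plain text only, no preamble.
"""

# XML-style tags (Anthropic's docs recommend tagging sections like this for Claude)
xml_prompt = """\
<task>Summarize the support ticket below in two sentences.</task>

<ticket>
{ticket_text}
</ticket>

<output_format>Return plain text only, no preamble.</output_format>
"""

# Example usage with a made-up ticket:
print(markdown_prompt.format(ticket_text="My login keeps failing after the last update."))
print(xml_prompt.format(ticket_text="My login keeps failing after the last update."))
```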

Conclusion

Don’t fail!

If you are in any role at a business building AI-powered anything, this course is worth your time. Evals are hard, and they are also essential. If you don’t understand where your application is failing or may fail, and you don’t have a systematic way to diagnose and then address these failures, you’re going to lose the trust of your users, who will eventually move on to other products that do what they want and need.

If you’re interested in keeping those users and building meaningful AI, this course is a great place to start!