Overview
This 27-minute tutorial explores three essential evaluation methods for Large Language Model (LLM) applications and demonstrates practical implementation strategies. Learn how to conduct effective "vibe checks" before moving on to three formal evaluation approaches: code-based, human-based, and LLM-based. The video walks through each method with practical examples, including a detailed case study showing how to improve an application using an LLM judge. Follow along with the provided GitHub code examples and supplementary resources to implement these evaluation frameworks in your own AI projects. Perfect for developers and AI enthusiasts looking to systematically assess and improve the performance of their LLM applications.
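
For a rough sense of what the three evaluation types covered below look like in practice, here is a minimal sketch. It is not taken from the video or its GitHub repo: the checks, prompts, and model name are placeholder assumptions, and the LLM-based judge assumes the `openai` Python client with an API key in the environment.

```python
# Illustrative sketch of the three evaluation types (not the video's code).
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the model name, prompts, and scoring scheme are placeholders.
from openai import OpenAI

client = OpenAI()


def code_based_eval(output: str) -> bool:
    """Type 1: a deterministic check, e.g. a title must be non-empty and short."""
    return 0 < len(output) <= 100


def human_based_eval(output: str) -> int:
    """Type 2: a human rates the output (here via a simple CLI prompt)."""
    print(output)
    return int(input("Rate this output from 1 to 5: "))


def llm_based_eval(task: str, output: str) -> int:
    """Type 3: an LLM judge scores the output against the original task."""
    judge_prompt = (
        "You are grading an AI assistant's response.\n"
        f"Task: {task}\n"
        f"Response: {output}\n"
        "Reply with a single integer score from 1 (poor) to 5 (excellent)."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": judge_prompt}],
    )
    # Assumes the judge follows instructions and returns only an integer.
    return int(result.choices[0].message.content.strip())


if __name__ == "__main__":
    task = "Suggest a YouTube title about evaluating LLM apps."
    response = "3 Ways to Evaluate Your LLM Application (with Examples)"
    print("Code-based pass:", code_based_eval(response))
    print("LLM judge score:", llm_based_eval(task, response))
```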
Syllabus
Intro - 0:00
Vibe Checks - 0:27
Evals - 3:26
Type 1: Code-based - 5:58
Type 2: Human-based - 9:34
Type 3: LLM-based - 13:34
Example: Improving y2b with LLM Judge - 15:28
Taught by
Shaw Talebi