Overview
Watch a 31-minute technical talk in which EleutherAI researcher and incoming Carnegie Mellon University PhD student Lintang Sutawika explores the challenges of evaluating language models in NLP and AI. Learn about benchmarking methodologies, evaluation practices, and common assessment tasks, and how they shape progress in language model research. The talk also covers zero-shot capabilities, training dynamics, and multilingual extensions, with direct connections to practical machine learning applications, and examines benchmark lifecycles, overfitting concerns, and model-level versus system-level evaluation, offering practical guidance for more reliable model comparison. Additional resources include the Pythia suite of open language models and EleutherAI's LM Evaluation Harness on GitHub for further exploration of language model evaluation.
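For viewers who want to experiment with the LM Evaluation Harness mentioned above, here is a minimal sketch (not shown in the talk) of scoring a Pythia checkpoint on a couple of common zero-shot tasks. The exact `simple_evaluate` signature, task names, and result layout vary between harness versions, so treat the specifics as assumptions and check the GitHub repository's README for your installed version.

```python
# Minimal sketch (assumptions, not from the talk): evaluate a small Pythia
# checkpoint with EleutherAI's LM Evaluation Harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face-backed model wrapper
    model_args="pretrained=EleutherAI/pythia-160m",  # any Pythia checkpoint could go here
    tasks=["lambada_openai", "hellaswag"],           # example zero-shot benchmarks
    num_fewshot=0,                                   # zero-shot setting discussed in the talk
    batch_size=8,
)

# results["results"] maps each task name to its metric dictionary
# (e.g. accuracy, perplexity), which is what gets reported on leaderboards.
for task, metrics in results["results"].items():
    print(task, metrics)
```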
Syllabus
- Introduction
- A Key Challenge in LM Evaluation
- What do we want to evaluate?
- LM-Specific Complications
- Evaluating Models vs Systems
- Life of a Benchmark
- Overfitting
- Addressing Evaluation Pitfalls
- LM Evaluation is Challenging
Taught by
Open Data Science