Overview
In this 10-minute conference talk from Conf42 LLMs 2025, Antara Raman Sahay explores how large language models are created, focusing on the role of code data in pre-training. Learn about the phases of training language models and the experimental frameworks used in model development. Examine how code data affects model performance, including findings on the optimal proportion and quality of code in the training mix. The presentation closes with practical recommendations for LLM development, future research directions, and key takeaways for anyone interested in language model creation.
Syllabus
00:00 Introduction to Language Models
00:18 Importance of Code Data in Pre-Training
01:16 Phases of Training Language Models
02:59 Experimental Framework and Setup
03:30 Impact of Code Data on Model Performance
04:53 Code Data Proportion and Quality
06:58 Key Findings and Recommendations
08:22 Future Research and Final Takeaways
Taught by
Conf42