Overview
Dive into a 36-minute technical video exploring how Microsoft's Phi-4 model achieves multimodal capabilities despite its relatively small size. Learn how the Mixture of LoRAs architecture enables Phi-4 to process text, images, and audio efficiently. The presentation walks through the four-stage training process, covering vision, audio, and reasoning training, with detailed explanations of the model architecture. Discover practical applications for Phi-4, see its capabilities demonstrated through live testing, and understand the technical details behind this breakthrough in small multimodal models. The video also takes a close look at the training data, particularly the visual data that gave Phi-4 its image-understanding abilities.
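The Mixture of LoRAs idea described above can be sketched in a few lines: a frozen base weight matrix is shared across modalities, and each modality gets its own small low-rank adapter that is added on top. The sketch below is a minimal NumPy illustration of that pattern, not Phi-4's actual implementation; the class and parameter names (`ModalityLoRALinear`, `rank`, the modality keys) are hypothetical.

```python
import numpy as np

# Minimal sketch of a "Mixture of LoRAs" linear layer, assuming the scheme
# described in the video: one frozen base weight shared by all modalities,
# plus a per-modality low-rank (A, B) adapter pair. Names are illustrative.
class ModalityLoRALinear:
    def __init__(self, d_in, d_out, rank=4, modalities=("vision", "audio"), seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out))  # frozen base weight (never trained further)
        # One low-rank adapter per modality; only these would be trained.
        # B starts at zero, so each adapter begins as an exact no-op.
        self.adapters = {
            m: (rng.standard_normal((d_in, rank)) * 0.01, np.zeros((rank, d_out)))
            for m in modalities
        }

    def forward(self, x, modality=None):
        y = x @ self.W  # base (text) path: base weights untouched
        if modality is not None:
            A, B = self.adapters[modality]
            y = y + x @ A @ B  # low-rank update: effectively W + A @ B for this modality
        return y

layer = ModalityLoRALinear(d_in=8, d_out=8)
x = np.ones((1, 8))
# With B initialized to zero, every modality path equals the frozen base path,
# illustrating why the base model's weights are not changed by adding adapters.
print(np.allclose(layer.forward(x), layer.forward(x, modality="vision")))  # True
```

This also illustrates the answer to the syllabus question "Are the Weights for the Base Model Changed?": in the LoRA scheme, only the small `A`/`B` matrices are trained, while `W` stays frozen.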
Syllabus
0:00 Intro
2:25 What You Could Use Phi-4 For
3:26 How Phi-4 Works Under the Hood
5:30 Model Architecture: Mixture of LoRAs
9:21 Q: Are the Weights for the Base Model Changed?
10:01 Testing Phi-4
13:56 Vision Training
14:41 The Four Stages of Training
17:56 Audio Training
19:26 Reasoning Training
21:56 Data and Training Details
24:59 The Visual Training Data
Taught by
Oxen