Overview
This tutorial video guides you through building a Vision Transformer (ViT) from scratch in PyTorch for image classification on a custom dataset. It covers the complete workflow from data preparation to prediction: loading and transforming an image dataset, creating patch embeddings, implementing Multi-Head Self-Attention, building a Transformer Encoder for image processing, training and optimizing the ViT model, and testing it on real data. Follow the step-by-step instructions covering installation, dataset exploration, image patch creation, model building, training, and prediction. The tutorial includes timestamps for easy navigation through each section, from the introduction to the final model test. The complete code is available through the provided link, and additional computer vision and visual language model tutorials can be found in the creator's playlists and blog.
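As a rough illustration of the components the overview mentions (patch embedding, Multi-Head Self-Attention, and a Transformer Encoder), here is a minimal PyTorch sketch. The class names, image size, patch size, and dimensions are assumptions chosen for illustration and are not taken from the tutorial's own code.

```python
# Minimal, illustrative sketch of the ViT building blocks the tutorial covers:
# patch embedding, Multi-Head Self-Attention, and a Transformer Encoder.
# All sizes and names here are assumptions, not the tutorial's actual code.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into patches and project each patch to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is a common way to patchify + linearly project in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)


class SimpleViT(nn.Module):
    """Patch embedding + class token + Transformer Encoder + classification head."""
    def __init__(self, num_classes, img_size=224, patch_size=16, embed_dim=768,
                 depth=6, num_heads=8):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        # Each encoder layer contains Multi-Head Self-Attention and an MLP block.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)                               # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # add position information
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                             # classify from the class token


# Quick shape check
model = SimpleViT(num_classes=10)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

Using a strided convolution for the patch projection and torch.nn.TransformerEncoder keeps the sketch compact; the tutorial itself walks through these pieces step by step.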
Syllabus
00:00 Introduction
00:55 Installation
04:15 Discover the dataset
06:46 How to load the dataset
15:46 How to split images into patches
30:40 Build and train the ViT model
46:10 Test the model's predictions (see the workflow sketch after this syllabus)
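For orientation, here is a hedged sketch of the workflow these chapters walk through: loading a custom image-folder dataset, training the model, and running a prediction. The dataset paths, transforms, and hyperparameters are illustrative assumptions rather than the tutorial's exact settings; SimpleViT refers to the sketch in the Overview above.

```python
# Hedged sketch of the syllabus workflow: load a custom dataset, train, predict.
# Paths, transforms, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the dataset: resize + normalize, one folder per class (hypothetical path).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
train_data = datasets.ImageFolder("data/train", transform=transform)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

# Build and train the ViT model (SimpleViT from the sketch above).
model = SimpleViT(num_classes=len(train_data.classes)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # illustrative epoch count
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

# Test the model's prediction on a single image (hypothetical path).
model.eval()
with torch.no_grad():
    img = transform(Image.open("data/test/sample.jpg").convert("RGB"))
    pred = model(img.unsqueeze(0).to(device)).argmax(dim=1).item()
    print("predicted class:", train_data.classes[pred])
```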
Taught by
Eran Feit