Create a Large Language Model from Scratch with Python – Tutorial

Overview

Learn how to create a large language model from scratch using Python in this 6-hour tutorial. The course covers data handling, math, and transformers behind large language models. You will work with Python libraries, PyTorch, and implement concepts like embedding vectors, gradient descent, self-attention, and transformer architecture. The course is designed for individuals interested in natural language processing, machine learning, and deep learning.

Syllabus

Intro
Install Libraries
Pylzma build tools
Jupyter Notebook
Download wizard of oz
Experimenting with text file
Character-level tokenizer
Types of tokenizers
Tensors instead of Arrays
Linear Algebra heads up
Train and validation splits
Premise of Bigram Model
Inputs and Targets
Inputs and Targets Implementation
Batch size hyperparameter
Switching from CPU to CUDA
PyTorch Overview
CPU vs GPU performance in PyTorch
More PyTorch Functions
Embedding Vectors
Embedding Implementation
Dot Product and Matrix Multiplication
Matmul Implementation
Int vs Float
Recap and get_batch
nnModule subclass
Gradient Descent
Logits and Reshaping
Generate function and giving the model some context
Logits Dimensionality
Training loop + Optimizer + Zerograd explanation
Optimizers Overview
Applications of Optimizers
Loss reporting + Train VS Eval mode
Normalization Overview
ReLU, Sigmoid, Tanh Activations
Transformer and Self-Attention
Transformer Architecture
Building a GPT, not Transformer model
Self-Attention Deep Dive
GPT architecture
Switching to Macbook
Implementing Positional Encoding
GPTLanguageModel initalization
GPTLanguageModel forward pass
Standard Deviation for model parameters
Transformer Blocks
FeedForward network
Multi-head Attention
Dot product attention
4:19:43 Why we scale by 1/sqrtdk
Sequential VS ModuleList Processing
Overview Hyperparameters
Fixing errors, refining
Begin training
OpenWebText download and Survey of LLMs paper
How the dataloader/batch getter will have to change
Extract corpus with winrar
Python data extractor
Adjusting for train and val splits
Adding dataloader
Training on OpenWebText
Training works well, model loading/saving
Pickling
Fixing errors + GPU Memory in task manager
Command line argument parsing
Porting code to script
Prompt: Completion feature + more errors
nnModule inheritance + generation cropping
Pretraining vs Finetuning
R&D pointers
Outro