Overview
Syllabus
00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS Architecture by Canopy Labs
27:34 Inference and Cloning with CSM-1B
40:13 Context-Aware Text-to-Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning: Full Fine-tuning and LoRA Fine-tuning
01:09:55 Running Full Fine-tuning
01:19:33 Running LoRA Fine-tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning and Fine-tuning
01:35:48 The Future of Token-Based Multimodal Models
Taught by
Trelis Research