Classroom Contents
Fine-tune Text to Speech Models: CSM-1B and Orpheus TTS
- 1 00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4?
- 2 01:04 End-to-End Multimodal Models and Their Capabilities
- 3 02:36 Traditional Approaches to Text-to-Speech
- 4 03:06 Token-Based Approaches and Their Advantages
- 5 03:25 Detailed Look at Orpheus and CSM-1B Models
- 6 06:58 Training and Inference with Token-Based Models
- 7 12:53 Hierarchical Tokenization for High-Quality Audio
- 8 14:11 Kyutai’s Moshi Model for Text + Speech
- 9 23:41 Sesame’s CSM-1B Model Architecture
- 10 25:13 Orpheus TTS Architecture by Canopy Labs
- 11 27:34 Inferencing and Cloning with CSM-1B
- 12 40:13 Context Aware Text to Speech with CSM-1B
- 13 48:21 Orpheus Inference and Cloning - FREE Colab
- 14 55:09 Orpheus Voice Cloning Setup
- 15 01:01:20 Orpheus Fine-tuning: Full Fine-tuning and LoRA Fine-tuning
- 16 01:09:55 Running Full Fine-tuning
- 17 01:19:33 Running LoRA Fine-tuning
- 18 01:25:20 Inference and Comparison
- 19 01:29:27 Inference with Cloning AND fine-tuning
- 20 01:35:48 The future of token-based multi-modal models