Fine-tune Text to Speech Models: CSM-1B and Orpheus TTS

Trelis Research via YouTube

Overview

This tutorial from Trelis Research covers fine-tuning of text-to-speech models, focusing on CSM-1B and Orpheus TTS. It traces the evolution of end-to-end audio and text models and compares traditional TTS approaches with newer token-based methods. Learn about hierarchical tokenization for high-quality audio generation, and explore the architectures of Sesame's CSM-1B, Canopy Labs' Orpheus, and Kyutai's Moshi. The tutorial provides hands-on demonstrations of voice cloning, inference, and fine-tuning (both full and LoRA approaches) with practical Colab notebooks. It also covers context-aware text-to-speech and the future of token-based multi-modal models. Repository resources, slides, and implementation examples accompany the video.
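To make the difference between full and LoRA fine-tuning concrete, here is a minimal sketch in plain Python. It is not taken from the tutorial or its notebooks. The function names, the layer size, and the rank are hypothetical and chosen only for illustration. It assumes the standard LoRA formulation, in which a frozen weight matrix W is adjusted by a low-rank product scaled by alpha / r.

```python
# Illustrative sketch only; not code from the tutorial.
# All names and sizes below are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable adapter matrix A, shape (r, d_in)
    b: trainable adapter matrix B, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)  # shape (d_out, d_in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank=None):
    """Count trainable weights for one linear layer.

    Full fine-tuning updates every entry of W (d_out * d_in).
    LoRA updates only A and B (rank * (d_in + d_out)).
    """
    if rank is None:
        return d_out * d_in
    return rank * (d_in + d_out)


# Hypothetical 2048 x 2048 projection with a rank-16 adapter.
full = trainable_params(2048, 2048)
lora = trainable_params(2048, 2048, rank=16)
print(full, lora, full // lora)
```

With these assumed sizes, the adapter trains 64 times fewer weights for that layer than full fine-tuning does. This is the usual reason LoRA fits on smaller GPUs, at the cost of restricting updates to a low-rank subspace.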

Syllabus

00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4?
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS Architecture by Canopy Labs
27:34 Inferencing and Cloning with CSM-1B
40:13 Context-Aware Text to Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning: Full Fine-tuning and LoRA Fine-tuning
01:09:55 Running Full Fine-tuning
01:19:33 Running LoRA Fine-tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning AND fine-tuning
01:35:48 The future of token-based multi-modal models

Taught by

Trelis Research
