SEA-LION - Representing Diverse Southeast Asian Languages with Large Language Models
Databricks via YouTube
Overview
Explore the development of SEA-LION, an open-source large language model designed to represent the diverse languages and cultural contexts of Southeast Asia, in this 36-minute conference talk. Discover how AI Singapore collaborated with Databricks MosaicML to create a localized LLM capable of handling multiple languages, including Thai, Indonesian, and Tamil, as well as unique linguistic phenomena like code-switching between dialects. Learn about the design considerations, from customizing tokenizers for regional languages to ensuring cost-effectiveness for resource-constrained organizations. Gain insights into potential applications and the long-term vision for this innovative model that aims to bridge the gap in language representation for Southeast Asian communities.
Syllabus
SEA-LION: Representing the Diverse Languages of Southeast Asia with LLMs
Taught by
Databricks