Overview
Explore how ByteDance's infrastructure team enhanced Apache Spark to support large-scale GPU-based model inference on Kubernetes. Learn about the challenges faced in migrating from Hadoop to Kubernetes, including shortfalls in GPU compute supply, limits on resource pool scaling, and wasted capacity on online-service machines. Discover the solutions implemented through GPU-sharing technology, mixed GPU scheduling, Spark engine improvements, and platform enhancements. Gain insight into how these advances enabled the processing of 8 billion multi-modal training data points on 7,000 mixed-workload GPUs in just 7.5 hours, significantly improving resource efficiency and stability for Wanka model inference practices.
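To make the topics above concrete, the sketch below shows how a GPU-backed Spark batch inference job can be submitted to Kubernetes using Spark's standard GPU resource-scheduling settings. This is a minimal illustration, not ByteDance's actual setup: the image name, namespace, discovery-script path, and resource amounts are assumptions.

```shell
# Hypothetical spark-submit for GPU inference on Kubernetes.
# - spark.executor.resource.gpu.amount requests one GPU per executor,
#   mapped to the nvidia.com/gpu device-plugin resource via the vendor conf.
# - A fractional spark.task.resource.gpu.amount (0.25) lets four tasks
#   share one executor GPU, in the spirit of the GPU sharing described above.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name gpu-model-inference \
  --conf spark.kubernetes.namespace=inference \
  --conf spark.kubernetes.container.image=<registry>/spark-gpu:latest \
  --conf spark.executor.instances=100 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh \
  --conf spark.task.resource.gpu.amount=0.25 \
  local:///opt/spark/jobs/inference.py
```

The fractional task-level GPU amount is what allows multiple inference tasks to be packed onto a single physical GPU, which is one standard way to raise GPU utilization for throughput-bound batch workloads.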
Syllabus
Bytedance Spark Supports Wanka Model Inference Practices
Taught by
The ASF