Overview
This 31-minute technical talk explores the architecture behind Oracle's zettascale OCI Superclusters, designed for next-generation AI workloads. Oracle SVP Pradeep Vincent and VP Jag Brar walk through the technical foundations that enable massive AI infrastructure scaling: cluster networks supporting up to 131,072 NVIDIA Blackwell GPUs, and RDMA implementations delivering ultra-high throughput at ultra-low latency. They cover innovations engineered specifically for generative AI workloads, including advanced link-resiliency techniques that improve reliability in massive clusters, strategies for mitigating network disruptions caused by link flaps, and collectives-aware traffic load balancing. The presentation steps through each component of the architecture, from network fundamentals to load balancing, and closes with key takeaways for implementing scalable AI infrastructure. It is part of Oracle's technical series on generative AI infrastructure, with links to related episodes covering complementary topics.
Syllabus
00:00 - Zettascale OCI Superclusters
01:28 - Cluster Networks for Zettascale
03:12 - RDMA at Zettascale with Ultra-high Throughput
06:30 - Innovating for Gen AI Workloads
08:48 - Achieving Ultra-low Network Latency
13:09 - Advanced Link Resiliency for Enhanced Gen AI Workload Reliability
14:58 - Achieving Reliability in Massive Clusters
16:59 - Defining Link Flap Network Disruptions
21:47 - Mitigating Link Flaps
24:49 - Advanced Traffic Load Balancing
27:18 - Collectives-Aware Load Balancing
29:57 - Conclusion and Key Takeaways
Taught by
Oracle