Sponsor’s Workshop
Monday, May 20, 2024 ● 14:00 – 16:00 ● Room: Georgia B
14:00–15:30
Invited Talks/Keynotes
15:30–16:00
Coffee Break
16:00–17:30
Panel: Network Optimization for Large-Scale AI Clusters
With the rapid growth in the size of modern AI models, large-scale distributed clusters on the order of 1K, 10K, or even 100K cards have been widely deployed to meet memory and computation requirements. Communication time increases as a cluster grows, and the network can become a bottleneck, resulting in sub-linear scaling in distributed training. Designing high-performance network systems that optimize communication in AI clusters is both critical and challenging. Such optimization includes, but is not limited to, efficient network topologies, routing algorithms, traffic engineering, communication protocols, collective scheduling, and fast fault discovery and recovery. This panel will discuss insights, challenges, and opportunities in network optimization of large-scale clusters for distributed AI training and inference.
17:30–18:00
Networking Break