      Optimizing Parallelization and Overlap to Increase Training Efficiency using Megatron-Core

      , Solution Architect, NVIDIA
      , Foundation Model Training Director, Kuaishou Technology (pre-recorded)
      We’ll first discuss how to analyze profiling results on thousands of GPUs. Next, we’ll show our optimization solutions for each parallelism strategy, built on Megatron-Core. We built a performance model to tune SM resources so that communication overlaps with computation as much as possible; both tensor parallelism and data parallelism can employ this model. For pipeline parallelism, we summarize the regions that can be overlapped. Then we explore efficient CUDA kernel optimizations, especially on Hopper. Finally, we investigate dynamically adaptive parallelism and pipelining solutions for MoE.
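
      The sketch below illustrates the SM-tuning idea in its simplest form: given profiled times for a layer's compute and its collective, sweep how many SMs the communication kernel may occupy and keep the split with the shortest estimated step time. This is a hypothetical toy model under crude scaling assumptions, not the performance model presented in the session; all numbers and names are illustrative.

```python
# Toy sketch of tuning the SM split between compute and a concurrent collective.
# Assumptions (illustrative only): compute slows down in proportion to the SMs
# it gives up, and the collective speeds up with more CTAs only until it
# saturates. Whatever communication fits under the compute time is hidden.

TOTAL_SMS = 132            # example budget (e.g. one H100 GPU)
COMM_SATURATION_SMS = 32   # assume the collective stops speeding up past this

def step_time(compute_ms: float, comm_ms_saturated: float, comm_sms: int) -> float:
    """Estimated step time when `comm_sms` SMs run the collective and the rest compute."""
    compute = compute_ms * TOTAL_SMS / (TOTAL_SMS - comm_sms)
    comm = comm_ms_saturated * max(1.0, COMM_SATURATION_SMS / comm_sms)
    # The overlapped portion of communication is free; the step costs the longer kernel.
    return max(compute, comm)

def tune_comm_sms(compute_ms: float, comm_ms_saturated: float) -> int:
    """Sweep even SM counts and return the one minimizing the estimated step time."""
    return min(
        range(2, TOTAL_SMS - 1, 2),
        key=lambda sms: step_time(compute_ms, comm_ms_saturated, sms),
    )

if __name__ == "__main__":
    # Example: 3.0 ms of GEMMs and a 0.8 ms (saturated) gradient all-reduce.
    # Fully serialized this costs ~3.8 ms; the tuned overlap estimate is ~3.2 ms.
    best = tune_comm_sms(compute_ms=3.0, comm_ms_saturated=0.8)
    print(f"grant the collective {best} SMs -> "
          f"est. step time {step_time(3.0, 0.8, best):.2f} ms")
```

      In a real system, the SM share of collectives would typically be controlled through the communication library (for example, recent NCCL releases expose CTA-count settings such as NCCL_MAX_CTAS), with the model calibrated from profiles like those discussed in the session.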

      We applied all of these solutions to train our large language model with hundreds of billions of parameters. Compared with the baseline, the overlapped communication time increased by 2.6x; the remaining communication lies on the critical path and cannot be overlapped. End-to-end performance improved by more than 25%. These analysis and optimization techniques can be applied widely across models and training scales.
      Event: GTC 24
      Date: March 2024
      Level: Beginner Technical
      NVIDIA Technology: Cloud / Data Center GPU, NeMo
      Industry: Consumer Internet
      Topic: Performance Optimization
      Language: English
      Location: