      Optimizing Parallelization and Overlap to Increase Training Efficiency using Megatron-Core

      , Solution Architect, NVIDIA
      , Foundation Model Training Director, Kuaishou Technology (pre-recorded)
      We’ll first discuss how to analyze profiling results on thousands of GPUs. Next, we’ll show our optimization solutions for each parallelism strategy, built on Megatron-Core. We built a performance model to tune SM resources so that communication overlaps with computation as much as possible; both tensor parallelism and data parallelism can employ this model. For pipeline parallelism, we summarize the regions that can be overlapped. Then we explore efficient CUDA kernel optimizations, especially on Hopper. Finally, we investigate dynamically adaptive parallelism and pipelining solutions for MoE.
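
      The sketch below illustrates the SM-tuning idea in its simplest form: given profiled times for a layer's compute and its collective, sweep how many SMs the communication kernel may occupy and keep the split with the shortest estimated step time. This is a hypothetical toy model under crude scaling assumptions, not the performance model presented in the session; all numbers and names are illustrative.

```python
# Toy sketch of tuning the SM split between compute and a concurrent collective.
# Assumptions (illustrative only): compute slows down in proportion to the SMs
# it gives up, and the collective speeds up with more CTAs only until it
# saturates. Whatever communication fits under the compute time is hidden.

TOTAL_SMS = 132            # example budget (e.g. one H100 GPU)
COMM_SATURATION_SMS = 32   # assume the collective stops speeding up past this

def step_time(compute_ms: float, comm_ms_saturated: float, comm_sms: int) -> float:
    """Estimated step time when `comm_sms` SMs run the collective and the rest compute."""
    compute = compute_ms * TOTAL_SMS / (TOTAL_SMS - comm_sms)
    comm = comm_ms_saturated * max(1.0, COMM_SATURATION_SMS / comm_sms)
    # The overlapped portion of communication is free; the step costs the longer kernel.
    return max(compute, comm)

def tune_comm_sms(compute_ms: float, comm_ms_saturated: float) -> int:
    """Sweep even SM counts and return the one minimizing the estimated step time."""
    return min(
        range(2, TOTAL_SMS - 1, 2),
        key=lambda sms: step_time(compute_ms, comm_ms_saturated, sms),
    )

if __name__ == "__main__":
    # Example: 3.0 ms of GEMMs and a 0.8 ms (saturated) gradient all-reduce.
    # Fully serialized this costs ~3.8 ms; the tuned overlap estimate is ~3.2 ms.
    best = tune_comm_sms(compute_ms=3.0, comm_ms_saturated=0.8)
    print(f"grant the collective {best} SMs -> "
          f"est. step time {step_time(3.0, 0.8, best):.2f} ms")
```

      In a real system, the SM share of collectives would typically be controlled through the communication library (for example, recent NCCL releases expose CTA-count settings such as NCCL_MAX_CTAS), with the model calibrated from profiles like those discussed in the session.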

      We applied all of these solutions to train our large language model with hundreds of billions of parameters. Compared with the baseline, the overlapped communication time increased by 2.6x; the remaining communication lies on the critical path and cannot be overlapped. End-to-end performance improved by more than 25%. These analysis and optimization techniques can be applied widely across models and training scales.
      Event: GTC 24
      Date: March 2024
      Level: Beginner Technical
      NVIDIA Technology: Cloud / Data Center GPU, NeMo
      Industry: Consumer Internet
      Topic: Performance Optimization
      Language: English
      Location: