      A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

      , Research Scientist, The Ohio State University
      Autoregressive models, despite their commendable performance in myriad generative tasks, face challenges due to their inherently sequential structure. This severely impedes computational efficiency, because a typical inference request may require thousands of tokens. The resulting memory contention becomes pronounced in real deployment, where requests arrive randomly and require varying generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention. To address these shortcomings, we propose a new temporal fusion framework, named Flover, for efficiently inferring multiple requests in parallel. By orchestrating temporal-level parallelism with a fast buffer reordering algorithm that allows memory eviction of finished tasks, it delivers over 11x inference speed-up on GPT and 16x on Llama 65B, compared to the cutting-edge solutions provided by FasterTransformer.
      Event: GTC 24
      Date: March 2024
      Level: Advanced Technical
      Topic: AI Inference
      Industry: All Industries
      NVIDIA Technology: CUDA, NCCL, Nsight Compute, Nsight Systems, TensorRT
      Language: English
      Location: