      A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

      , Research Scientist, The Ohio State University
      Autoregressive models, despite their commendable performance in myriad generative tasks, face challenges due to their inherently sequential structure. This severely impedes computational efficiency, because a typical inference request may require thousands of tokens. The resulting memory contention becomes pronounced in real deployment, where requests arrive randomly and require varying generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention. To address these shortcomings, we propose a new temporal fusion framework, named Flover, for efficiently inferring multiple requests in parallel. By orchestrating temporal-level parallelism with a fast buffer reordering algorithm that allows memory eviction of finished tasks, it delivers over 11x inference speed-up on GPT and 16x on Llama 65B, compared to the cutting-edge solutions provided by FasterTransformer.
      Event: GTC 24
      Date: March 2024
      Level: Advanced Technical
      Topic: AI Inference
      Industry: All Industries
      NVIDIA Technology: CUDA, NCCL, Nsight Compute, Nsight Systems, TensorRT
      Language: English
      Location: