Accelerate Super Long-Context LLM Inference

, Product Director of Big Data and AI Platform, Alibaba Cloud Intelligence Group
As the context length in LLM serving continues to increase, inference time escalates dramatically. To address this, we devise three key techniques: a fine-grained sparse attention algorithm for the pre-filling phase, backed by a high-performance token-sparsity attention kernel built on Tensor Cores with CUTLASS CuTe, which delivers an 8X speedup at a 10X sparsity rate without sacrificing accuracy; block-sparsity attention for the decoding phase, which further doubles the sparsity rate while preserving accuracy through novel KV cache clustering; and dynamic chunked pipeline parallelism, which decomposes a long sequence into chunks whose sizes are adaptively updated based on an analytical cost model to balance pipeline stages, yielding roughly a 2X acceleration over Tensor Parallelism on eight GPUs. Together, these optimizations deliver over 20X and 4X speedups for the pre-filling and decoding phases, respectively, while preserving accuracy at a context length of 1 million tokens.
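As an illustration of the first technique, here is a minimal NumPy sketch of per-query top-k token-sparse attention for the pre-filling phase. It is only a reference for the idea of keeping roughly 10% of keys per query; the selection heuristic, the actual CUTLASS CuTe Tensor Core kernel, and the names sparse_prefill_attention and keep_ratio below are assumptions, not the speakers' implementation.

import numpy as np

def sparse_prefill_attention(q, k, v, keep_ratio=0.1):
    # Causal attention where each query attends only to its top-scoring keys.
    # q, k, v: [seq_len, head_dim] arrays for a single head.
    # keep_ratio: fraction of visible keys kept per query (0.1 ~ a 10X sparsity rate).
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros_like(v)
    for i in range(seq_len):
        scores = (q[i] @ k[: i + 1].T) * scale            # causal: only keys 0..i are visible
        n_keep = max(1, int(np.ceil((i + 1) * keep_ratio)))
        top = np.argpartition(scores, -n_keep)[-n_keep:]  # indices of the top-k keys
        w = np.exp(scores[top] - scores[top].max())
        w /= w.sum()                                      # softmax over the kept keys only
        out[i] = w @ v[top]
    return out

A dense pre-fill would score all i+1 keys for every query; keeping only ~10% of them per query is the sparsity that the quoted 8X kernel speedup exploits.

For the third technique, the sketch below shows one way an analytical cost model could pick chunk sizes so that each pipeline stage does roughly equal work: under causal attention, later chunks attend to more prior tokens, so equal-cost chunks shrink toward the end of the sequence. The cost form and the coefficients linear_coef and attn_coef are hypothetical placeholders, not the talk's actual model.

def balanced_chunks(seq_len, num_chunks, linear_coef=1.0, attn_coef=0.01):
    def chunk_cost(start, size):
        # linear term (projections/MLP) plus attention over all preceding tokens
        return linear_coef * size + attn_coef * size * (start + size / 2.0)

    target = chunk_cost(0, seq_len) / num_chunks          # aim for equal cost per chunk
    chunks, start = [], 0
    for _ in range(num_chunks - 1):
        remaining = seq_len - start
        if remaining <= 0:
            chunks.append(0)
            continue
        lo, hi = 1, remaining
        while lo < hi:                                    # smallest size whose cost reaches target
            mid = (lo + hi) // 2
            if chunk_cost(start, mid) < target:
                lo = mid + 1
            else:
                hi = mid
        chunks.append(lo)
        start += lo
    chunks.append(seq_len - start)                        # last chunk takes the remainder
    return chunks

For example, balanced_chunks(1_000_000, 8) yields progressively smaller chunks, keeping the work of the eight pipeline stages roughly equal, which is the balancing the abstract describes.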

Event: GTC 25
Date: March 2025
Topic: AI Platforms / Deployment - AI Inference / Inference Microservices
NVIDIA Technology: Cloud / Data Center GPU, CUDA, Hopper, cuBLAS, NCCL, Nsight Compute, Nsight Systems, NVLink / NVSwitch
Industry: Cloud Services
Level: General
Language: English
Location: