Accelerate Super Long-Context LLM Inference
, Product Director of Big Data and AI platform, Alibaba Cloud Intelligence Group
As the context length of LLM serving continues to increase, inference time escalates dramatically. To address this issue, we devise three key techniques: a fine-grained sparse attention algorithm for the pre-filling phase, paired with a high-performance token-sparsity attention kernel built on Tensor Cores with CUTLASS CuTe, which delivers an 8X speedup at a 10X sparsity rate without sacrificing accuracy; block-sparsity attention for the decoding phase, which further doubles the sparsity rate while preserving accuracy through novel KV cache clustering; and dynamic chunked pipeline parallelism, which decomposes a long sequence into chunks whose size is adaptively updated based on an analytical cost model to balance pipeline stages, yielding roughly 2X acceleration over Tensor Parallelism on eight GPUs. Together, these optimizations deliver over 20X and 4X speedups for the pre-filling and decoding phases, respectively, while preserving accuracy at a context length of 1 million tokens.
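To illustrate the general idea behind the pre-filling technique, here is a minimal NumPy sketch of token-sparsity attention: for each query, only a small fraction of key/value tokens is attended to. The function name, the keep_ratio parameter, and the top-k selection heuristic are illustrative assumptions, not the speaker's exact algorithm; in particular, a production kernel would use a cheap importance estimate and a fused Tensor Core implementation (e.g., via CUTLASS CuTe) rather than materializing the full score matrix as done here for compactness.

    import numpy as np

    def token_sparse_attention(q, k, v, keep_ratio=0.1):
        """Illustrative sketch of per-query token-sparse attention.

        For each query, keep only the top `keep_ratio` fraction of
        key/value tokens (by raw score) and run softmax attention over
        that subset. Hypothetical names and selection heuristic.
        """
        seq_len, d = k.shape
        keep = max(1, int(seq_len * keep_ratio))          # e.g. keep 10% of tokens
        scores = q @ k.T / np.sqrt(d)                     # (num_q, seq_len); a real kernel avoids this full matrix
        out = np.empty((q.shape[0], v.shape[1]))
        for i in range(q.shape[0]):
            top = np.argpartition(scores[i], -keep)[-keep:]  # indices of kept tokens
            s = scores[i, top]
            w = np.exp(s - s.max())
            w /= w.sum()                                     # softmax over kept tokens only
            out[i] = w @ v[top]
        return out

    # Toy usage: 1,024 keys, keep 10% per query.
    rng = np.random.default_rng(0)
    q = rng.standard_normal((4, 64))
    k = rng.standard_normal((1024, 64))
    v = rng.standard_normal((1024, 64))
    print(token_sparse_attention(q, k, v).shape)  # (4, 64)

With 10% of tokens kept, the attention compute per query drops roughly in proportion to the sparsity rate, which is the source of the prefill speedup the abstract describes.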
Event: GTC 25
Date: March 2025
Topic: AI Platforms / Deployment - AI Inference / Inference Microservices
NVIDIA Technologies: Cloud / Data Center GPU, CUDA, Hopper, cuBLAS, NCCL, Nsight Compute, Nsight Systems, NVLink / NVSwitch