Accelerate Super Long-Context LLM Inference
, Product Director of Big Data and AI platform, Alibaba Cloud Intelligence Group
As the context length of LLM serving continues to increase, inference time escalates dramatically. To address this issue, we devise three key techniques: a fine-grained sparse attention algorithm for the pre-filling phase, paired with a high-performance token-sparsity attention kernel built on Tensor Cores with CUTLASS CuTe, which delivers an 8X speedup at a 10X sparsity rate without sacrificing accuracy; block-sparsity attention for the decoding phase, which further doubles the sparsity rate while preserving accuracy through novel KV cache clustering; and dynamic chunked pipeline parallelism, which decomposes a long sequence into chunks whose size is adaptively updated based on an analytical cost model to balance pipeline stages, yielding roughly 2X acceleration over Tensor Parallelism on eight GPUs. Together, these optimizations deliver over 20X and 4X speedups for the pre-filling and decoding phases, respectively, while preserving accuracy at a context length of 1 million tokens.
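To illustrate the general idea behind the pre-filling technique, here is a minimal NumPy sketch of token-sparsity attention: for each query, only a small fraction of key/value tokens is attended to. The function name, the keep_ratio parameter, and the top-k selection heuristic are illustrative assumptions, not the speaker's exact algorithm; in particular, a production kernel would use a cheap importance estimate and a fused Tensor Core implementation (e.g., via CUTLASS CuTe) rather than materializing the full score matrix as done here for compactness.

    import numpy as np

    def token_sparse_attention(q, k, v, keep_ratio=0.1):
        """Illustrative sketch of per-query token-sparse attention.

        For each query, keep only the top `keep_ratio` fraction of
        key/value tokens (by raw score) and run softmax attention over
        that subset. Hypothetical names and selection heuristic.
        """
        seq_len, d = k.shape
        keep = max(1, int(seq_len * keep_ratio))          # e.g. keep 10% of tokens
        scores = q @ k.T / np.sqrt(d)                     # (num_q, seq_len); a real kernel avoids this full matrix
        out = np.empty((q.shape[0], v.shape[1]))
        for i in range(q.shape[0]):
            top = np.argpartition(scores[i], -keep)[-keep:]  # indices of kept tokens
            s = scores[i, top]
            w = np.exp(s - s.max())
            w /= w.sum()                                     # softmax over kept tokens only
            out[i] = w @ v[top]
        return out

    # Toy usage: 1,024 keys, keep 10% per query.
    rng = np.random.default_rng(0)
    q = rng.standard_normal((4, 64))
    k = rng.standard_normal((1024, 64))
    v = rng.standard_normal((1024, 64))
    print(token_sparse_attention(q, k, v).shape)  # (4, 64)

With 10% of tokens kept, the attention compute per query drops roughly in proportion to the sparsity rate, which is the source of the prefill speedup the abstract describes.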
Event: GTC 25
Date: March 2025
Topic: AI Platforms / Deployment - AI Inference / Inference Microservices
NVIDIA Technologies: Cloud / Data Center GPU, CUDA, Hopper, cuBLAS, NCCL, Nsight Compute, Nsight Systems, NVLink / NVSwitch