Accelerate Inference on NVIDIA GPUs
, CTO, Together AI
Deploying large language models for inference at scale is inherently complex, often requiring intricate optimizations across compute-bound and memory-bound regimes. We examine automatic vertical fusion, epilogue optimization, and adaptive kernel generation across batch sizes for GEMV and GEMM workloads, addressing key efficiency concerns, from NVIDIA CUDA® graph captures and optimized all-reduce strategies to custom kernel registrations. We'll highlight Together AI's journey in optimizing inference performance across the stack.
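To make the CUDA graph capture technique mentioned above concrete, here is a minimal, hypothetical sketch using the CUDA 12 runtime API. It is not Together AI's implementation: the `scale` kernel stands in for one step of a decode loop, and the iteration counts are arbitrary. The point is the pattern itself: capture a sequence of kernel launches once, instantiate the graph, then replay it to amortize per-launch CPU overhead.

```cuda
#include <cuda_runtime.h>

// Hypothetical lightweight kernel standing in for one step of an
// inference loop; real deployments would capture GEMV/GEMM launches.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a fixed sequence of launches into a graph instead of
    // paying per-launch CPU overhead on every step.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int step = 0; step < 4; ++step)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.001f, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once (CUDA 12 signature), then replay cheaply.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

Graph replay pays off most in the memory-bound, small-batch (GEMV) regime, where kernels are short and launch overhead would otherwise dominate.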
Event: GTC 25
Date: March 2025
Topic: AI Platforms / Deployment - AI Inference / Inference Microservices