Accelerate Inference on NVIDIA GPUs

CTO, Together AI
Deploying large language models for inference at scale is inherently complex, often requiring intricate optimizations across compute-bound and memory-bound regimes. We examine automatic vertical fusion, epilogue optimization, and adaptive kernel generation across batch sizes for GEMV and GEMM workloads, addressing key efficiency concerns, from NVIDIA CUDA® graph captures and optimized all-reduce strategies to custom kernel registrations. We'll highlight Together AI's journey in optimizing inference performance across the stack.
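The CUDA graph capture mentioned in the abstract refers to recording a fixed sequence of kernel launches once and replaying it as a single unit, which amortizes per-launch CPU overhead during decode-style inference. Below is a minimal sketch using the CUDA runtime's stream-capture API; it is not Together AI's implementation. The `scale` kernel, sizes, and loop counts are illustrative placeholders, and the three-argument `cudaGraphInstantiate` assumes CUDA 12 or later.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for one step of an inference pipeline.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record a fixed sequence of launches into a graph. During capture,
    // the kernels are not executed; their launch parameters are recorded.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int step = 0; step < 4; ++step) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 0.5f, n);
    }
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once (CUDA 12+ signature), then replay cheaply:
    // each cudaGraphLaunch issues all four kernels with one API call.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    for (int iter = 0; iter < 100; ++iter) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

In an inference server, the captured region would typically cover one decode iteration at a fixed batch shape, with a separate graph (or a graph update) per shape bucket, since captured launch parameters are frozen at instantiation time.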

Event: GTC 25
Date: March 2025
Topic: AI Platforms / Deployment - AI Inference / Inference Microservices
Industry: All Industries
NVIDIA Technology: CUDA
Level: General
Language: English
Location: