Accelerate Inference on NVIDIA GPUs
, CTO, Together AI
Deploying large language models for inference at scale is inherently complex, often requiring intricate optimizations across compute-bound and memory-bound regimes. We examine automatic vertical fusion, epilogue optimization, and adaptive kernel generation across batch sizes for GEMV and GEMM workloads, addressing key efficiency concerns, from NVIDIA CUDA® graph captures and optimized all-reduce strategies to custom kernel registrations. We'll highlight Together AI's journey in optimizing inference performance across the stack.
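To make the CUDA graph capture technique mentioned above concrete, here is a minimal, hypothetical sketch using the CUDA 12 runtime API. It is not Together AI's implementation: the `scale` kernel stands in for one step of a decode loop, and the iteration counts are arbitrary. The point is the pattern itself: capture a sequence of kernel launches once, instantiate the graph, then replay it to amortize per-launch CPU overhead.

```cuda
#include <cuda_runtime.h>

// Hypothetical lightweight kernel standing in for one step of an
// inference loop; real deployments would capture GEMV/GEMM launches.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a fixed sequence of launches into a graph instead of
    // paying per-launch CPU overhead on every step.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int step = 0; step < 4; ++step)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.001f, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once (CUDA 12 signature), then replay cheaply.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

Graph replay pays off most in the memory-bound, small-batch (GEMV) regime, where kernels are short and launch overhead would otherwise dominate.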
Event: GTC 25
Date: March 2025
Topic: AI Platforms / Deployment - AI Inference / Inference Microservices