Details
The Good, the Bad, and the Ugly: Cost-Efficient LLM Inference With Quantization, Pruning, and Distillation
, Solutions Architect, NVIDIA
, Solutions Architect, NVIDIA
, Solutions Architect, NVIDIA
, Solutions Architect, NVIDIA
Dive into the theoretical aspects and practical use cases of compression techniques used to create more efficient versions of preferred models. Learn the main methods of compression — quantization, pruning, and distillation — in detail, and compare the trade-offs and options for combining multiple compression techniques. You'll become well-equipped to perform efficient inference and reduce costs when deploying LLMs. You'll understand the trade-offs in accuracy, latency, and hardware requirements for each method, and gain hands-on experience deploying them using solutions from the NVIDIA stack.

Prerequisite(s):
Python. Basic knowledge of LLMs. Familiarity with deep learning frameworks.
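The session abstract names quantization as one of the three compression methods. As a rough illustration of the core idea (a minimal sketch in plain NumPy, not the implementation used in the NVIDIA stack), symmetric per-tensor int8 post-training quantization maps float weights onto the integer range [-127, 127] via a single scale factor:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: floats -> [-127, 127]."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

# Toy weight tensor; real LLM weights are large matrices quantized the same way.
w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-element rounding error is bounded by scale / 2.
```

This halves-or-quarters memory footprint relative to fp16/fp32 at the cost of bounded rounding error, which is the accuracy/latency/hardware trade-off the session discusses.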
Event: GTC 25
Date: March 2025
Industry: All Industries
NVIDIA Technology: Cloud / Data Center GPU, TensorRT, Hopper, cuDF, NeMo, Triton
Topic: Development and Optimization - Performance Optimization