      The Good, the Bad, and the Ugly: Cost-Efficient LLM Inference With Quantization, Pruning, and Distillation

      , Solutions Architect, NVIDIA
      , Solutions Architect, NVIDIA
      , Solutions Architect, NVIDIA
      , Solutions Architect, NVIDIA
      Dive into the theoretical aspects and practical use cases of compression techniques for creating more efficient versions of your preferred models. Learn the main compression methods (quantization, pruning, and distillation) in detail, and compare the trade-offs and options for combining multiple techniques. You'll become well-equipped to perform efficient inference and reduce costs when deploying LLMs, understand the trade-offs in accuracy, latency, and hardware requirements for each method, and gain hands-on experience deploying them with solutions from the NVIDIA stack.
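To make the quantization idea concrete before the session, here is a minimal sketch of symmetric per-tensor int8 post-training quantization in NumPy. The function names and single-scale scheme are illustrative assumptions for this sketch, not the specific NVIDIA tooling covered in the talk.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 using one per-tensor scale (illustrative)."""
    scale = np.abs(w).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)

q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # rounding error is bounded by scale / 2
print(q.dtype, float(err))
```

Production stacks (e.g. TensorRT-style INT8 deployment) typically go further than this sketch, using per-channel scales and calibration data to pick ranges that minimize accuracy loss.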
      Prerequisite(s):

      Python.
      Basic knowledge of LLMs.
      Familiarity with deep learning frameworks.
      Event: GTC 25
      Date: March 2025
      Industry: All Industries
      NVIDIA Technology: Cloud / Data Center GPU, TensorRT, Hopper, cuDF, NeMo, Triton
      Topic: Development and Optimization - Performance Optimization
      Level: General
      Language: English
      Location: