      The Good, the Bad, and the Ugly: Cost-Efficient LLM Inference With Quantization, Pruning, and Distillation

      , Solutions Architect, NVIDIA
      , Solutions Architect, NVIDIA
      , Solutions Architect, NVIDIA
      , Solutions Architect, NVIDIA
      Dive into the theoretical aspects and practical use cases of compression techniques for creating more efficient versions of your preferred models. Learn the main compression methods (quantization, pruning, and distillation) in detail, and compare the trade-offs and options for combining multiple techniques. You'll become well-equipped to perform efficient inference and reduce costs when deploying LLMs, understand the trade-offs in accuracy, latency, and hardware requirements for each method, and gain hands-on experience deploying them with solutions from the NVIDIA stack.
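To make the quantization idea concrete before the session, here is a minimal sketch of symmetric per-tensor int8 post-training quantization in NumPy. The function names and single-scale scheme are illustrative assumptions for this sketch, not the specific NVIDIA tooling covered in the talk.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 using one per-tensor scale (illustrative)."""
    scale = np.abs(w).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)

q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # rounding error is bounded by scale / 2
print(q.dtype, float(err))
```

Production stacks (e.g. TensorRT-style INT8 deployment) typically go further than this sketch, using per-channel scales and calibration data to pick ranges that minimize accuracy loss.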
      Prerequisite(s):

      Python.
      Basic knowledge of LLMs.
      Familiarity with deep learning frameworks.
      Event: GTC 25
      Date: March 2025
      Industry: All Industries
      NVIDIA Technology: Cloud / Data Center GPU, TensorRT, Hopper, cuDF, NeMo, Triton
      Topic: Development and Optimization - Performance Optimization
      Level: General
      Language: English
      Location: