Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT

, Senior Deep Learning Engineer, NVIDIA
, Manager, AI/ML, NVIDIA
Because running inference on AI models at large scale is computationally costly, optimization techniques are crucial for lowering inference cost. Our tutorial presents the TensorRT Model Optimization toolkit, NVIDIA's gateway for algorithmic model optimization. The toolkit provides a set of state-of-the-art quantization methods, including FP8, INT8, INT4, and mixed precisions, as well as hardware-accelerated sparsity, and bridges those methods with the most advanced NVIDIA deployment solutions, such as TensorRT-LLM. The tutorial includes an end-to-end optimization-to-deployment demo for language models with TensorRT-LLM and for Stable Diffusion models with TensorRT. You can download the notebooks here: nvidia_ammo-0.9.0.tar.gz.
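As a rough illustration of the workflow the abstract describes, the sketch below applies post-training FP8 quantization to a Hugging Face language model. It is a minimal sketch, not the tutorial's notebook code: it assumes the PyTorch quantization module shipped in the nvidia_ammo package (ammo.torch.quantization) with a quantize() entry point and an FP8_DEFAULT_CFG preset, and uses a small placeholder model; exact names and signatures may differ across releases.

import torch
import ammo.torch.quantization as atq  # assumed module path from nvidia_ammo-0.9.0
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical small model; the tutorial targets larger LLMs
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def calibrate(m):
    # Forward a few representative prompts so the quantizer can collect
    # activation statistics before the quantization scales are frozen.
    prompts = ["Hello, world!", "Quantization lowers inference cost."]
    with torch.no_grad():
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt").to("cuda")
            m(**inputs)

# Post-training quantization to FP8; the abstract's INT8/INT4 and
# mixed-precision options would slot into the same call via other presets.
model = atq.quantize(model, atq.FP8_DEFAULT_CFG, forward_loop=calibrate)
# The quantized model can then be exported to a TensorRT-LLM checkpoint
# and built into an engine for deployment, as the demo walks through.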
Event: GTC 24
Date: March 2024
Topic: AI Inference
Industry: All Industries
Level: Intermediate Technical
NVIDIA Technology: TensorRT
Language: English
Location: