Accelerated LLM Model Alignment and Deployment in NeMo, TensorRT-LLM, and Triton Inference Server

, Deep Learning Solutions Architect, NVIDIA
, Deep Learning Solutions Architect, NVIDIA
The demand for accelerated large language models (LLMs) has surged with the growing popularity of generative models. These models, often boasting billions of parameters, hold immense potential but also pose challenges during large-scale deployment. Join us as we delve into accelerated LLM alignment using the NeMo Framework, and inference optimization and deployment with NVIDIA TensorRT-LLM and Triton Inference Server.
We'll spotlight supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA, key approaches for LLM alignment. We'll also uncover the intricacies of inference optimization with TensorRT-LLM, highlighting KV caching, paged attention, in-flight batching, and the pivotal role they play in making LLMs faster and more cost-effective. We'll take you through the crucial steps of fine-tuning, optimizing, and deploying a LLaMA model in a production environment using Triton Inference Server.
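To make the PEFT (LoRA) idea mentioned above concrete, here is a minimal NumPy sketch of the core LoRA mechanism, independent of NeMo's actual API: a frozen weight matrix `W` is augmented with a trainable low-rank update `B @ A`, so only a small fraction of parameters is fine-tuned. All dimensions and variable names here are illustrative assumptions, not values from the session.

```python
import numpy as np

# Minimal LoRA (Low-Rank Adaptation) sketch for one linear layer.
# Instead of updating the frozen pretrained weight W, LoRA learns a
# low-rank update delta_W = B @ A with rank r << d, which greatly
# reduces the number of trainable parameters.

d_in, d_out, r = 512, 512, 8               # hypothetical dimensions
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                    # trainable, zero-initialized
alpha = 16                                  # LoRA scaling factor

def forward(x):
    # Frozen base path plus the scaled low-rank adapter path.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = forward(x)

# Because B starts at zero, the adapted model is initially
# identical to the frozen base model.
assert np.allclose(y, W @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full fine-tune: {full_params}")
```

With rank 8 on a 512x512 layer, the adapter trains roughly 3% of the parameters a full fine-tune would, which is why LoRA is attractive for aligning billion-parameter models on modest hardware.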
Prerequisite(s):

Familiarity with Python, large language models, and deep learning frameworks


Explore more training options offered by the NVIDIA Deep Learning Institute (DLI). Choose from an extensive catalog of self-paced, online courses or instructor-led virtual workshops to help you develop key skills in AI, HPC, graphics & simulation, and more.
Ready to validate your skills? Get NVIDIA certified and distinguish yourself in the industry.

Event: GTC 24
Date: March 2024
NVIDIA technologies: Cloud / Data Center GPU, HGX, NCCL, NeMo, TensorRT, Triton
Level: Intermediate Technical
Topic: Large Language Models (LLMs)
Industry: Retail / Consumer Packaged Goods
Language: English
Location: