Accelerated LLM Model Alignment and Deployment in NeMo, TensorRT-LLM, and Triton Inference Server

, Deep Learning Solutions Architect, NVIDIA
, Deep Learning Solutions Architect, NVIDIA
The demand for accelerated large language models (LLMs) has surged with the growing popularity of generative models. These models, often boasting billions of parameters, hold immense potential but also pose challenges during large-scale deployment. Join us as we delve into accelerated LLM alignment using the NeMo Framework, and inference optimization and deployment with NVIDIA TensorRT-LLM and Triton Inference Server.
We'll spotlight supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA, key approaches for LLM alignment. We'll also uncover the intricacies of inference optimization with TensorRT-LLM, highlighting KV caching, paged attention, in-flight batching, and the pivotal role they play in making LLMs faster and more cost-effective. We'll take you through the crucial steps of fine-tuning, optimizing, and deploying a LLaMA model in a production environment using Triton Inference Server.
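To make the PEFT (LoRA) idea mentioned above concrete, here is a minimal NumPy sketch of the core LoRA mechanism, independent of NeMo's actual API: a frozen weight matrix `W` is augmented with a trainable low-rank update `B @ A`, so only a small fraction of parameters is fine-tuned. All dimensions and variable names here are illustrative assumptions, not values from the session.

```python
import numpy as np

# Minimal LoRA (Low-Rank Adaptation) sketch for one linear layer.
# Instead of updating the frozen pretrained weight W, LoRA learns a
# low-rank update delta_W = B @ A with rank r << d, which greatly
# reduces the number of trainable parameters.

d_in, d_out, r = 512, 512, 8               # hypothetical dimensions
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                    # trainable, zero-initialized
alpha = 16                                  # LoRA scaling factor

def forward(x):
    # Frozen base path plus the scaled low-rank adapter path.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = forward(x)

# Because B starts at zero, the adapted model is initially
# identical to the frozen base model.
assert np.allclose(y, W @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full fine-tune: {full_params}")
```

With rank 8 on a 512x512 layer, the adapter trains roughly 3% of the parameters a full fine-tune would, which is why LoRA is attractive for aligning billion-parameter models on modest hardware.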
Prerequisite(s):

Familiarity with Python, large language models, and deep learning frameworks


Explore more training options offered by the NVIDIA Deep Learning Institute (DLI). Choose from an extensive catalog of self-paced, online courses or instructor-led virtual workshops to help you develop key skills in AI, HPC, graphics & simulation, and more.
Ready to validate your skills? Get NVIDIA certified and distinguish yourself in the industry.

Event: GTC 24
Date: March 2024
NVIDIA technologies: Cloud / Data Center GPU, HGX, NCCL, NeMo, TensorRT, Triton
Level: Intermediate Technical
Topic: Large Language Models (LLMs)
Industry: Retail / Consumer Packaged Goods
Language: English
Location: