Accelerating Generative AI With TensorRT-LLM to Enhance Seller Experience at Amazon
, Senior AI DevTech Engineer, NVIDIA
, Senior Software Development Engineer, Amazon
In the realm of generative AI and large language model (LLM) applications in production, stringent latency and throughput targets during inference pose challenges for traditional unaccelerated solutions. We'll introduce the key GPU optimizations behind TensorRT-LLM and Triton, including quantization, in-flight batching, speculative decoding, and more.
We'll provide a comprehensive overview of TensorRT-LLM's end-to-end support for the full spectrum of decoder-only, encoder-decoder, and multimodal models. The Amazon Catalog team will co-present and explain how they reduced latency and increased throughput using TensorRT-LLM. We'll close with a case study demonstrating the practical application of generative AI to enhance the seller experience and optimize product content.
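Of the optimizations named above, speculative decoding is perhaps the least self-explanatory. As a rough illustration only (this is not TensorRT-LLM's API, and `draft_model`/`target_model` are hypothetical stand-ins), the core idea is that a small draft model cheaply proposes several tokens, and the large target model verifies them in a single pass, accepting the longest matching prefix plus one corrected token:

```python
def speculative_step(context, draft_model, target_model, k=4):
    """One speculative-decoding step (greedy-verification sketch).

    `draft_model` and `target_model` are hypothetical callables that map a
    token list to the next token. Real systems verify probabilistically and
    batch the target's forward pass; this toy keeps only the accept/reject
    structure.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks the proposal; keep the matching prefix.
    accepted, ctx = [], list(context)
    for t in proposed:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # Mismatch: take the target's token instead and stop.
            accepted.append(expected)
            break
    else:
        # Every draft token matched: the target's pass yields one bonus token.
        accepted.append(target_model(ctx))
    return accepted
```

When the draft model agrees with the target on most tokens, each target-model pass emits several tokens instead of one, which is where the latency win comes from.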