Accelerating Generative AI With TensorRT-LLM to Enhance Seller Experience at Amazon
, Senior AI DevTech Engineer, NVIDIA
, Senior Software Development Engineer, Amazon
In the realm of generative AI and large language model (LLM) applications in production, stringent latency and throughput targets during inference pose challenges for traditional unaccelerated solutions. We'll introduce the key GPU optimizations behind TensorRT-LLM and Triton, including quantization, in-flight batching, speculative decoding, and more.
We'll provide a comprehensive overview of TensorRT-LLM's end-to-end support for the full spectrum of decoder-only, encoder-decoder, and multimodal models. The Amazon Catalog team will co-present and explain how they reduced latency and increased throughput using TensorRT-LLM. We'll close with a case study demonstrating the practical application of generative AI to enhance the seller experience and optimize product content.
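Of the optimizations named above, speculative decoding is perhaps the least self-explanatory. As a rough illustration only (this is not TensorRT-LLM's API, and `draft_model`/`target_model` are hypothetical stand-ins), the core idea is that a small draft model cheaply proposes several tokens, and the large target model verifies them in a single pass, accepting the longest matching prefix plus one corrected token:

```python
def speculative_step(context, draft_model, target_model, k=4):
    """One speculative-decoding step (greedy-verification sketch).

    `draft_model` and `target_model` are hypothetical callables that map a
    token list to the next token. Real systems verify probabilistically and
    batch the target's forward pass; this toy keeps only the accept/reject
    structure.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks the proposal; keep the matching prefix.
    accepted, ctx = [], list(context)
    for t in proposed:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # Mismatch: take the target's token instead and stop.
            accepted.append(expected)
            break
    else:
        # Every draft token matched: the target's pass yields one bonus token.
        accepted.append(target_model(ctx))
    return accepted
```

When the draft model agrees with the target on most tokens, each target-model pass emits several tokens instead of one, which is where the latency win comes from.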