      Accelerating Generative AI With TensorRT-LLM to Enhance Seller Experience at Amazon

      , Senior AI DevTech Engineer, NVIDIA
      , Senior Software Development Engineer, Amazon
      In the realm of generative AI and large language model (LLM) applications in production, stringent latency and throughput targets during inference pose challenges for traditional unaccelerated solutions. We'll introduce the key GPU optimizations behind TensorRT-LLM and Triton, including quantization, in-flight batching, speculative decoding, and more.
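The in-flight (continuous) batching mentioned above lets finished sequences leave the batch between decode steps so waiting requests can fill the freed slots, instead of the whole batch draining before new work starts. A minimal scheduling sketch, using toy per-request token counts in plain Python (this is an illustration of the idea, not TensorRT-LLM's actual scheduler):

```python
from collections import deque

def inflight_batching(request_lengths, max_batch_size):
    """Return the number of decode steps needed to serve all requests.

    request_lengths: tokens each request still needs to generate (toy stand-in
    for real generation lengths).
    """
    waiting = deque(range(len(request_lengths)))
    remaining = list(request_lengths)
    active = set()
    steps = 0
    while waiting or active:
        # Admit waiting requests into free batch slots (in-flight admission).
        while waiting and len(active) < max_batch_size:
            active.add(waiting.popleft())
        steps += 1  # one decode step advances every active sequence
        for rid in list(active):
            remaining[rid] -= 1
            if remaining[rid] == 0:
                active.remove(rid)  # slot is freed immediately, mid-batch
    return steps
```

With requests needing [3, 1, 2] tokens and a batch size of 2, this serves everything in 3 steps; static batching would run the first pair for max(3, 1) = 3 steps and the last request for 2 more, 5 in total.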

      We'll provide a comprehensive overview of TensorRT-LLM's end-to-end support for the full spectrum of decoder-only, encoder-decoder, and multi-modal models. The Amazon Catalog team will co-present and explain how they successfully reduced latency and increased throughput using TensorRT-LLM. We'll end with a compelling case study demonstrating the practical application of generative AI to enhance the seller experience and optimize product content.
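The speculative decoding mentioned earlier can be sketched in a few lines: a cheap draft model proposes k tokens, the target model verifies them, and the longest agreeing prefix is accepted, with the target supplying its own token at the first mismatch. The greedy toy "models" below are plain functions; this illustrates the acceptance logic only, not TensorRT-LLM's implementation:

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Advance `prefix` by verifying k drafted tokens; return the new prefix.

    draft_next / target_next: callables mapping a token list to the next token
    (toy greedy stand-ins for draft and target models).
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: accept the longest prefix the target model agrees with.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break

    # On the first mismatch, emit the target's own token instead,
    # so every step makes progress even if nothing was accepted.
    if len(accepted) < k:
        accepted.append(target_next(ctx))
    return prefix + accepted
```

When draft and target agree, one step yields k tokens for a single verification pass; when they diverge, the step still produces one correct target token, so output quality matches target-only decoding.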
      Event: GTC 24
      Date: March 2024
      Level: Advanced Technical
      Industry: Consumer Internet
      Topic: Large Language Models (LLMs)
      NVIDIA Technology: TensorRT, Triton
      Language: English
      Location: