      Accelerating End-to-End Large Language Models System using a Unified Inference Architecture and FP8

      Team Manager, NVIDIA
      Software Architect, Baichuan Intelligent
      Baichuan's large language model (LLM) applications consist of a series of models of different sizes (from hundreds of megabytes to tens or even hundreds of gigabytes) and types (autoregressive and non-autoregressive). Deploying such diverse models in a cluster under a unified architecture while making the best use of GPU resources is a challenge. Baichuan runs model inference on a unified architecture built on Triton Inference Server and TensorRT-LLM, which reduces system maintenance costs and improves GPU utilization. To address the memory-capacity and memory-bandwidth bottlenecks of large models, the L40S GPU was selected, and inference uses low-bit weights together with FP8 activations and an FP8 KV cache. As a result, inference speed improves and cost drops significantly through TensorRT-LLM and optimizations built on top of it.
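      To illustrate the FP8 activation and KV-cache quantization mentioned above, here is a minimal sketch of per-tensor symmetric scaling into the FP8 E4M3 dynamic range (maximum finite value 448). It is a simplified NumPy model for illustration only: it captures the range scaling and clipping but not E4M3's mantissa rounding, and it is not the actual TensorRT-LLM implementation. The function names are hypothetical.

      ```python
      import numpy as np

      FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

      def fp8_quantize(x):
          """Per-tensor symmetric scaling into the FP8 E4M3 range.

          Returns the scaled/clipped tensor and the scale needed to
          dequantize. Only the dynamic-range handling is modeled here,
          not E4M3's coarse mantissa rounding.
          """
          scale = np.abs(x).max() / FP8_E4M3_MAX
          q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
          return q, scale

      def fp8_dequantize(q, scale):
          # Recover the original dynamic range.
          return q * scale

      # Example: quantize a block of activations (or a KV-cache slice).
      x = np.random.randn(4, 8).astype(np.float32)
      q, s = fp8_quantize(x)
      x_hat = fp8_dequantize(q, s)
      ```

      In a real deployment the scale is typically calibrated offline per tensor (or per channel), so that activations and KV-cache entries can be stored in 8 bits, halving memory traffic relative to FP16.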
      Event: GTC 24
      Date: March 2024
      Topic: AI Inference
      Industry: Cloud Services
      Level: Intermediate Technical
      NVIDIA Technology: TensorRT, Triton
      Language: English
      Location: