Accelerating End-to-End Large Language Models System using a Unified Inference Architecture and FP8
Team Manager, NVIDIA
Software Architect, Baichuan Intelligent
The Baichuan large language model (LLM) application comprises a family of models of different sizes (from hundreds of megabytes to hundreds of gigabytes) and types (autoregressive and non-autoregressive). Deploying such diverse models in a cluster under a unified architecture, while making the best use of GPU resources, is a challenge. Baichuan runs model inference on a unified architecture built from Triton Inference Server and TensorRT-LLM, which reduces system maintenance costs and improves GPU utilization. To address the memory-capacity and memory-bandwidth bottlenecks of large models, the L40S GPU was selected, and inference uses low-bit weights together with FP8 activations and an FP8 KV cache. Finally, TensorRT-LLM, along with specific optimizations built on top of it, delivers large gains in inference speed and substantial cost reductions.
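To illustrate the FP8 KV-cache idea mentioned above, the sketch below simulates per-tensor FP8 (E4M3) quantization in plain Python/NumPy. It is a minimal approximation for intuition only, not TensorRT-LLM's actual kernels: the scale computation, the 3-mantissa-bit rounding, and the mock KV-cache shape are all illustrative assumptions.

    import numpy as np

    # FP8 E4M3 has a maximum representable magnitude of 448.
    FP8_E4M3_MAX = 448.0

    def fp8_quantize(tensor: np.ndarray) -> tuple[np.ndarray, float]:
        """Quantize a float32 tensor onto a simulated FP8 E4M3 grid.

        Returns the quantized values (kept as float32 here for clarity)
        and the per-tensor scale needed to dequantize.
        """
        amax = float(np.abs(tensor).max())
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scaled = tensor / scale
        # Approximate E4M3 by keeping 3 mantissa bits after normalizing
        # each element by its power-of-two exponent (ignores subnormals
        # and exponent-range clamping for simplicity).
        mantissa_bits = 3
        exponent = np.floor(np.log2(np.maximum(np.abs(scaled), 1e-12)))
        step = 2.0 ** (exponent - mantissa_bits)
        quantized = np.round(scaled / step) * step
        quantized = np.clip(quantized, -FP8_E4M3_MAX, FP8_E4M3_MAX)
        return quantized, scale

    def fp8_dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
        return quantized * scale

    # Example: quantize a mock KV-cache block and measure the error.
    kv_block = np.random.randn(2, 128, 64).astype(np.float32)  # (heads, seq, dim)
    q, s = fp8_quantize(kv_block)
    error = np.abs(fp8_dequantize(q, s) - kv_block).mean()
    print(f"scale={s:.4f}, mean abs error={error:.6f}")

Because FP8 halves the KV cache's footprint relative to FP16 at a small accuracy cost, it relieves both memory capacity and memory bandwidth, which is the bottleneck the abstract identifies for large-model inference on the L40S.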