Accelerating End-to-End Large Language Models System using a Unified Inference Architecture and FP8
Team Manager, NVIDIA
Software Architect, Baichuan Intelligent
The Baichuan large language model (LLM) application comprises a family of models of different sizes (from hundreds of megabytes to hundreds of gigabytes) and types (autoregressive and non-autoregressive). Deploying such diverse models in a cluster under a unified architecture, while making the best use of GPU resources, is a challenge. Baichuan runs model inference on a unified architecture built from Triton Inference Server and TensorRT-LLM, which reduces system maintenance costs and improves GPU utilization. To address the memory-capacity and memory-bandwidth bottlenecks of large models, the L40S GPU was selected, and inference uses low-bit weights together with FP8 activations and an FP8 KV cache. Finally, TensorRT-LLM, along with specific optimizations built on top of it, delivers large gains in inference speed and substantial cost reductions.
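To illustrate the FP8 KV-cache idea mentioned above, the sketch below simulates per-tensor FP8 (E4M3) quantization in plain Python/NumPy. It is a minimal approximation for intuition only, not TensorRT-LLM's actual kernels: the scale computation, the 3-mantissa-bit rounding, and the mock KV-cache shape are all illustrative assumptions.

    import numpy as np

    # FP8 E4M3 has a maximum representable magnitude of 448.
    FP8_E4M3_MAX = 448.0

    def fp8_quantize(tensor: np.ndarray) -> tuple[np.ndarray, float]:
        """Quantize a float32 tensor onto a simulated FP8 E4M3 grid.

        Returns the quantized values (kept as float32 here for clarity)
        and the per-tensor scale needed to dequantize.
        """
        amax = float(np.abs(tensor).max())
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scaled = tensor / scale
        # Approximate E4M3 by keeping 3 mantissa bits after normalizing
        # each element by its power-of-two exponent (ignores subnormals
        # and exponent-range clamping for simplicity).
        mantissa_bits = 3
        exponent = np.floor(np.log2(np.maximum(np.abs(scaled), 1e-12)))
        step = 2.0 ** (exponent - mantissa_bits)
        quantized = np.round(scaled / step) * step
        quantized = np.clip(quantized, -FP8_E4M3_MAX, FP8_E4M3_MAX)
        return quantized, scale

    def fp8_dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
        return quantized * scale

    # Example: quantize a mock KV-cache block and measure the error.
    kv_block = np.random.randn(2, 128, 64).astype(np.float32)  # (heads, seq, dim)
    q, s = fp8_quantize(kv_block)
    error = np.abs(fp8_dequantize(q, s) - kv_block).mean()
    print(f"scale={s:.4f}, mean abs error={error:.6f}")

Because FP8 halves the KV cache's footprint relative to FP16 at a small accuracy cost, it relieves both memory capacity and memory bandwidth, which is the bottleneck the abstract identifies for large-model inference on the L40S.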