Faster Transformer Plus Triton: Experimental Approach on Multi-node for Giant NLP Model Inference
, Senior AI Developer Technology Engineer, NVIDIA
With a model like GPT-3, it is impossible to run inference of the entire model on a single GPU. We must extend serving to multiple GPUs, or even multiple nodes. We'll demonstrate how to integrate FasterTransformer, a highly optimized and flexible transformer library, with the Triton Inference Server to serve GPT-3 (175B) and Megatron-Turing (530B) on multi-GPU, multi-node deployments within one second, a common latency threshold in real applications.