A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
, Research Scientist, The Ohio State University
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure: a typical inference request can require thousands of tokens, which severely impedes computational efficiency. The resulting memory-contention overhead becomes profound in real deployment, where requests arrive randomly and necessitate various generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention. To address these shortcomings, we propose Flover, a temporal fusion framework for efficiently inferring multiple requests in parallel. By orchestrating temporal-level parallelism with a fast buffer-reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speed-up on GPT and 16x on Llama 65B compared with the cutting-edge solutions provided by FasterTransformer.
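The fuse-and-evict idea in the abstract can be illustrated with a minimal sketch (not the authors' implementation; all names here are hypothetical): requests with different target lengths share one fused batch, every step advances each active request by one token, and finished requests are evicted by compacting the buffer so active entries stay contiguous.

```python
# Hedged sketch of temporal fusion with buffer compaction.
# Not Flover's actual code: the fused forward pass is stubbed out,
# and function/variable names are made up for illustration.

def fused_decode(target_lens):
    """target_lens: tokens each request must generate.

    Returns per-request token counts and the buffer-occupancy trace,
    showing the buffer shrinking as requests finish.
    """
    # Each slot holds (request_id, tokens_remaining).
    buf = [(i, n) for i, n in enumerate(target_lens)]
    generated = [0] * len(target_lens)
    trace = []
    while buf:
        trace.append(len(buf))          # active slots this step
        nxt = []
        for rid, remaining in buf:      # one fused decoding step
            generated[rid] += 1         # (real model call stubbed out)
            if remaining - 1 > 0:       # keep unfinished requests only
                nxt.append((rid, remaining - 1))
        buf = nxt                       # compaction evicts finished slots
    return generated, trace

gen, trace = fused_decode([3, 1, 2])
print(gen)    # → [3, 1, 2]: each request produced its target length
print(trace)  # → [3, 2, 1]: occupancy drops as requests complete
```

Because the buffer stays densely packed, each step only pays for the requests that are still generating, which is the intuition behind evicting finished tasks rather than padding to the longest request.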