Large Models are not Always Expensive: Large Scale Mixture of Expert Models with Efficient Inference Empowers Microsoft Translator with Best Models (Presented by Microsoft Azure)
, Senior AI Developer Technology Engineer, NVIDIA
, Principal Researcher, Microsoft
Giant transformer models with billions of parameters are achieving state-of-the-art results on various natural language processing tasks. Such large models require computation that scales linearly with the number of parameters. To combat this problem, Mixture of Experts (MoE) introduces an architecture in which the computation required is sub-linear in the number of parameters. We'll demonstrate the most efficient implementation of MoE on a single GPU to date, achieving a 10-20x speedup over standard PyTorch libraries. We'll give an overview of the model architecture and training techniques, and describe extensions applied to NVIDIA’s FasterTransformer library to achieve state-of-the-art performance. Lastly, we'll demonstrate how we serve our 5-billion-parameter model using Azure Machine Learning (AML) and Triton Inference Server to perform document translation for several language pairs with multilingual machine translation systems. We believe this work will help unlock the potential of MoE models for production scenarios. Watch this session for a chance to win a special SWAG Box sponsored by Microsoft and NVIDIA.
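The sub-linear compute property comes from conditional routing: each token activates only a small subset of experts, so adding experts grows the parameter count without growing per-token FLOPs. The following is a minimal, illustrative PyTorch sketch of a top-1-gated MoE feed-forward layer; it is not the speakers' implementation, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy Mixture-of-Experts feed-forward layer with top-1 gating (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # A gating network scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- batch and sequence dimensions flattened beforehand.
        scores = F.softmax(self.gate(x), dim=-1)    # (tokens, num_experts)
        top_score, top_idx = scores.max(dim=-1)     # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only tokens routed to this expert pay its FLOPs:
                # compute stays roughly constant as num_experts grows.
                out[mask] = top_score[mask, None] * expert(x[mask])
        return out

# Hypothetical usage: 16 tokens with d_model=512 routed across 8 experts.
tokens = torch.randn(16, 512)
layer = MoELayer(d_model=512, d_ff=2048, num_experts=8)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

Production implementations such as the FasterTransformer extensions discussed in the session replace the per-expert Python loop with fused, batched GPU kernels; the sketch above only conveys the routing idea.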