Serving 1 Million BERT Inference Requests for only 20 cents on AWS Cloud

, NVIDIA

Attention-based models like BERT have revolutionized natural language processing (NLP) thanks to their ability to outperform traditional models on language tasks, as shown by their high scores on various NLP benchmarks. However, even the smaller BERT models have more than 100 million parameters, making it difficult to achieve near-real-time inference speeds on generic compute hardware. GPUs have generally outperformed CPUs for BERT inference, but they typically cost more than CPU instances on AWS. The newer Tensor Core GPUs have proven more cost-effective and efficient for running inference workloads.

In this talk, we present a solution for performing inference with the popular BERT model in under 4 ms using NVIDIA T4 GPUs on AWS EC2 G4 instances. We cover specific optimizations of model layers, such as Softmax, bias-term addition, Gaussian Error Linear Units (GELU), and multi-head attention, that significantly accelerate BERT inference performance. Our solution improves the performance of NLP tasks like question answering and of classification tasks like sentiment analysis and domain classification. All of this work has been implemented in the Apache MXNet and GluonNLP frameworks and is available in the latest MXNet release.

Lastly, we cover how a user can leverage Amazon SageMaker to deploy the optimized BERT model and serve one million BERT inference requests for less than 20 cents on AWS.
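
To make the layer-optimization discussion concrete: one widely used trick in fast GELU kernels is to replace the exact erf-based definition with a tanh approximation, which fuses cheaply into surrounding GPU kernels. The sketch below (plain NumPy/SciPy, not the talk's actual CUDA code) shows both forms and their close agreement; it illustrates the technique only.

    import numpy as np
    from scipy.special import erf

    def gelu_exact(x):
        # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
        return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

    def gelu_tanh(x):
        # Tanh approximation commonly used in fused inference kernels.
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    x = np.linspace(-5.0, 5.0, 101)
    # Max difference is on the order of 1e-3, acceptable for inference.
    print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))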
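
Because the work ships through Apache MXNet and GluonNLP, loading the pretrained BERT-base model follows the standard GluonNLP 0.x model-zoo API. The sketch below is an assumed outline of such a setup, not the talk's exact benchmark harness; the model and dataset names come from the public model zoo.

    import mxnet as mx
    import gluonnlp as nlp

    ctx = mx.gpu(0)  # an NVIDIA T4 on a G4 instance

    # Pretrained 12-layer BERT-base from the GluonNLP model zoo.
    bert, vocab = nlp.model.get_model(
        'bert_12_768_12',
        dataset_name='book_corpus_wiki_en_uncased',
        pretrained=True,
        ctx=ctx,
        use_pooler=True,
        use_decoder=False,
        use_classifier=False,
    )

    tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
    transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=128, pair=False)

    token_ids, valid_len, segment_ids = transform(('is this review positive?',))
    seq_encoding, pooled = bert(
        mx.nd.array([token_ids], ctx=ctx),    # token ids, batch of 1
        mx.nd.array([segment_ids], ctx=ctx),  # segment ids
        mx.nd.array([valid_len], ctx=ctx),    # valid length (skips padding)
    )
    print(pooled.shape)  # (1, 768): the [CLS] representation used by classifiers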
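
Deployment through Amazon SageMaker can go through the SageMaker Python SDK's MXNet serving container. The sketch below is hypothetical: the S3 path, IAM role, and inference.py entry point are placeholder names, and ml.g4dn.xlarge is simply the SageMaker instance type that carries a single T4 GPU.

    from sagemaker.mxnet import MXNetModel

    # All names below are placeholders, not artifacts from the talk.
    model = MXNetModel(
        model_data='s3://my-bucket/bert/model.tar.gz',  # packaged model parameters
        role='arn:aws:iam::123456789012:role/SageMakerRole',
        entry_point='inference.py',  # script with model_fn/transform_fn handlers
        framework_version='1.7.0',
        py_version='py3',
    )

    # One NVIDIA T4 GPU per ml.g4dn.xlarge instance.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type='ml.g4dn.xlarge',
    )

    print(predictor.predict({
        'question': 'What does BERT stand for?',
        'context': 'BERT is Bidirectional Encoder Representations from Transformers.',
    }))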
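
The 20-cent figure can be sanity-checked with back-of-the-envelope arithmetic: at under 4 ms per request, one million requests keep a single T4 busy for roughly 4,000 seconds (about 1.1 hours), and at an assumed g4dn.xlarge Spot price of around $0.18 per hour (an illustrative assumption, not a figure from the talk), the total lands near 20 cents:

    # Back-of-the-envelope cost check; the Spot price is an assumption.
    requests = 1_000_000
    latency_s = 0.004                        # < 4 ms per request, per the talk
    gpu_hours = requests * latency_s / 3600  # ~1.11 hours of busy GPU time
    spot_usd_per_hour = 0.18                 # assumed g4dn.xlarge Spot price
    print(round(gpu_hours * spot_usd_per_hour, 2))  # ~0.20 USD, about 20 cents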

Event: AWS re:Invent
Date: November 2020
Level: Beginner technical
Industry: Cloud Services
Topic: Deep Learning Inference - Optimization and Deployment
Languages: English, Japanese, Korean, Traditional Chinese
Location: