Serving 1 Million BERT Inference Requests for only 20 cents on AWS Cloud

, NVIDIA

Attention-based models like BERT have revolutionized natural language processing (NLP) thanks to their ability to outperform traditional models on language tasks, as shown by their high scores on various NLP benchmarks. However, even the smaller BERT models have more than 100 million parameters, making it difficult to achieve near-real-time inference speeds on generic compute hardware. GPUs have generally outperformed CPUs for BERT inference, but they typically cost more than CPU instances on AWS. The newer Tensor Core GPUs have proven more cost-effective and efficient for running inference workloads.

In this talk, we present a solution for performing inference with the popular BERT model in under 4 ms using NVIDIA T4 GPUs on AWS EC2 G4 instances. We cover specific optimizations of model layers, such as Softmax, bias-term addition, Gaussian Error Linear Units (GELU), and multi-head attention, that significantly accelerate BERT inference performance. Our solution improves the performance of NLP tasks like question answering and of classification tasks like sentiment analysis and domain classification. All of this work has been implemented in the Apache MXNet and GluonNLP frameworks and is available in the latest MXNet release.

Lastly, we cover how a user can leverage Amazon SageMaker to deploy the optimized BERT model and serve one million BERT inference requests for less than 20 cents on AWS.
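
To make the layer-optimization discussion concrete: one widely used trick in fast GELU kernels is to replace the exact erf-based definition with a tanh approximation, which fuses cheaply into surrounding GPU kernels. The sketch below (plain NumPy/SciPy, not the talk's actual CUDA code) shows both forms and their close agreement; it illustrates the technique only.

    import numpy as np
    from scipy.special import erf

    def gelu_exact(x):
        # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
        return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

    def gelu_tanh(x):
        # Tanh approximation commonly used in fused inference kernels.
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    x = np.linspace(-5.0, 5.0, 101)
    # Max difference is on the order of 1e-3, acceptable for inference.
    print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))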
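
Because the work ships through Apache MXNet and GluonNLP, loading the pretrained BERT-base model follows the standard GluonNLP 0.x model-zoo API. The sketch below is an assumed outline of such a setup, not the talk's exact benchmark harness; the model and dataset names come from the public model zoo.

    import mxnet as mx
    import gluonnlp as nlp

    ctx = mx.gpu(0)  # an NVIDIA T4 on a G4 instance

    # Pretrained 12-layer BERT-base from the GluonNLP model zoo.
    bert, vocab = nlp.model.get_model(
        'bert_12_768_12',
        dataset_name='book_corpus_wiki_en_uncased',
        pretrained=True,
        ctx=ctx,
        use_pooler=True,
        use_decoder=False,
        use_classifier=False,
    )

    tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
    transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=128, pair=False)

    token_ids, valid_len, segment_ids = transform(('is this review positive?',))
    seq_encoding, pooled = bert(
        mx.nd.array([token_ids], ctx=ctx),    # token ids, batch of 1
        mx.nd.array([segment_ids], ctx=ctx),  # segment ids
        mx.nd.array([valid_len], ctx=ctx),    # valid length (skips padding)
    )
    print(pooled.shape)  # (1, 768): the [CLS] representation used by classifiers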
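
Deployment through Amazon SageMaker can go through the SageMaker Python SDK's MXNet serving container. The sketch below is hypothetical: the S3 path, IAM role, and inference.py entry point are placeholder names, and ml.g4dn.xlarge is simply the SageMaker instance type that carries a single T4 GPU.

    from sagemaker.mxnet import MXNetModel

    # All names below are placeholders, not artifacts from the talk.
    model = MXNetModel(
        model_data='s3://my-bucket/bert/model.tar.gz',  # packaged model parameters
        role='arn:aws:iam::123456789012:role/SageMakerRole',
        entry_point='inference.py',  # script with model_fn/transform_fn handlers
        framework_version='1.7.0',
        py_version='py3',
    )

    # One NVIDIA T4 GPU per ml.g4dn.xlarge instance.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type='ml.g4dn.xlarge',
    )

    print(predictor.predict({
        'question': 'What does BERT stand for?',
        'context': 'BERT is Bidirectional Encoder Representations from Transformers.',
    }))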
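
The 20-cent figure can be sanity-checked with back-of-the-envelope arithmetic: at under 4 ms per request, one million requests keep a single T4 busy for roughly 4,000 seconds (about 1.1 hours), and at an assumed g4dn.xlarge Spot price of around $0.18 per hour (an illustrative assumption, not a figure from the talk), the total lands near 20 cents:

    # Back-of-the-envelope cost check; the Spot price is an assumption.
    requests = 1_000_000
    latency_s = 0.004                        # < 4 ms per request, per the talk
    gpu_hours = requests * latency_s / 3600  # ~1.11 hours of busy GPU time
    spot_usd_per_hour = 0.18                 # assumed g4dn.xlarge Spot price
    print(round(gpu_hours * spot_usd_per_hour, 2))  # ~0.20 USD, about 20 cents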

Event: AWS re:Invent
Date: November 2020
Level: Beginner technical
Industry: Cloud Services
Topic: Deep Learning Inference - Optimization and Deployment
Languages: English, Japanese, Korean, Traditional Chinese
Location: