Auto48: A General Framework for Automatic Model Compression and Acceleration using Int4/Int8 Mixed Precision

, SW Engineer, Tencent
, SW, NVIDIA
BERT is widely used in the WeChat search business for tasks such as query understanding and ranking. We leveraged NVIDIA's PyTorch-Quantization toolkit to quantize BERT without losing accuracy, and deployed the quantized BERT in WeChat search with TensorRT. Inference is 6-8x faster than the implementation in the original framework. This motivated us to seek more aggressive quantization methods to improve performance further. As a natural extension of that work, we propose Auto48, an automatic tool featuring mixed 4-bit and 8-bit quantization. The mixed-precision BERT models produced by Auto48 show around a 30% further performance improvement in some cases, again without losing accuracy. Some of the low-precision kernels were implemented with CUTLASS. We'll integrate these improvements with TensorRT to reduce the cost of GPU inference in the future.
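As a rough illustration of the post-training quantization workflow mentioned above, the sketch below patches a BERT model with NVIDIA's pytorch-quantization toolkit and runs a calibration pass. It assumes the Hugging Face `transformers` library, the public `bert-base-uncased` checkpoint, and a hypothetical `calib_loader` of tokenized batches; Auto48's automatic int4/int8 precision assignment is not shown here.

```python
# Minimal post-training int8 quantization sketch with pytorch-quantization.
# Assumptions (not from the talk): Hugging Face `transformers`, the public
# bert-base-uncased checkpoint, and a user-supplied `calib_loader` yielding
# dicts of tokenized tensors. Auto48's int4/int8 assignment is NOT shown.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from transformers import BertModel

# Replace torch.nn layers (e.g. Linear) with quantized counterparts that carry
# TensorQuantizer modules; any model constructed afterwards picks them up.
quant_modules.initialize()

model = BertModel.from_pretrained("bert-base-uncased").eval().cuda()

def set_calibration(model, enabled):
    """Toggle every TensorQuantizer between calibration and quantization mode."""
    for _, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if enabled:
                module.disable_quant()
                module.enable_calib()
            else:
                module.enable_quant()
                module.disable_calib()

# 1) Collect activation statistics on a small calibration set.
set_calibration(model, True)
with torch.no_grad():
    for batch in calib_loader:  # hypothetical iterable of tokenized batches
        model(**{k: v.cuda() for k, v in batch.items()})

# 2) Compute clipping ranges (amax) from the collected statistics, then switch
#    the quantizers back to fake-quantization mode. (Histogram calibrators
#    would need a method argument, e.g. percentile; max calibration does not.)
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.load_calib_amax()
set_calibration(model, False)

# The fake-quantized model can now be fine-tuned (QAT) or exported to ONNX
# and built into a TensorRT engine for int8 inference.
```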
Event: GTC Digital Spring
Date: March 2022
Industry: Consumer Internet
Topic: Deep Learning - Frameworks
Level: Intermediate Technical
Language: English
Location: