Auto48: A General Framework for Automatic Model Compression and Acceleration using Int4/Int8 Mixed Precision

, SW Engineer, Tencent
, SW, NVIDIA
BERT is widely used in the WeChat search business for tasks such as query understanding and ranking. We leveraged NVIDIA's PyTorch-Quantization toolkit to quantize BERT without losing accuracy, and deployed the quantized BERT in WeChat search with TensorRT. Inference is 6-8x faster than the implementation in the original framework. This motivated us to seek more aggressive quantization methods to improve performance further. As a natural extension of that work, we propose Auto48, an automatic tool featuring mixed 4-bit and 8-bit quantization. The mixed-precision BERT models produced by Auto48 show around a 30% further performance improvement in some cases, again without losing accuracy. Some of the low-precision kernels were implemented with CUTLASS. We'll integrate these improvements with TensorRT to reduce the cost of GPU inference in the future.
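As a rough illustration of the post-training quantization workflow mentioned above, the sketch below patches a BERT model with NVIDIA's pytorch-quantization toolkit and runs a calibration pass. It assumes the Hugging Face `transformers` library, the public `bert-base-uncased` checkpoint, and a hypothetical `calib_loader` of tokenized batches; Auto48's automatic int4/int8 precision assignment is not shown here.

```python
# Minimal post-training int8 quantization sketch with pytorch-quantization.
# Assumptions (not from the talk): Hugging Face `transformers`, the public
# bert-base-uncased checkpoint, and a user-supplied `calib_loader` yielding
# dicts of tokenized tensors. Auto48's int4/int8 assignment is NOT shown.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from transformers import BertModel

# Replace torch.nn layers (e.g. Linear) with quantized counterparts that carry
# TensorQuantizer modules; any model constructed afterwards picks them up.
quant_modules.initialize()

model = BertModel.from_pretrained("bert-base-uncased").eval().cuda()

def set_calibration(model, enabled):
    """Toggle every TensorQuantizer between calibration and quantization mode."""
    for _, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if enabled:
                module.disable_quant()
                module.enable_calib()
            else:
                module.enable_quant()
                module.disable_calib()

# 1) Collect activation statistics on a small calibration set.
set_calibration(model, True)
with torch.no_grad():
    for batch in calib_loader:  # hypothetical iterable of tokenized batches
        model(**{k: v.cuda() for k, v in batch.items()})

# 2) Compute clipping ranges (amax) from the collected statistics, then switch
#    the quantizers back to fake-quantization mode. (Histogram calibrators
#    would need a method argument, e.g. percentile; max calibration does not.)
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.load_calib_amax()
set_calibration(model, False)

# The fake-quantized model can now be fine-tuned (QAT) or exported to ONNX
# and built into a TensorRT engine for int8 inference.
```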
Event: GTC Digital Spring
Date: March 2022
Industry: Consumer Internet
Topic: Deep Learning - Frameworks
Level: Intermediate Technical
Language: English
Location: