Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT
, Deep Learning Software Engineer, NVIDIA
, Senior Deep Learning Software Engineer, NVIDIA
Converting floating-point deep neural networks to INT8 precision can significantly reduce inference time and memory footprint. This can be done with either post-training quantization (PTQ) or quantization-aware training (QAT). PTQ uses a calibration dataset to quantize the model after training, which can degrade accuracy because quantization is not reflected in the training process. QAT, on the other hand, is better at preserving accuracy: it introduces “quantization and de-quantization (QDQ)” nodes that simulate lower precision around selected layers during training or fine-tuning. We'll describe how to add QDQ nodes to a TensorFlow 2.0 model, fine-tune it, and convert it into a TensorRT engine via ONNX, as well as how to convert a PyTorch QAT model into a TorchScript model and deploy it with TensorRT. With our QAT-based TensorRT deployment strategies, we achieve significant latency reductions with minimal impact on accuracy.
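As a rough illustration of the TensorFlow 2 side of this workflow, the sketch below uses the TensorFlow Model Optimization Toolkit to insert fake-quantize (quantize/de-quantize) ops into a Keras model before fine-tuning, then saves the model for conversion to ONNX and an INT8 TensorRT engine. This is a minimal stand-in, not the exact toolkit used in the session (NVIDIA also provides a TensorRT-oriented TensorFlow quantization toolkit); the toy model, random data, file names, and command lines in the comments are assumptions.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Small stand-in for a trained FP32 network; in practice, load your own model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the model so quantize/de-quantize simulation ops are inserted around
# weights and activations, then fine-tune so the network adapts to INT8.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer=tf.keras.optimizers.SGD(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Toy data for illustration; replace with the real training set.
images = tf.random.uniform([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=10, dtype=tf.int32)
qat_model.fit(images, labels, batch_size=8, epochs=1)

# Save, then (outside Python) convert to ONNX and build an INT8 engine, e.g.:
#   python -m tf2onnx.convert --saved-model model_qat_saved --output model_qat.onnx --opset 13
#   trtexec --onnx=model_qat.onnx --int8 --saveEngine=model_qat.engine
qat_model.save("model_qat_saved")
```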
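The PyTorch path described above can be sketched in a similar spirit, assuming NVIDIA's pytorch-quantization toolkit and Torch-TensorRT are installed. The calibration and fine-tuning loop is elided, and the model, input shape, and version-specific flags are assumptions rather than the session's exact recipe.

```python
import torch
import torchvision
from pytorch_quantization import quant_modules, quant_nn
import torch_tensorrt

# Monkey-patch torch.nn layers with quantized equivalents so that models built
# afterwards carry QDQ (fake-quantization) modules on weights and activations.
quant_modules.initialize()
model = torchvision.models.resnet50(pretrained=True).eval().cuda()

# ... calibrate and fine-tune the model here (QAT) ...

# Make the quantizer modules emit fake-quant ops that downstream tooling can
# recognize as QDQ nodes (version-dependent; an assumption in this sketch).
quant_nn.TensorQuantizer.use_fb_fake_quant = True

# Trace to TorchScript, then compile the graph with Torch-TensorRT into an
# INT8 engine. Input shape and precision set are illustrative.
example = torch.randn(1, 3, 224, 224).cuda()
scripted = torch.jit.trace(model, example)
trt_model = torch_tensorrt.compile(
    scripted,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.int8},
)

# Run inference with the compiled module.
with torch.no_grad():
    output = trt_model(example)
```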