Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT
, Deep Learning Software Engineer, NVIDIA
, Senior Deep Learning Software Engineer, NVIDIA
Converting floating-point deep neural networks to INT8 precision can significantly reduce inference time and memory footprint. This can be done with either post-training quantization (PTQ) or quantization-aware training (QAT). PTQ uses a calibration dataset to quantize the model after training, which can degrade accuracy because quantization is not reflected in the training process. QAT, on the other hand, is better at preserving accuracy: it introduces “quantization and de-quantization (QDQ)” nodes that simulate lower precision around selected layers during training or fine-tuning. We'll describe how to add QDQ nodes to a TensorFlow 2.0 model, fine-tune it, and convert it into a TensorRT engine via ONNX, as well as how to convert a PyTorch QAT model into a TorchScript model and deploy it with TensorRT. With our QAT-based TensorRT deployment strategies, we achieve significant latency reductions with minimal impact on accuracy.
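As a rough illustration of the TensorFlow 2 side of this workflow, the sketch below uses the TensorFlow Model Optimization Toolkit to insert fake-quantize (quantize/de-quantize) ops into a Keras model before fine-tuning, then saves the model for conversion to ONNX and an INT8 TensorRT engine. This is a minimal stand-in, not the exact toolkit used in the session (NVIDIA also provides a TensorRT-oriented TensorFlow quantization toolkit); the toy model, random data, file names, and command lines in the comments are assumptions.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Small stand-in for a trained FP32 network; in practice, load your own model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the model so quantize/de-quantize simulation ops are inserted around
# weights and activations, then fine-tune so the network adapts to INT8.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer=tf.keras.optimizers.SGD(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Toy data for illustration; replace with the real training set.
images = tf.random.uniform([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=10, dtype=tf.int32)
qat_model.fit(images, labels, batch_size=8, epochs=1)

# Save, then (outside Python) convert to ONNX and build an INT8 engine, e.g.:
#   python -m tf2onnx.convert --saved-model model_qat_saved --output model_qat.onnx --opset 13
#   trtexec --onnx=model_qat.onnx --int8 --saveEngine=model_qat.engine
qat_model.save("model_qat_saved")
```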
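The PyTorch path described above can be sketched in a similar spirit, assuming NVIDIA's pytorch-quantization toolkit and Torch-TensorRT are installed. The calibration and fine-tuning loop is elided, and the model, input shape, and version-specific flags are assumptions rather than the session's exact recipe.

```python
import torch
import torchvision
from pytorch_quantization import quant_modules, quant_nn
import torch_tensorrt

# Monkey-patch torch.nn layers with quantized equivalents so that models built
# afterwards carry QDQ (fake-quantization) modules on weights and activations.
quant_modules.initialize()
model = torchvision.models.resnet50(pretrained=True).eval().cuda()

# ... calibrate and fine-tune the model here (QAT) ...

# Make the quantizer modules emit fake-quant ops that downstream tooling can
# recognize as QDQ nodes (version-dependent; an assumption in this sketch).
quant_nn.TensorQuantizer.use_fb_fake_quant = True

# Trace to TorchScript, then compile the graph with Torch-TensorRT into an
# INT8 engine. Input shape and precision set are illustrative.
example = torch.randn(1, 3, 224, 224).cuda()
scripted = torch.jit.trace(model, example)
trt_model = torch_tensorrt.compile(
    scripted,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.int8},
)

# Run inference with the compiled module.
with torch.no_grad():
    output = trt_model(example)
```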