How Hugging Face Delivers 1 Millisecond Inference Latency for Transformers in Infinity

, Hugging Face
Learn how Hugging Face achieved 1 millisecond Transformers inference for customers of its new Infinity solution.

Transformers models conquered natural language processing with breakthrough accuracy. But these large models and complex architectures are too challenging for most companies to put into production with enough performance to power real-time experiences like semantic search, or large workloads like frequent text classification over large datasets.

Introducing Infinity: a containerized solution for deploying end-to-end optimized inference pipelines that achieve unprecedented single-digit-millisecond latency on models like BERT, leveraging the state-of-the-art libraries Triton Inference Server, TensorRT, and ONNX Runtime. Learn how Infinity pilot customers were able to easily and securely deploy Transformers models with up to 100x acceleration within their own cloud and on-premises infrastructures, enabling new use cases and lowering production costs.
Event: GTC Digital November
Date: November 2021
Industry: Consumer Internet
Topic: Conversational AI / NLP
Level: Intermediate Technical
Language: English
Topic: Deep Learning - Inference
Location: