How Hugging Face Delivers 1 Millisecond Inference Latency for Transformers in Infinity

, Hugging Face
Learn how Hugging Face achieved 1 millisecond Transformers inference for customers of its new Infinity solution.

Transformers models conquered natural language processing with breakthrough accuracy. But these large models and complex architectures are too challenging for most companies to put into production with enough performance to power real-time experiences like semantic search, or large workloads like frequent text classification over large datasets.

Introducing Infinity: a containerized solution for deploying end-to-end optimized inference pipelines that achieve unprecedented single-digit-millisecond latency on models like BERT, leveraging the state-of-the-art libraries Triton Inference Server, TensorRT, and ONNX Runtime. Learn how Infinity pilot customers were able to easily and securely deploy Transformers models with up to 100x acceleration within their own cloud and on-premises infrastructures, enabling new use cases and lowering production costs.
Event: GTC Digital November
Date: November 2021
Industry: Consumer Internet
Topic: Conversational AI / NLP
Level: Intermediate Technical
Language: English
Topic: Deep Learning - Inference
Location: