Accelerate ETL and Machine Learning in Apache Spark

Senior Manager, Distributed Machine Learning, NVIDIA
Senior Director of Engineering, NVIDIA
Spark SQL and Spark MLlib are the two most popular components of Apache Spark, used for massively scaling extract-transform-load (ETL) and classical machine learning (ML) training and inference workloads. We'll discuss the RAPIDS Accelerator for Apache Spark ETL and MLlib. We'll demonstrate the performance of RAPIDS Spark for ETL on an industry-standard benchmark at a scale factor of 100 TB, and share results both in the cloud and on-premises using standard hardware. We'll also demonstrate results on a cluster of Grace Hopper nodes.
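
As a rough illustration of how the RAPIDS Accelerator is typically switched on for Spark SQL ETL jobs, the sketch below assembles the relevant Spark settings in plain Python. The configuration keys are standard Spark and RAPIDS Accelerator settings. The helper names (`rapids_spark_conf`, `to_submit_args`) and the default values are assumptions chosen for illustration and are not taken from the talk. The sketch also assumes the RAPIDS Accelerator jar is already on the Spark classpath.

```python
# Sketch only: builds Spark settings that enable the RAPIDS Accelerator
# SQL plugin. Helper names and default values are hypothetical.

def rapids_spark_conf(gpus_per_executor: int = 1, tasks_per_gpu: int = 4) -> dict:
    """Return Spark configuration entries for GPU-accelerated Spark SQL."""
    return {
        # Load the RAPIDS Accelerator plugin into the driver and executors.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow supported SQL operators to be replaced with GPU versions.
        "spark.rapids.sql.enabled": "true",
        # Standard Spark GPU scheduling: GPUs per executor, GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf: dict) -> list:
    """Flatten a configuration dict into spark-submit style arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(rapids_spark_conf())))
```

Because the plugin is enabled purely through configuration, an existing Spark SQL or DataFrame job would in principle not need source changes to use it. Operators the plugin does not support remain on the CPU.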

Then we'll introduce spark-rapids-ml, a new open-source Python package that enables GPU acceleration of Spark MLlib's PySpark API. We'll provide an overview of its latest capabilities, explain how users can leverage it in their Spark ML applications with essentially no code changes, delve into its design and architecture, and present estimated cost benefits approaching 40x for the most computationally demanding ML algorithms and datasets.
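
To make the "essentially no code changes" point concrete, here is a minimal sketch of swapping a PySpark MLlib estimator for its spark-rapids-ml counterpart by changing only the import. It assumes that `spark_rapids_ml.clustering` exposes a `KMeans` class mirroring `pyspark.ml.clustering.KMeans`. The helpers `kmeans_module` and `load_kmeans` are hypothetical names introduced here, and the commented usage assumes a DataFrame `df` with a `features` column.

```python
# Sketch only: select the CPU or GPU KMeans implementation by module name.
# kmeans_module and load_kmeans are hypothetical helpers, not library APIs.
import importlib


def kmeans_module(use_gpu: bool) -> str:
    """Return the module expected to provide the KMeans estimator."""
    return "spark_rapids_ml.clustering" if use_gpu else "pyspark.ml.clustering"


def load_kmeans(use_gpu: bool):
    """Import and return the KMeans class from the selected module."""
    return getattr(importlib.import_module(kmeans_module(use_gpu)), "KMeans")


# Usage (requires a Spark session and, for the GPU path, spark-rapids-ml):
#   KMeans = load_kmeans(use_gpu=True)
#   model = KMeans().setK(8).setFeaturesCol("features").fit(df)
#   clustered = model.transform(df)
```

The rest of the application is the same on either path. In practice the swap is usually written directly as a one-line import change rather than through a helper like this.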
Event: GTC 24
Date: March 2024
Industry: All Industries
NVIDIA Technology: Cloud / Data Center GPU, RAPIDS
Topic: ETL Processing
Level: Intermediate Technical
Language: English
Location: