CUTLASS: Python API, Enhancements, and NVIDIA Hopper

, NVIDIA
, Principal Compute Architect, NVIDIA
Highly Rated
The latest release of CUTLASS delivers a new Python API for designing, JIT compiling, and launching optimized matrix computations from a Python environment. The functionality of CUTLASS has also been extended to include grouped and depthwise separable convolution, fused kernels for layernorm and multihead attention, and optimizations to grouped GEMM. Additionally, CUTLASS 2.11 takes advantage of new features on NVIDIA's Hopper architecture, including 2x faster FP64 Tensor Cores and FP8 numerical conversion. We'll describe implementation details of these computations and optimization techniques for achieving peak performance. We'll also provide a preview of CUTLASS 3.0, which offers an enhanced programming model for implementing tensor computations using CUDA.
Event: GTC Digital September
Date: September 2022
Topic: Accelerated Computing & Dev Tools - Performance Optimization
Level: Advanced Technical
Industry: All Industries
Language: English
Topic: Accelerated Computing & Dev Tools - Programming Languages / Compilers