As raw compute throughput (FLOPS) and memory bandwidth increase on the latest GPUs, it becomes harder for applications that launch large numbers of lightweight kernels to saturate the GPU's compute resources. We'll present the challenges we faced when adapting Desmond, the state-of-the-art code for performing molecular dynamics simulations for drug discovery, to the latest GPUs, and show how we used various CUDA features to overcome them.
Topics we'll cover include: • Employing CUDA graphs in dynamic environments to amortize CUDA kernel and CUDA API launch overheads; • Using mapped memory to speed up data transfers between the GPU and CPU; and • Using coroutines to delay GPU synchronization and reduce GPU idle time.
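To make the first two topics concrete, here is a minimal sketch. It is not Desmond's code: the kernels `add_one` and `sum_to_mapped`, the helper `run_demo`, and all sizes are hypothetical names and values chosen for illustration. It assumes CUDA 11.4 or later (for `cudaGraphInstantiateWithFlags`). A batch of lightweight kernels is captured into a CUDA graph once and replayed with a single launch call per replay, and the result is written directly into mapped (zero-copy) pinned host memory, so no explicit device-to-host copy is issued.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return 1 from the enclosing function on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical lightweight kernel: adds 1 to every element.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical reduction: one thread sums x and writes the result
// straight into mapped host memory through its device alias.
__global__ void sum_to_mapped(const float* x, int n, float* out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += x[i];
        *out = s;
    }
}

int run_demo() {
    const int n = 1024, kernels_per_graph = 8, replays = 4;
    const int threads = 256, blocks = (n + threads - 1) / threads;

    float* d_x = nullptr;
    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    // Mapped pinned host memory: the GPU can write it directly.
    float* h_sum = nullptr;
    float* d_sum = nullptr;
    CHECK(cudaHostAlloc(&h_sum, sizeof(float), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer(&d_sum, h_sum, 0));
    *h_sum = 0.0f;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaDeviceSynchronize());  // finish setup before capturing

    // Record the kernel sequence once instead of launching it eagerly.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels_per_graph; ++k)
        add_one<<<blocks, threads, 0, stream>>>(d_x, n);
    sum_to_mapped<<<1, 1, 0, stream>>>(d_x, n, d_sum);
    CHECK(cudaStreamEndCapture(stream, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Each replay costs one launch call for all nine kernels.
    for (int r = 0; r < replays; ++r)
        CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));  // result now visible on host

    // Every element was incremented kernels_per_graph * replays times.
    const float expected = float(n) * kernels_per_graph * replays;
    printf("sum = %.0f (expected %.0f)\n", *h_sum, expected);
    const int status = (*h_sum == expected) ? 0 : 1;

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(h_sum));
    CHECK(cudaFree(d_x));
    return status;
}

int main() { return run_demo(); }
```

The sketch synchronizes the stream immediately after the last replay for simplicity. The third topic, deferring that synchronization with coroutines so the CPU can do other work in the meantime, is not shown here.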