Organizing computation as asynchronous tasks with data-driven dependencies is a simple and efficient model for single- and multi-GPU programs. Sequential Task Flow (STF) is one such model: tasks are submitted in program order, and the task graph is derived from their data dependencies.
We propose CUDASTF, a C++ library that implements STF over CUDA APIs and eases the creation of scalable and composable algorithms. Users can elect to use CUDA Graphs instead of streams, which improves the performance of small kernels. Structured kernels are automatically spread over multiple devices and can exercise fine-grained affinity control. On the implementation side, CUDASTF makes a compelling argument for an event-based approach to asynchronous parallel libraries.
We obtain up to a 1.8x improvement over the cuSolverMg library on Cholesky decomposition. On a small weather simulation task, we demonstrate near-optimal scalability of our multi-GPU kernels, and on a single GPU, CUDA Graphs improve performance by up to 30%. Finally, we author the first implementation of the CKKS fully homomorphic encryption scheme over multiple devices.
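To make the STF idea concrete, the following is a minimal host-only sketch of how a task graph can be derived from data dependencies when tasks are submitted in program order. It is not the CUDASTF API: `TaskFlow`, `DataUse`, `Access`, `submit`, and `predecessors` are hypothetical names introduced only for illustration, and the sketch assumes each task declares the logical data it touches together with an access mode. A read depends on the last writer of that data; a write depends on every reader since the last write, or on the last writer if there were none.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration types; none of these names come from CUDASTF.
enum class Access { Read, Write, ReadWrite };

struct DataUse {
    std::string data;  // name of a logical piece of data
    Access mode;       // how the task accesses it
};

class TaskFlow {
public:
    // Submit a task in program order and return its id.
    // Dependencies are inferred from the declared data accesses.
    std::size_t submit(const std::vector<DataUse>& uses) {
        const std::size_t id = deps_.size();
        deps_.emplace_back();
        for (const DataUse& u : uses) {
            State& s = state_[u.data];
            if (u.mode == Access::Read) {
                // Read-after-write: wait for the last writer, if any.
                if (s.has_writer) deps_[id].insert(s.last_writer);
                s.readers.push_back(id);
            } else {
                if (s.readers.empty()) {
                    // Write-after-write: wait for the previous writer.
                    if (s.has_writer) deps_[id].insert(s.last_writer);
                } else {
                    // Write-after-read: wait for all readers since the last
                    // write; they already depend on the previous writer.
                    deps_[id].insert(s.readers.begin(), s.readers.end());
                }
                s.has_writer = true;
                s.last_writer = id;
                s.readers.clear();
            }
        }
        deps_[id].erase(id);  // a task never depends on itself
        return id;
    }

    // Ids of the tasks that must complete before `task` may start.
    const std::set<std::size_t>& predecessors(std::size_t task) const {
        assert(task < deps_.size());
        return deps_[task];
    }

private:
    struct State {
        bool has_writer = false;
        std::size_t last_writer = 0;
        std::vector<std::size_t> readers;  // readers since the last write
    };
    std::unordered_map<std::string, State> state_;
    std::vector<std::set<std::size_t>> deps_;
};
```

In this sketch, two tasks that only read the same data get no edge between them and may run concurrently, while a later task that modifies that data waits for both. A real implementation would additionally map each edge to a synchronization mechanism such as events and would handle data placement, which this sketch omits.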