I am currently a member of NVIDIA Research, where I lead the Programming Systems and Applications Research Group. Prior to joining NVIDIA, I was an assistant professor in the Department of Computer Science of the University of Illinois at Urbana-Champaign. I graduated with my Ph.D. from the Computer Science Department of Carnegie Mellon University.
Deep neural networks (DNNs) frequently contain far more weights, represented at a higher precision, than are required for the specific task, which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both the model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this article, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and user-provided objective (such as minimization of running time), Condensa uses a novel Bayesian optimization-based algorithm to automatically infer desirable sparsities. Our experiments on four real-world DNNs demonstrate memory footprint and hardware runtime throughput improvements of 188x and 2.59x, respectively, using at most ten samples per search.
NumPy is a popular Python library used for performing array-based numerical computations. The canonical implementation of NumPy used by most programmers runs on a single CPU core and only a few operations are parallelized across cores. This restriction to single-node CPU-only execution limits both the size of data that can be processed and the speed with which problems can be solved. In this paper we introduce Legate, a programming system that transparently accelerates and distributes NumPy programs to machines of any scale and capability typically by changing a single module import statement. Legate achieves this by translating the NumPy application interface into the Legion programming model and leveraging the performance and scalability of the Legion runtime. We demonstrate that Legate can achieve state-of-the-art scalability when running NumPy programs on machines with up to 1280 CPU cores and 256 GPUs, allowing users to prototype on their desktop and immediately scale up to significantly larger machines. Furthermore, we demonstrate that Legate can achieve between one and two orders of magnitude better performance than the popular Python library Dask Array when running comparable programs at scale.
We designed and implemented a CUDA port of the Atari Learning Environment (ALE), a system for developing and evaluating deep reinforcement algorithms using Atari games. Our CUDA Learning Environment (CuLE) overcomes many limitations of existing CPU-based Atari emulators and scales naturally to multi-GPU systems. It leverages the parallelization capability of GPUs to run thousands of Atari games simultaneously; by rendering frames directly on the GPU, CuLE avoids the bottleneck arising from the limited CPU-GPU communication bandwidth. As a result, CuLE is able to generate between 40M and 190M frames per hour using a single GPU, a finding that could be previously achieved only through a cluster of CPUs. We demonstrate the advantages of CuLE by effectively training agents with traditional deep reinforcement learning algorithms and measuring the utilization and throughput of the GPU. Our analysis further highlights the differences in the data generation pattern for emulators running on CPUs or GPUs. CuLE is available at https://github.com/NVlabs/cule.
Read more on my complete list of publications.