(a) Programming inverse memory hierarchy: case of stencils on GPUs
(b) Understanding performance bottlenecks in numerical kernels on GPUs
Vasily Volkov
2010-05-21
14:20:00 - 17:20:00
401, Freshman Classroom Building
Efficient management of memory resources is one of the central problems in high-performance computing. It means efficient message passing in the distributed-memory model, resolving data races in the shared-memory model, and reducing bandwidth requirements with blocks or tiles in the multi-tier memory model. In this respect, novel memory architectures such as those found on emerging many-core processors pose a new challenge. For example, the NVIDIA Fermi processor has a large 2 MB register file, a smaller 1 MB L1 cache/shared memory, and an even smaller 768 KB L2 cache. This is the inverse of the well-understood processor architectures, which have a small register file on the order of one kilobyte, dozens of kilobytes of L1 cache, and a few megabytes of L2 cache. Programming such an architecture may require novel algorithms if high efficiency is the goal.

In this talk I discuss programming the G80 and GT200 processors, which have two tiers of inverse memory hierarchy: a large register file and a smaller shared memory. I argue that blocks in tiled algorithms on such systems should be stored in the upper, larger levels of the memory hierarchy, such as registers, rather than in the traditionally used lower, smaller levels, such as caches or local stores. I consider in detail the design of high-performance finite difference kernels such as 7- and 27-point Jacobi and red-black Gauss-Seidel 3D stencils. I show that register-based algorithms yield 2x speedups over other highly optimized solutions run on the same processors, such as those based on texture caches.
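To make the register-based idea concrete, here is a minimal CPU-side sketch in C++ of a 7-point Jacobi sweep. It is not the speaker's kernel. It only illustrates one common way to keep stencil data in registers: march along z and carry the column's values in scalar variables, so each value along the column is loaded once and reused three times. The names `idx` and `jacobi7_sweep`, the coefficients `c0` and `c1`, and the x-fastest memory layout are all assumptions made for this illustration. On a GPU, each (x, y) column would typically map to one thread, and the lateral neighbours would come from shared memory or a cache rather than from the array directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flat index into an nx*ny*nz grid, x varying fastest.
inline std::size_t idx(int x, int y, int z, int nx, int ny) {
    return (static_cast<std::size_t>(z) * ny + y) * nx + x;
}

// One Jacobi sweep of an assumed 7-point stencil:
//   out = c0 * center + c1 * (sum of the six face neighbours)
// Interior points are updated; boundary points are copied unchanged.
void jacobi7_sweep(const std::vector<float>& in, std::vector<float>& out,
                   int nx, int ny, int nz, float c0, float c1) {
    assert(in.size() == static_cast<std::size_t>(nx) * ny * nz);
    out = in;  // preserves boundary values
    if (nx < 3 || ny < 3 || nz < 3) return;  // no interior points

    for (int y = 1; y < ny - 1; ++y) {
        for (int x = 1; x < nx - 1; ++x) {
            // These scalars stand in for one GPU thread's registers:
            // the values below and at the current z position of the column.
            float below = in[idx(x, y, 0, nx, ny)];
            float center = in[idx(x, y, 1, nx, ny)];
            for (int z = 1; z < nz - 1; ++z) {
                // Only one new value along the column is loaded per step.
                float above = in[idx(x, y, z + 1, nx, ny)];
                float lateral = in[idx(x - 1, y, z, nx, ny)] +
                                in[idx(x + 1, y, z, nx, ny)] +
                                in[idx(x, y - 1, z, nx, ny)] +
                                in[idx(x, y + 1, z, nx, ny)];
                out[idx(x, y, z, nx, ny)] =
                    c0 * center + c1 * (below + above + lateral);
                // Shift the register window one step up the column.
                below = center;
                center = above;
            }
        }
    }
}
```

Calling `jacobi7_sweep(in, out, nx, ny, nz, c0, c1)` with `c0 + 6 * c1 == 1` leaves a constant field unchanged, which is a quick sanity check for the indexing.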