Next: 3D Wave Propagation Up: Leader and Clapp: Accelerating Previous: Introduction

General purpose graphics processing units

A General Purpose Graphics Processing Unit (GPGPU) can execute many independent instructions in parallel, far more than is possible on a parallel CPU-based system. GPGPUs are known as SIMD (Single Instruction Multiple Data) devices, as they are capable of running one set of instructions many times over parallel threads. With the release of the programming language CUDA in 2006, Nvidia provided a way of harnessing the power of graphics processing units while requiring only limited knowledge of graphics processing.

Figure 1. A schematic of the architecture of an Nvidia GPU.

The GPU can be thought of as a 2D structure, or grid. This grid is broken into blocks, and each block in turn consists of a group of threads (Figure 1). Each of these thread-blocks has its own shared memory, which can be unique per thread-block, and its own set of registers, which are the same between all thread-blocks. Each thread in this hierarchy can execute a set of instructions (a kernel) concurrently, allowing for fine-grained parallelism. The true potential of the GPGPU architecture lies in how memory latency can be hidden. The GPU partitions resources using registers and shared memory, both of which have a latency of only a few cycles. Mass simultaneous execution effectively hides memory latency, and context switching is (essentially) free. Parallel CPU execution, such as that possible when using OpenMP, MPI or POSIX threads, does not partition resources and features a memory latency at least an order of magnitude higher than the GPU equivalent, making this form of parallelisation vastly less efficient.
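The grid, block and thread hierarchy described above can be illustrated with a minimal CUDA sketch. It is not taken from the original work: the kernel name `scale2d`, the host helper `scale_on_gpu` and the 16 x 16 block size are assumptions chosen for illustration, and error checking of the CUDA calls is omitted for brevity. Each thread derives its own 2D position from its block and thread indices and applies the same instructions to one array element.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: every thread executes this same function on a different element.
// (Illustrative name; not from the original text.)
__global__ void scale2d(float *data, int nx, int ny, float factor)
{
    // Global 2D position of this thread within the grid of blocks.
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up to whole blocks, so guard the array edges.
    if (ix < nx && iy < ny)
        data[iy * nx + ix] *= factor;
}

// Hypothetical host helper: copy to the device, launch, copy back.
// Assumes in.size() == nx * ny.
std::vector<float> scale_on_gpu(const std::vector<float> &in,
                                int nx, int ny, float factor)
{
    std::vector<float> out(in.size());
    size_t bytes = in.size() * sizeof(float);

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, in.data(), bytes, cudaMemcpyHostToDevice);

    // 2D block of threads (assumed size) and enough blocks to cover nx x ny.
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_data, nx, ny, factor);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return out;
}
```

Calling `scale_on_gpu(values, nx, ny, 3.0f)` on an `nx` by `ny` array returns a copy with every element multiplied by 3. The bounds check in the kernel is needed because the grid is sized in whole blocks and may extend past the array.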

A schematic for how thread-block and global memories can interact is shown in Figure 2.

Figure 2. A schematic of the memory hierarchies within the GPU.
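One common way for a thread-block's shared memory to interact with global memory is to stage a tile of global data in shared memory, synchronise, and then compute from the fast copy. The sketch below shows this pattern for a simple three-point average. It is an illustrative assumption, not the scheme used in the original work: the names `smooth3` and `smooth_on_gpu`, the block size of 128 and the zero padding at the array ends are all hypothetical choices, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 128;  // threads per block (assumed value)

// Three-point average; values beyond the array ends are treated as zero.
// Must be launched with blockDim.x == BLOCK.
__global__ void smooth3(const float *in, float *out, int n)
{
    // Per-block shared tile: BLOCK elements plus one halo cell on each side.
    __shared__ float tile[BLOCK + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index in global memory
    int lid = threadIdx.x + 1;                        // index in the tile

    // Each thread copies its own element from global to shared memory.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;

    // The first and last threads of the block also load the halo cells.
    if (threadIdx.x == 0)
        tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    // Wait until the whole tile is loaded before any thread reads from it.
    __syncthreads();

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Hypothetical host helper: copy in, launch, copy the result back.
std::vector<float> smooth_on_gpu(const std::vector<float> &in)
{
    int n = static_cast<int>(in.size());
    size_t bytes = in.size() * sizeof(float);
    std::vector<float> out(in.size());

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;  // round up to whole blocks
    smooth3<<<blocks, BLOCK>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Each input element is read from global memory roughly once per block and then reused from shared memory by up to three neighbouring threads. The `__syncthreads()` call sits outside any conditional so that every thread in the block reaches it.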


2011-05-24