
GPGPU

GPGPUs are a relatively new high-performance computing platform. Current high-end GPGPUs have hundreds of processing units and total memory bandwidth several times greater than that of Intel's and AMD's current CPU architectures. The hardware roadmaps for GPGPUs versus CPUs show this disparity increasing significantly in the near future. GPGPUs connect to the system either through a PCI Express slot on the motherboard or directly through a CPU socket. The PCI Express approach has the advantage of being motherboard agnostic. Connecting directly into a CPU socket allows easier, and often faster, access to system memory, but takes away from the system's total CPU computing power. The most common programming language for GPGPUs is Nvidia's CUDA. An open-standard language, OpenCL, will soon be available on both Nvidia and ATI hardware.

GPGPUs are often referred to as ``streaming processors''. In this context, the streaming paradigm means that the same operation, or set of operations, is performed on a stream of data, a form of Single Instruction Multiple Data (SIMD). The work on a GPGPU is broken into a grid of thread blocks. Ideally, the number of thread blocks is equal to or greater than the number of processing units on the GPGPU. Multiple threads, up to 512 on the latest Nvidia GPGPU, are started in each thread block. The threads help hide memory latency: while one thread is waiting for data, another thread can be executing. Each thread has its own set of registers (the fastest memory), and each thread block has a pool of shared memory, which can be as fast to access as registers. In addition, there is a large pool of global memory (4 GB on the latest Nvidia GPGPU). This global memory has latency roughly two orders of magnitude higher than that of registers or shared memory. Figure 4 shows this two-level parallelism.

Figure 4.
The memory hierarchy of a typical GPGPU. The processing elements are organized as a 2-D grid of thread blocks. Each thread has its own set of registers, each thread block has a pool of shared memory, and all thread blocks can access the global memory. [NR]
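
As a hedged illustration of this two-level parallelism (the kernel and variable names below are hypothetical, not code from this paper), the following minimal CUDA example launches a grid of thread blocks, stages data from global memory into per-block shared memory, and keeps per-thread temporaries in registers:

// Minimal CUDA sketch of the grid/block/thread hierarchy and memory levels.
// All names are illustrative only.
#include <cuda_runtime.h>

#define BLOCK 256

__global__ void scale_kernel(const float *in, float *out, float a, int n)
{
    __shared__ float tile[BLOCK];                      // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // global index, held in a register

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;        // stage global memory into shared memory
    __syncthreads();                                   // make the tile visible to the whole block

    if (i < n)
        out[i] = a * tile[threadIdx.x];                // per-thread work uses registers
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;

    cudaMalloc(&d_in, bytes);                          // global (device) memory
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);

    int blocks = (n + BLOCK - 1) / BLOCK;              // grid of thread blocks
    scale_kernel<<<blocks, BLOCK>>>(d_in, d_out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}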

On the GPGPU, the basic acoustic algorithm shown in algorithm 2 is broken into two parts. Replacing the unrealistic 4-D source volume with the previous (prev), current (pcur), and updated (pnext) wavefields, the CPU portion of the code has a form similar to algorithm 3.
Algorithm 3. Second-order acoustic modeling, host (CPU) portion:

for it = 1 to nt
    launch the GPGPU update kernel to compute pnext from pcur and prev
    swap pointers on prev, pcur, and pnext
end for
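
A rough host-side CUDA sketch of this loop is given below; it is not the authors' implementation, and the function and variable names (propagate, update_kernel, h_pcur_in, and so on) are hypothetical. The key point is that only pointers are swapped between time steps, so the wavefields stay resident in GPGPU global memory.

// Host-side driver corresponding to algorithm 3 (illustrative sketch).
#include <cuda_runtime.h>

// Forward declaration of the GPGPU update kernel (sketched with algorithm 4 below).
__global__ void update_kernel(const float *prev, const float *pcur, float *pnext,
                              int nx, int ny, int nz, float c);

void propagate(const float *h_pcur_in, float *h_pcur_out,
               int nx, int ny, int nz, int nt, float c)
{
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *d_prev, *d_pcur, *d_pnext;

    cudaMalloc(&d_prev,  bytes);                       // wavefields live in GPGPU global memory
    cudaMalloc(&d_pcur,  bytes);
    cudaMalloc(&d_pnext, bytes);
    cudaMemcpy(d_prev, h_pcur_in, bytes, cudaMemcpyHostToDevice);  // initial wavefields
    cudaMemcpy(d_pcur, h_pcur_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                // one thread block per (x, y) tile
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    for (int it = 0; it < nt; it++) {
        // GPGPU update kernel forms pnext from pcur and prev (algorithm 4)
        update_kernel<<<grid, block>>>(d_prev, d_pcur, d_pnext, nx, ny, nz, c);

        // swap pointers on prev, pcur, and pnext instead of copying wavefields
        float *tmp = d_prev;
        d_prev = d_pcur; d_pcur = d_pnext; d_pnext = tmp;
    }
    cudaDeviceSynchronize();
    cudaMemcpy(h_pcur_out, d_pcur, bytes, cudaMemcpyDeviceToHost);  // result back to the host

    cudaFree(d_prev); cudaFree(d_pcur); cudaFree(d_pnext);
}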

The GPGPU update kernel has nearly as simple a form.
Algorithm 4. Second-order acoustic modeling, GPGPU update kernel:

for each grid point assigned to the thread
    compute the spatial derivatives of pcur
    combine the spatial and time derivatives to form pnext
end for
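
A minimal CUDA sketch of such a kernel is shown below. It assumes, for simplicity, a constant velocity with the factor (v*dt/h)^2 folded into a single constant c, and it reads every stencil point from global memory, omitting the shared-memory tiling an optimized implementation would use. The names match the hypothetical host sketch above.

// Illustrative second-order acoustic update kernel (algorithm 4):
// pnext = 2*pcur - prev + c * Laplacian(pcur), with c = (v*dt/h)^2.
// One thread handles one (x, y) column and loops over depth z.
__global__ void update_kernel(const float *prev, const float *pcur, float *pnext,
                              int nx, int ny, int nz, float c)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < 1 || ix >= nx - 1 || iy < 1 || iy >= ny - 1) return;

    for (int iz = 1; iz < nz - 1; iz++) {
        size_t i = (size_t)iz * nx * ny + (size_t)iy * nx + ix;

        // second-order spatial derivatives (Laplacian), read from global memory
        float lap = pcur[i - 1] + pcur[i + 1]
                  + pcur[i - nx] + pcur[i + nx]
                  + pcur[i - (size_t)nx * ny] + pcur[i + (size_t)nx * ny]
                  - 6.0f * pcur[i];

        // combine the spatial and time derivatives to form pnext
        pnext[i] = 2.0f * pcur[i] - prev[i] + c * lap;
    }
}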

GPGPUs are often considered difficult to program. This is partially due to unfamiliarity with the parallelism model and partially due to the need to explicitly manage transfers to and from the GPGPU, and between the different levels of GPGPU memory, in order to achieve maximum performance. For an unoptimized code, this perception has some validity. An optimized GPGPU code, however, most likely contains fewer lines and less complexity than its optimized CPU counterpart. Once the relative simplicity of the GPGPU kernel parallelism and memory hierarchy is understood, one could argue that the GPGPU offers a quicker development cycle.

Looking simply at the total memory bandwidth and raw compute power offered, GPGPUs would seem to be the obvious choice of compute device for RTM. Unfortunately, some limitations of the programming model and memory hierarchy make the choice less clear.

