Memory limitations

As mentioned above, registers and shared memory sit at the top of the GPU memory hierarchy: they provide access roughly two orders of magnitude faster than global memory and play the role that cache plays in CPU architectures. The size limits of registers and shared memory per block are both 16 KB. Shared memory is shared among all the threads in a block, whereas each thread keeps its own copy of its registers. As a result, when we increase the number of threads in a block to achieve better performance, the register budget usually becomes the bottleneck. In the CUDA framework, the maximum supported number of threads per block is 512. If the stencil grows from 9 points to 33 points, the number of registers needed per thread increases from 28 to 55. To keep the full 512 threads per block, the stencil must be smaller than 17 points. For larger stencils, we need to reduce the number of threads per block, sacrificing some computational performance.
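The trade-off between registers per thread and threads per block can be sketched with a few lines of arithmetic. This is only an illustration, not the vendor's occupancy model: it assumes the register limit corresponds to 16,384 32-bit registers per block and that thread counts are rounded down to whole warps of 32 threads. Neither assumption is stated in the text, and the function name is hypothetical.

```python
WARP_SIZE = 32  # assumption: threads are scheduled in warps of 32


def max_threads_per_block(regs_per_thread, regs_per_block=16384, hw_limit=512):
    """Largest block size that fits the register budget.

    regs_per_block=16384 is an assumed register budget; hw_limit=512 is
    the CUDA threads-per-block limit quoted in the text.
    """
    by_registers = regs_per_block // regs_per_thread
    # Round down to a whole number of warps.
    by_registers -= by_registers % WARP_SIZE
    return min(hw_limit, by_registers)


small_stencil = max_threads_per_block(28)  # 9-point stencil, 28 registers
large_stencil = max_threads_per_block(55)  # 33-point stencil, 55 registers
print(small_stencil, large_stencil)
```

Under these assumptions the 9-point stencil can still run 512 threads per block, while the 33-point stencil is limited to 288, which illustrates why larger stencils force smaller blocks.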

Although global memory access is much slower than access to registers and shared memory, the bandwidth of global memory (80 GB/s) is much higher than the bandwidth to host memory over PCI-Express (4 GB/s). However, global memory is usually too small for the large domains of seismic problems. Typically, we need to store four arrays (prev, cur, next, v), allowing a domain of only 250 million points, much smaller than is often needed. For larger domains, we must split the problem over multiple GPUs, creating a bottleneck either over the PCI-Express link or the network.
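The 250-million-point figure is consistent with a simple capacity estimate. The sketch below assumes single-precision (4-byte) values and 4 GB of global memory; neither number is given in the text, and the function names are hypothetical. It also compares how long one array takes to move at the two bandwidths quoted above.

```python
def max_grid_points(mem_bytes, n_arrays=4, bytes_per_value=4):
    """Grid points that fit when n_arrays equally sized arrays share memory."""
    return mem_bytes // (n_arrays * bytes_per_value)


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return n_bytes / bandwidth_bytes_per_s


GLOBAL_MEM_BYTES = 4 * 10**9  # assumption: 4 GB of global memory
points = max_grid_points(GLOBAL_MEM_BYTES)  # prev, cur, next, v
array_bytes = points * 4  # one single-precision array

t_global = transfer_seconds(array_bytes, 80e9)  # global memory, 80 GB/s
t_host = transfer_seconds(array_bytes, 4e9)     # host link, 4 GB/s
print(points, t_global, t_host)
```

With these assumed values the estimate gives 250 million points, and moving a single array to the host takes about twenty times longer than reading it from global memory.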

The correlation step is also constrained by memory bandwidth. The fact that the source and receiver wavefields are propagated in opposite time directions is also problematic. Assuming a check-pointing scheme is used, we must continually transfer snapshots of the source wavefield across the PCI-Express bus while propagating the source wavefield forward, and transfer them back while propagating the receiver wavefield. Wavefield buffering is usually used to minimize the number of checkpoints and the resulting IO load, but given the relatively small amount of memory on GPGPUs, the checkpoints must be spaced much closer together. Constructing subsurface offset gathers is also hindered by the limited memory: the GPGPU can create offset gathers at only a small percentage of imaging locations before we again become memory limited.
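The cost of closer checkpoint spacing can be illustrated with a rough traffic estimate. This is a simplified sketch, not the scheme used by the authors: it assumes every checkpoint is one full wavefield snapshot that crosses the bus twice (out during forward propagation, back during receiver propagation) at the 4 GB/s host bandwidth quoted earlier. The step count, checkpoint intervals, and snapshot size are hypothetical values chosen for illustration.

```python
import math


def checkpoint_traffic(n_steps, interval, snapshot_bytes, bus_bytes_per_s=4e9):
    """Return (checkpoints, total bytes moved, idealized seconds on the bus)."""
    n_checkpoints = math.ceil(n_steps / interval)
    # Each snapshot is written to the host once and read back once.
    total_bytes = 2 * n_checkpoints * snapshot_bytes
    return n_checkpoints, total_bytes, total_bytes / bus_bytes_per_s


SNAPSHOT_BYTES = 1_000_000_000  # hypothetical: 250 million 4-byte values
N_STEPS = 4000                  # hypothetical number of time steps

sparse = checkpoint_traffic(N_STEPS, interval=100, snapshot_bytes=SNAPSHOT_BYTES)
dense = checkpoint_traffic(N_STEPS, interval=10, snapshot_bytes=SNAPSHOT_BYTES)
print(sparse)
print(dense)
```

In this example, shrinking the checkpoint interval from 100 steps to 10 steps raises the number of snapshots from 40 to 400 and the idealized bus time from 20 s to 200 s, so the transfer cost grows in direct proportion to the checkpoint density.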

2009-10-16