
Performance and Benchmark Summary

For simplicity and consistency, I tested my RTM code on synthetic data, using a simple subsurface velocity model with a few reflecting layers. The same velocity model has been used by other SEP students and researchers; although it does not capture the complex subsurface behavior of a real earth model, it provides sufficient complexity to verify that the RTM implementation functions correctly. My current work to integrate SEPlib with the GPGPU environment will enable benchmarking and testing on more standard data sets, eventually including field-recorded data. This will be an important step toward verifying GPGPU performance and comparing it to more traditional parallelism schemes.

Unless otherwise noted, the benchmark results I report were computed on a two-dimensional wavefield with a 1,000 x 1,000 grid.

rtmImagingPreliminary
Figure 4.
Preliminary RTM image results on a synthetic data set with a few simple horizontal reflectors. This test verified the functionality of my preliminary RTM implementation on the GPGPU system. Wave diffraction is visible at the corners, probably due to the unrealistically abrupt termination of the layers in this synthetic model. [NR]

GPU execution time for a 1,000,000-point grid (a 1,000 x 1,000 2D computational space) is shown in Figure 5, compared against a serial CPU implementation of RTM. Due to time constraints, I was not able to compare the GPGPU parallelization to other parallel RTM implementations.

totalExecutionTimeChart
Figure 5.
Total execution time for RTM imaging, comparing the serial CPU implementation (executed on the ProLiant Xeon host) to a single-GPU CUDA parallelization. [NR]

Evidently, GPU parallelization has a dramatic effect on the total execution time, reducing it by a factor of more than 10. With 240 times as many cores, however, this speedup is sublinear. Closer profiling of the CUDA execution revealed the computational breakdown shown in Figure 6. This profiling was accomplished with timer variables compiled into the device code, because standard code profilers have difficulty with the GPGPU environment. The bottleneck is clearly the host-device memory transfers required for the imaging condition. The primary focus of further research is to work around this limitation: first, by optimizing the memory transfers as much as possible; and, more importantly, by developing numerical schemes that can perform the imaging step without as much expensive transfer overhead.
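
The profiling described above relied on timer variables compiled into the device code. As a rough illustration of the same measurement, the sketch below uses host-side CUDA events to time the propagation kernel and the device-to-host copy needed for the imaging condition separately. It is a minimal, hypothetical example: the kernel, launch configuration, and names such as propagate_kernel are placeholders and are not taken from the actual RTM implementation.

/* Hypothetical timing sketch: CUDA events bracket the compute phase and the
 * device-to-host transfer separately. All names are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void propagate_kernel(float *wavefield, int nx, int nz)
{
    /* Stand-in for one finite-difference propagation step. */
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < nx && iz < nz)
        wavefield[iz * nx + ix] *= 1.0f;
}

int main(void)
{
    const int nx = 1000, nz = 1000;               /* 1,000 x 1,000 grid */
    const size_t bytes = (size_t)nx * nz * sizeof(float);

    float *h_wave = (float *)malloc(bytes);
    float *d_wave = NULL;
    cudaMalloc((void **)&d_wave, bytes);
    cudaMemset(d_wave, 0, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (nz + block.y - 1) / block.y);

    cudaEventRecord(t0, 0);
    propagate_kernel<<<grid, block>>>(d_wave, nx, nz);           /* compute phase */
    cudaEventRecord(t1, 0);
    cudaMemcpy(h_wave, d_wave, bytes, cudaMemcpyDeviceToHost);   /* transfer for imaging */
    cudaEventRecord(t2, 0);
    cudaEventSynchronize(t2);

    float kernel_ms = 0.0f, copy_ms = 0.0f;
    cudaEventElapsedTime(&kernel_ms, t0, t1);
    cudaEventElapsedTime(&copy_ms, t1, t2);
    printf("propagation: %.3f ms, device-to-host copy: %.3f ms\n",
           kernel_ms, copy_ms);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(d_wave);
    free(h_wave);
    return 0;
}

Pinned host buffers (cudaMallocHost) and asynchronous copies on separate streams are the standard CUDA mechanisms for reducing and overlapping this transfer cost, and would be a natural first step in the optimization mentioned above.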

execTimePieChart
Figure 6.
Breakdown of program execution time for the CUDA implementation. Very little time is spent executing numerical processing code (wave propagation or imaging condition). The vast majority of time is spent in host-device memory transfer over the PCI-e bus (between the ProLiant CPU system and the Nvidia Tesla S1070). [NR]

