To achieve maximum GPGPU performance, we need to maximize the number of threads per thread
block and have at least as many blocks in the grid as there are processing units. On
a high-end GPGPU, this means we would like to have at least 100,000 identical tasks.
Unfortunately,
this eliminates some of the optimization opportunities available on the CPU.
Cache-oblivious algorithms are impractical, and we are limited in our ability to compress
model parameters.
Both of these place significantly more strain on global memory.
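To give a sense of the scale involved, the back-of-the-envelope arithmetic below shows how a six-figure thread count arises. The multiprocessor count, block size, and blocks-per-unit figures are illustrative assumptions for a hypothetical high-end card, not measurements of any specific device:

```python
# Rough launch-configuration arithmetic for a hypothetical high-end GPGPU.
# All figures below are illustrative assumptions, not specs of a real card.
THREADS_PER_BLOCK = 1024   # assumed maximum threads per thread block
NUM_PROCESSING_UNITS = 100 # assumed number of processing units (SMs)
BLOCKS_PER_UNIT = 1        # at least one resident block per unit

min_blocks = NUM_PROCESSING_UNITS * BLOCKS_PER_UNIT
min_threads = min_blocks * THREADS_PER_BLOCK
print(min_blocks, min_threads)  # on the order of 100,000 identical tasks
```

With larger block counts per processing unit (common in practice to hide memory latency), the required task count only grows.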
A potentially greater problem is introduced by the boundary conditions. The SIMD
parallelism required for performance on the GPU makes complex absorbing boundary
conditions, such as the perfectly matched layer (PML) and most likely the zero-slope
boundary condition, impractical. As a result, we are limited to damping schemes that
require us to expand the computational domain, and whose absorption is sub-optimal
compared to PML schemes.
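A minimal sketch of the kind of damping scheme described above is an exponential taper applied in a sponge layer of extra grid cells around the domain (a Cerjan-style absorbing layer). The pad width and damping strength below are illustrative assumptions, and the code shows only the one-dimensional profile:

```python
import numpy as np

def damping_profile(n, pad=20, alpha=0.015):
    """1-D damping multipliers for n interior points padded by `pad`
    absorbing cells on each side (pad and alpha are assumed values).
    Note the grid grows from n to n + 2*pad cells: this is the
    computational-domain expansion mentioned in the text."""
    total = n + 2 * pad
    d = np.ones(total)
    # Cerjan-style taper: strongest damping at the outer edge,
    # approaching 1 (no damping) toward the interior.
    taper = np.exp(-(alpha * np.arange(pad, 0, -1)) ** 2)
    d[:pad] = taper            # left sponge
    d[-pad:] = taper[::-1]     # right sponge (mirror image)
    return d

# Applied once per time step: the wavefield is simply multiplied by d,
# so outgoing energy is attenuated gradually inside the sponge.
u = np.random.rand(100 + 40)   # interior plus sponge cells
u *= damping_profile(100)
```

Because every grid point executes the same multiply, this scheme is branch-free and maps cleanly onto SIMD hardware, unlike PML, at the cost of the enlarged domain and weaker absorption.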
Selecting the right hardware for Reverse Time Migration