
We conclude that GPUs can be effective for linearised inversion. By storing the velocity stencil along with the wavefield values in shared memory, we can perform accelerated adjoint propagation. Augmenting this with a random-boundary RTM scheme and a CPU-based model stepper lets us perform iterative least-squares linearised inversion efficiently, with little time lost to data movement.
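The shared-memory staging idea above can be sketched as a CUDA kernel. This is a minimal, hypothetical illustration, not the authors' actual code: the kernel name, the tile and stencil sizes, and the `vel2dt2` array (velocity squared times the time step squared) are all assumptions; halo-load bounds checks are omitted for brevity.

```cuda
#define RADIUS 4    // half-width of the spatial FD stencil (illustrative)
#define TILE   16   // threads per block dimension (illustrative)

__constant__ float c[RADIUS + 1];   // FD coefficients, set from the host

__global__ void fd_step_2d(const float *p_prev, const float *p_cur,
                           float *p_next, const float *vel2dt2,
                           int nx, int nz)
{
    // Tile of the wavefield plus stencil halo, and the local velocity
    // term, staged in shared memory so the stencil reads hit fast
    // on-chip storage instead of global memory.
    __shared__ float s_p[TILE + 2 * RADIUS][TILE + 2 * RADIUS];
    __shared__ float s_v[TILE][TILE];

    int ix = blockIdx.x * TILE + threadIdx.x;
    int iz = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + RADIUS;
    int lz = threadIdx.y + RADIUS;
    if (ix >= nx || iz >= nz) return;
    int idx = iz * nx + ix;

    // Stage the interior point and the velocity term.
    s_p[lz][lx] = p_cur[idx];
    s_v[threadIdx.y][threadIdx.x] = vel2dt2[idx];

    // Threads near the tile edge also load the halo points
    // (boundary checks omitted for brevity).
    if (threadIdx.x < RADIUS) {
        s_p[lz][lx - RADIUS] = p_cur[idx - RADIUS];
        s_p[lz][lx + TILE]   = p_cur[idx + TILE];
    }
    if (threadIdx.y < RADIUS) {
        s_p[lz - RADIUS][lx] = p_cur[idx - RADIUS * nx];
        s_p[lz + TILE][lx]   = p_cur[idx + TILE * nx];
    }
    __syncthreads();

    // Second-order-in-time update; Laplacian read from shared memory.
    float lap = 2.0f * c[0] * s_p[lz][lx];
    for (int r = 1; r <= RADIUS; ++r)
        lap += c[r] * (s_p[lz][lx - r] + s_p[lz][lx + r]
                     + s_p[lz - r][lx] + s_p[lz + r][lx]);
    p_next[idx] = 2.0f * s_p[lz][lx] - p_prev[idx]
                + s_v[threadIdx.y][threadIdx.x] * lap;
}
```

Staging one tile per block amortises each global-memory read across every thread in the block that touches it, which is what makes a shared-memory stencil faster than the naive one-read-per-neighbour version.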

When the domain size exceeds the memory of a single GPU, the domain can be decomposed across multiple devices, whether on one node with several GPUs, several nodes with a single GPU each, or any combination of the two. By overlapping halo-communication calls with computation on interior data, we can hide nearly all communication cost, so runtime still scales linearly with model size.
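The overlap pattern can be sketched with two CUDA streams on the host side. This is a hypothetical sketch under stated assumptions, not the paper's implementation: `interior_kernel`, `boundary_kernel`, the neighbour buffers, and peer-to-peer copies standing in for whatever transport (P2P or host staging) a given node layout requires are all illustrative.

```cuda
// Assumed: peer access already enabled between neighbouring devices,
// grids/buffers already allocated. Per-time-step loop only.
cudaStream_t s_int, s_halo;
cudaStreamCreate(&s_int);
cudaStreamCreate(&s_halo);

for (int it = 0; it < nt; ++it) {
    // 1. Start the interior update first: it touches no halo points,
    //    so it can run while boundary data is still in flight.
    interior_kernel<<<grid_int, block, 0, s_int>>>(p_prev, p_cur, p_next);

    // 2. Concurrently exchange halo strips with the neighbouring
    //    subdomain on the other stream.
    cudaMemcpyAsync(halo_recv_lo, neighbour_hi, halo_bytes,
                    cudaMemcpyDeviceToDevice, s_halo);
    cudaMemcpyAsync(halo_recv_hi, neighbour_lo, halo_bytes,
                    cudaMemcpyDeviceToDevice, s_halo);

    // 3. Update the thin boundary strips once their halos have
    //    arrived; this is ordered after the copies on s_halo.
    boundary_kernel<<<grid_bnd, block, 0, s_halo>>>(p_prev, p_cur, p_next);

    // 4. Both streams must complete before advancing the time step.
    cudaDeviceSynchronize();
    float *tmp = p_prev; p_prev = p_cur; p_cur = p_next; p_next = tmp;
}
```

As long as the interior kernel takes longer than the halo copies plus the boundary kernel, the communication is fully hidden, which is the condition under which the linear time scaling with model size holds.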

Combining these schemes gives us the potential to perform large-scale inversion with a high degree of fine-grained parallelism.