Selecting the right hardware for Reverse Time Migration
From the above list of bottlenecks, one could conclude that optimization and algorithmic techniques concentrate on reducing the total number of operations and on figuring out how to minimize the need to go to disk. This is an incomplete and, particularly with regard to reducing the operation count, a misleading description of the problem. Reducing the total number of operations is important, but in most cases the real problem is the 'cost' of transferring data from the memory device where it resides to the processing unit. That cost has two parts: the latency, which is the time it takes for the first byte to arrive, and the bandwidth, which is how much data can be transferred in a given time. A 'memory device' can mean many things. On the latest Intel CPU (Nehalem) the possibilities include the L1, L2, and L3 caches; the Random Access Memory (RAM) of the system; disk, whether solid state or conventional; and the memory of another processor or node. Accelerators such as GPUs or FPGAs add further levels to this hierarchy. The core of many optimizations and algorithmic techniques is to get as much of the data as possible into the least costly memory device before the processor needs it.
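As a rough illustration of that cost (ballpark figures, not measurements of any particular system), the time to move a block of data can be modeled as

$$T_{\mathrm{transfer}} \approx T_{\mathrm{latency}} + \frac{\text{bytes moved}}{\text{bandwidth}}.$$

With, say, 100 ns of main-memory latency and 10 GB/s of sustained bandwidth, a single 64-byte cache line costs roughly $100\,\mathrm{ns} + 6.4\,\mathrm{ns} \approx 106\,\mathrm{ns}$: for small transfers the latency term dominates, which is why the pattern of memory accesses matters at least as much as the total volume of data moved.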
To understand how latency and bandwidth can play an important role in seismic modeling, let's
take a closer look at the general structure of algorithm 2
with a simplistic view of a conventional
CPU, where vectorization, prefetching, and pipelining don't exist.
To compute one output value $u^{n+1}_{ix,iy,iz}$, the wavefield at the next time step, we need one element from each of the 3-d volumes $v$ (the velocity model) and $u^{n-1}$ (the wavefield at the previous time step), and seven elements from $u^{n}$ (the wavefield at the current time step): the center point and its six nearest neighbors. Let's ignore the source term, as it is non-zero at very few grid points. So, to compute the very first element of the output array, we need to read 9 values (36 bytes) and store 4 bytes. In reality, we are going to read significantly more than 36 bytes. CPUs don't transfer individual bytes from main memory; instead they transfer cache lines, made up of between 8 and 256 contiguous bytes (we will use 64 bytes for this example), to the L1 cache. The individual value we want is then moved into a register for the calculation at hand. So, for this idealized case, and taking $iz$ as the unit-stride, fastest-varying axis, we incur the cost of transferring the following eight cache lines to the L1 cache.
Cache line | Usage |
$v_{ix,iy,iz}$ | Velocity |
$u^{n-1}_{ix,iy,iz}$ | Used for time derivative |
$u^{n}_{ix,iy,iz}$ | Used for time and z derivative |
$u^{n}_{ix,iy-1,iz}$ | Used for y derivative |
$u^{n}_{ix,iy+1,iz}$ | Used for y derivative |
$u^{n}_{ix-1,iy,iz}$ | Used for x derivative |
$u^{n}_{ix+1,iy,iz}$ | Used for x derivative |
$u^{n+1}_{ix,iy,iz}$ | Output |
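To make the access pattern concrete, here is a minimal sketch in C of the kind of triple loop discussed above. It illustrates a second-order-in-space, second-order-in-time update only; the array names (vel, prev, cur, next), the indexing macro, and the premultiplied velocity term are assumptions of this sketch, not the actual code of algorithm 2.

#include <stddef.h>

/* One time step of a 7-point acoustic stencil (illustrative sketch).
 * vel  : velocity model, assumed premultiplied by (dt/h)^2
 * prev : wavefield at time step n-1
 * cur  : wavefield at time step n
 * next : wavefield at time step n+1 (output)
 * iz is the unit-stride (fastest) axis, so cur at iz-1, iz, iz+1
 * usually falls within the same cache line as the center point. */
static void step(int nx, int ny, int nz,
                 const float *vel, const float *prev,
                 const float *cur, float *next)
{
#define IDX(ix, iy, iz) ((((size_t)(ix) * ny) + (iy)) * nz + (iz))
    for (int ix = 1; ix < nx - 1; ix++) {              /* outer loop              */
        for (int iy = 1; iy < ny - 1; iy++) {          /* middle loop             */
            for (int iz = 1; iz < nz - 1; iz++) {      /* fast (unit-stride) axis */
                float lap =
                      cur[IDX(ix, iy, iz - 1)] + cur[IDX(ix, iy, iz + 1)]  /* z derivative */
                    + cur[IDX(ix, iy - 1, iz)] + cur[IDX(ix, iy + 1, iz)]  /* y derivative */
                    + cur[IDX(ix - 1, iy, iz)] + cur[IDX(ix + 1, iy, iz)]  /* x derivative */
                    - 6.0f * cur[IDX(ix, iy, iz)];
                next[IDX(ix, iy, iz)] =
                      2.0f * cur[IDX(ix, iy, iz)] - prev[IDX(ix, iy, iz)]  /* time derivative */
                    + vel[IDX(ix, iy, iz)] * lap;                          /* velocity        */
            }
        }
    }
#undef IDX
}

Each output point reads exactly the nine values, and touches the eight cache-line streams, listed in the table above.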
The cost will be dominated by the latency associated with
transferring eight cache lines from main memory.
Everything we need to compute the next 15 elements, $u^{n+1}_{ix,iy,iz+1}$ through $u^{n+1}_{ix,iy,iz+15}$, is now sitting in our L1 cache, which tends to have a latency an order of magnitude lower than main memory. When we hit $u^{n+1}_{ix,iy,iz+16}$ we will have to transfer additional cache lines from main memory to the L1 cache. We will continue in this pattern along the $z$ axis. At some stage the L1 cache becomes full, and older cache lines are evicted to the L2 cache. When the L2 cache is full, cache lines are pushed out to the L3 cache, and from there to main memory.
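For a sense of scale (rough, Nehalem-class figures rather than measurements): an L1 hit costs on the order of 4 cycles, an L2 hit roughly 10 cycles, an L3 hit a few tens of cycles, and a trip to main memory on the order of 100 ns, a couple of hundred cycles at typical clock rates. In the simplistic no-prefetching model used here, a grid point whose eight cache lines must all come from main memory therefore pays hundreds of nanoseconds of memory latency, while one whose lines are already sitting in L1 pays on the order of ten nanoseconds.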
Things get more interesting when we begin the next iteration of the $y$ loop. We will definitely have to read in new cache lines for the velocity, the wavefield at the previous time step, and the output at the new row, along with two $u^{n}$ lines that enter the stencil for the first time, but the remaining three cache lines needed for the calculation have been read before and may exist in L1, L2, L3, or main memory depending on the size of our domain. The larger our domain, the more likely it is that those cache lines have migrated back to main memory, costing us that order-of-magnitude additional penalty. When we begin the next iteration of the $x$ loop, only three of the cache lines needed (the velocity, previous-time-step, and output lines for the new point) haven't been previously read. For higher-order spatial derivatives, the problem is significantly worse. For example, in the case of a 14th-order-in-space stencil, we need to read 46 different cache lines. The odds of reuse go down significantly, increasing our cache misses and straining the total bandwidth needed between our caches and main memory.
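A rough working-set calculation (illustrative numbers only, assuming 4-byte samples and the second-order stencil sketched above) shows why domain size decides where those previously read lines end up. One sweep along the fast axis streams through eight row-long sequences of data, roughly $8 \times nz \times 4$ bytes; for $nz = 1000$ that is already about 32 KB, the size of a typical L1 data cache, so lines needed again at the next $y$ iteration usually have to come back from L2 or L3. Reuse across the outer $x$ loop requires data from an entire plane, roughly $8 \times ny \times nz \times 4$ bytes, which is about 32 MB for a $1000 \times 1000$ plane and exceeds even a large shared L3, so those lines generally have to be fetched from main memory again.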