Modeling costs

From the above list of bottlenecks, one could conclude that optimization and algorithmic techniques concentrate on reducing the total number of operations and on minimizing the need to go to disk. This is an incomplete and, particularly with regard to operation count, misleading description of the problem. Reducing the total number of operations is important, but in most cases the real problem is the `cost' of transferring data from wherever it resides on some memory device to the processing unit. This cost has two parts: the latency, the time it takes for the first byte to arrive, and the bandwidth, the amount of data that can be transferred in a given time. A `memory device' here can mean many things. On the latest Intel CPU (Nehalem) the possibilities include the L1, L2, and L3 caches; the system's Random Access Memory (RAM); the disk, whether solid state or conventional; and the memory of another processor or node. Accelerators such as GPUs or FPGAs add further levels to this hierarchy. The core of many optimizations and algorithmic techniques is to get as much of the data as possible onto the least costly memory device before the processor needs it.
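To make the two-part cost concrete, a minimal C sketch of such a latency-plus-bandwidth model is given below. The structure, names, and all of the numbers are illustrative assumptions only, not measurements of any particular CPU, memory system, or disk.

#include <stdio.h>

/* Simple two-parameter cost model: time = latency + bytes / bandwidth.
   All numbers below are illustrative assumptions, not measurements.   */
typedef struct {
    const char *name;
    double latency_ns;      /* time to deliver the first byte    */
    double bandwidth_gbs;   /* sustained transfer rate in GB/s   */
} mem_level;

static double transfer_ns(const mem_level *m, double bytes)
{
    return m->latency_ns + bytes / (m->bandwidth_gbs * 1e9) * 1e9;
}

int main(void)
{
    /* Hypothetical hierarchy, ordered from least to most costly. */
    mem_level levels[] = {
        { "L1 cache",      1.0, 100.0 },
        { "L2 cache",      4.0,  50.0 },
        { "L3 cache",     15.0,  30.0 },
        { "main memory",  60.0,  10.0 },
        { "disk",        5.0e6,   0.1 },
    };
    double bytes = 64.0;  /* one cache line */
    for (int i = 0; i < 5; i++)
        printf("%-12s %12.1f ns to move %.0f bytes\n",
               levels[i].name, transfer_ns(&levels[i], bytes), bytes);
    return 0;
}

For a single cache line the latency term dominates at every level; the bandwidth term only matters once many lines are streamed, which is exactly the situation in the modeling loop discussed next.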

To understand how latency and bandwidth play an important role in seismic modeling, let's take a closer look at the general structure of Algorithm 2 with a simplistic view of a conventional CPU, one in which vectorization, prefetching, and pipelining don't exist. To compute $s(2,2,2,3)$ we need one element from each of the 3-d volumes $s(:,:,:,1)$, $w$, and $v$, and seven elements from $s(:,:,:,2)$. Let's ignore the source term $w$, as it is non-zero at only a few locations. So, to compute the very first element of the output array, we need to read 9 values (36 bytes) and store 4 bytes. In reality, we are going to read significantly more than 36 bytes. CPUs don't transfer individual bytes from main memory; instead they transfer cache lines, each made up of 8 to 256 contiguous bytes, to the L1 cache (we will use 64-byte lines, holding 16 single-precision values, in this example). The individual value we want is then moved to a register for the calculation. So, for this idealized case, we incur the cost of transferring the following eight cache lines to the L1 cache (a code sketch of the corresponding loop appears after the table).

Cache line           Usage
$v(1:16,2,2)$        Velocity
$s(1:16,2,2,1)$      Used for time derivative
$s(1:16,2,2,2)$      Used for time and z derivative
$s(1:16,1,2,2)$      Used for y derivative
$s(1:16,3,2,2)$      Used for y derivative
$s(1:16,2,3,2)$      Used for x derivative
$s(1:16,2,1,2)$      Used for x derivative
$s(1:16,2,2,3)$      Output

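The access pattern in the table corresponds to an inner loop of roughly the following form, shown here as a C sketch with flattened single-precision arrays and the $iz$ axis fastest in memory. The names (nz, ny, nx, the coefficients c0, cz, cy, cx, and dt2) and the exact update formula are assumptions for illustration, not details taken from Algorithm 2; the source term $w$ is omitted, as in the discussion above.

/* Sketch of a second-order-in-space update for one time step, assuming
 * arrays flattened as s[ix][iy][iz] with iz fastest, so one 64-byte
 * cache line holds 16 consecutive iz values.                           */
#define IDX(ix, iy, iz) (((size_t)(ix) * ny + (iy)) * nz + (iz))

void step(int nx, int ny, int nz,
          const float *v,        /* velocity                     */
          const float *s_prev,   /* s(:,:,:,1): time level t-1   */
          const float *s_cur,    /* s(:,:,:,2): time level t     */
          float *s_next,         /* s(:,:,:,3): time level t+1   */
          float c0, float cz, float cy, float cx, float dt2)
{
    for (int ix = 1; ix < nx - 1; ix++)
        for (int iy = 1; iy < ny - 1; iy++)
            for (int iz = 1; iz < nz - 1; iz++) {
                size_t i = IDX(ix, iy, iz);
                /* Nine reads per output point: seven from s_cur,
                 * one from s_prev, one from v.                    */
                float lap = c0 * s_cur[i]
                          + cz * (s_cur[i - 1]  + s_cur[i + 1])
                          + cy * (s_cur[i - nz] + s_cur[i + nz])
                          + cx * (s_cur[i - (size_t)ny * nz]
                                + s_cur[i + (size_t)ny * nz]);
                s_next[i] = 2.0f * s_cur[i] - s_prev[i]
                          + dt2 * v[i] * v[i] * lap;
            }
}

Because iz is the unit-stride direction, the three z-direction values usually fall in the same cache line, which is why the table lists only eight lines rather than ten.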
The cost will be dominated by the latency associated with transferring eight cache lines from main memory. Most of what we need to compute the next elements along the $iz$ axis, $s(3:16,2,2,3)$, is now sitting in our L1 cache, which tends to have a latency an order of magnitude lower than main memory. When we hit $s(17,2,2,3)$ we will have to transfer additional cache lines from main memory to the L1 cache. We will continue in this pattern along the $iz$ axis. At some stage the L1 cache becomes full and cache lines will instead be evicted to the L2 cache; when the L2 cache is full, cache lines will be pushed to the L3 cache, and then to main memory.

Things get more interesting when we begin the $iy=3$ loop. We will definitely have to read in the new cache lines $v(1:16,3,2)$, $s(1:16,3,2,1)$, $s(1:16,4,2,2)$, $s(1:16,3,1,2)$, and $s(1:16,3,3,2)$, but the remaining three cache lines needed for the calculation have been read before and may reside in L1, L2, L3, or main memory, depending on the size of our domain. The larger the domain, the more likely that those cache lines will have migrated back to main memory, costing us the additional order-of-magnitude latency penalty. When we begin the $ix=3$ loop, only three of the cache lines needed ($v(1:16,2,3)$, $s(1:16,2,3,1)$, and $s(1:16,2,4,2)$) haven't been previously read. For higher-order spatial derivatives the problem is significantly worse: in the case of a 14th order in space stencil, for example, we need to read 46 different cache lines. The odds of reuse go down significantly, increasing our cache misses and straining the total bandwidth needed between our caches and main memory.
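A back-of-the-envelope sketch of the reuse distances involved makes the domain-size dependence concrete. The domain size, the figure of roughly eight cache lines streamed per sixteen outputs (an upper bound taken loosely from the table above), and the 8 MB L3 used for comparison are all illustrative assumptions.

#include <stdio.h>

/* Rough reuse-distance estimate for the loop order above (iz innermost,
 * then iy, then ix): a y-neighbour cache line is reused after roughly
 * one (iy,ix) column of work, an x-neighbour line only after roughly
 * ny columns.  64-byte lines, 4-byte elements, ~8 lines per 16 outputs
 * (an upper bound) are assumed.                                        */
int main(void)
{
    long nz = 1000, ny = 1000;             /* assumed domain size       */
    long line = 64;                        /* cache-line size in bytes  */
    long lines_per_column = 8 * (nz / 16); /* streamed per (iy,ix) pair */
    long bytes_per_column = lines_per_column * line;

    double y_reuse_mb = bytes_per_column / 1e6;
    double x_reuse_mb = (double)bytes_per_column * ny / 1e6;

    printf("data streamed before reusing a y-neighbour line: %8.2f MB\n",
           y_reuse_mb);
    printf("data streamed before reusing an x-neighbour line: %8.2f MB\n",
           x_reuse_mb);
    printf("compare with, e.g., an 8 MB L3 cache\n");
    return 0;
}

For a $1000^3$ domain the y-neighbour lines are reused after only a few tens of kilobytes of traffic and so are almost certainly still in cache, while the x-neighbour lines are reused only after tens of megabytes have streamed through, by which time they have very likely been evicted back to main memory.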

