
Transpose

All transposes can be mapped into a single five-dimensional template ($n1,n2,n3,n4,n5$). The first axis is the product of the data element size and the lengths of the fastest axes that are not transposed over. The second axis is the fastest axis that needs to be transposed. The third axis is the product of the axis lengths between the transposed axes. The fourth axis is the slowest axis that we want to transpose over. The fifth axis is the product of all axes slower than the last axis we wish to transpose over. We can extend the definition of the second and fourth axes to compose sets of adjacent axes, enabling more sophisticated transposes. For example, in the case of wave-equation migration, we begin with a five-dimensional dataset with element size $esize$ that is ordered (midpoint x, midpoint y, hx, hy, depth), i.e., of size ($ncmpx,ncmpy,nhx,nhy,nz$), and we wish to end up with ($nz,nhx,nhy,ncmpx,ncmpy$). We can simulate merging the first two axes of the input and map to our template ($esize,ncmpx*ncmpy,nhx*nhy,nz,1$).

Using this template, the transpose algorithm can be written in a push (loop over input) or pull (loop over output) manner. For EcoRAM, and in many other cases (such as when the output is piped), the pull method is more efficient. The basic transpose algorithm using the five-dimensional template then takes the following form.
\begin{algorithm}
\caption{Simple transpose}
\begin{algorithmic}
\STATE $iout=0$
\FOR{$i5=0$ to $n5-1$}
\FOR{$i2=0$ to $n2-1$}
\FOR{$i3=0$ to $n3-1$}
\FOR{$i4=0$ to $n4-1$}
\STATE $iin=n1*(i2+n2*(i3+n3*(i4+n4*i5)))$
\STATE copy($out[iout]$, $in[iin]$, $n1$)
\STATE $iout=iout+n1$
\ENDFOR
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithmic} \end{algorithm}
This simple algorithm assumes that you can hold both the input and output matrices in RAM. A problem with this simple approach is the very poor use of input cache lines for small $n1$.

If we can hold $n1*n2*n3*n4$ in memory, we can get acceptable performance either with algorithm 4 or with a slight modification that processes each $i5$ block in turn. We could still use the basic template for large problems by mmapping the input and output files, but the cache-miss problem would be further exacerbated. A better alternative is to introduce two temporary buffers, tin and tout. These buffers are of size $n1*n4*n3*nmx$, where $nmx$ is chosen so that the combined size of tin and tout does not exceed DRAM. The buffered algorithm then takes the following form.
\begin{algorithm}
\caption{Buffered transpose}
\begin{algorithmic}
\FOR{$i5=0$ to $n5-1$}
\STATE $i2=0$
\WHILE{$i2<n2$}
\STATE $nbuf=\min(nmx,n2-i2)$
\FOR{each $(i4,i3)$ pair}
\STATE read $n1*nbuf$ contiguous elements into $tin$
\ENDFOR
\STATE reorder $tin$ into $tout$ in output order
\STATE write $tout$ sequentially to the output
\STATE $i2=i2+nbuf$
\ENDWHILE
\ENDFOR
\end{algorithmic} \end{algorithm}
The larger $nmx$ is, the better the performance. Figure 1 shows that even a RAM-based system benefits from the buffered approach. Note how we gain a performance advantage of greater than a factor of six with larger $n1*nbuf$ sizes.

Figure 1.
The number of elements per second versus $\log(nbuf*n1)$ that can be read using a completely in-core solution. The problem size, using the generic template, is ($1,500,72,131072,1$). Note how we can achieve a factor-of-six improvement through better cache-line use. $\mathbf{[NR]}$

Figure 2 compares the performance of EcoRAM versus RAM. Note the similarity of the two curves. The `*' in the figure shows the comparable disk approach. The disk approach is $\frac{1}{10}$th the speed of the slowest EcoRAM result and $\frac{1}{200}$th that of the optimal buffer choice.

Figure 2.
The number of elements per second versus $\log(nbuf*n1)$. Note the `*' indicating the disk-I/O performance. The problem size, using the generic template, is ($1,500,72,131072,1$). $\mathbf{[NR]}$

As a final test, we transposed a float dataset of size $(672,216,72,1,2000)$, switching axes 1 and 2 with axis 5. Using the buffered approach of algorithm 5, a conventional disk took 1293 minutes (using an intermediate buffer size of 1 GB), while the same dataset took 22 minutes using EcoRAM, a $60\times$ performance improvement.



2009-05-05