Visualization and data reordering using EcoRAM |
Using this template, the transpose algorithm can be written in a push (loop
over input) or pull (loop over output) manner. For EcoRAM, and in many other
cases (such as when the output is piped), the pull method is more efficient.
The basic transpose algorithm using the five-dimensional template then
takes the following form.
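The algorithm listing itself is not reproduced in this section, so the following is only a minimal sketch of what push and pull variants could look like. It assumes, since the template is not defined here, that the five-dimensional template describes the data as axes n1 through n5 with n1 the fastest, and that the transpose swaps axes 2 and 4. The function names `transpose_pull` and `transpose_push` are hypothetical.

```python
def transpose_pull(src, n1, n2, n3, n4, n5):
    """Pull form: walk the OUTPUT sequentially and fetch from the input.

    Assumed layout (not taken from the source): flat array, axis 1 fastest,
    input index = i1 + n1*(i2 + n2*(i3 + n3*(i4 + n4*i5))).
    The output has axes 2 and 4 swapped.
    """
    out = [None] * len(src)
    o = 0  # sequential write position in the output
    for i5 in range(n5):
        for i2 in range(n2):          # input axis 2 becomes output axis 4
            for i3 in range(n3):
                for i4 in range(n4):  # input axis 4 becomes output axis 2
                    base = (((i5 * n4 + i4) * n3 + i3) * n2 + i2) * n1
                    out[o:o + n1] = src[base:base + n1]  # strided read
                    o += n1
    return out


def transpose_push(src, n1, n2, n3, n4, n5):
    """Push form: walk the INPUT sequentially and scatter into the output."""
    out = [None] * len(src)
    i = 0  # sequential read position in the input
    for i5 in range(n5):
        for i4 in range(n4):
            for i3 in range(n3):
                for i2 in range(n2):
                    base = (((i5 * n2 + i2) * n3 + i3) * n4 + i4) * n1
                    out[base:base + n1] = src[i:i + n1]  # strided write
                    i += n1
    return out
```

In this sketch the pull form writes the output strictly in order, which is why it suits a piped output, while every read touches only `n1` contiguous input elements.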
This simple algorithm assumes that you can hold both the input and
output matrices in RAM. A problem with this simple approach is the
very poor use of input cache lines for small .
If we can hold in memory we can get acceptable performance
with either algorithm 4 or a slight modification
that processes each block in turn. We could still use the
basic template for large problems by mmapping the input and output files, but
the cache miss problem would be further exacerbated. A better alternative
is to introduce two temporary buffers, tin and tout. These buffers
are of size , where is chosen so that the combined
size of tin and tout does not exceed the available DRAM. The buffered algorithm
then takes the following form.
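The buffered listing is likewise missing from this section, and the buffer dimensions are elided, so the sketch below should not be read as the author's algorithm 5. It shows one plausible blocked scheme under the same assumed template (axes n1 through n5, n1 fastest, axes 2 and 4 swapped). The tile edge `nb` and the function name `transpose_buffered` are hypothetical; only the buffer names `tin` and `tout` come from the text.

```python
def transpose_buffered(src, n1, n2, n3, n4, n5, nb):
    """Blocked transpose of axes 2 and 4 using two temporary buffers.

    nb (hypothetical) is the tile edge along axes 2 and 4; tin and tout each
    hold at most nb*nb*n1 elements and would be sized to fit in fast memory.
    """
    out = [None] * len(src)
    for i5 in range(n5):
        for i3 in range(n3):
            for b2 in range(0, n2, nb):
                m2 = min(nb, n2 - b2)          # tile extent along axis 2
                for b4 in range(0, n4, nb):
                    m4 = min(nb, n4 - b4)      # tile extent along axis 4
                    # Gather: each read is a contiguous run of m2*n1 elements.
                    tin = []
                    for i4 in range(b4, b4 + m4):
                        base = (((i5 * n4 + i4) * n3 + i3) * n2 + b2) * n1
                        tin.extend(src[base:base + m2 * n1])
                    # Transpose the tile entirely inside the buffers.
                    tout = [None] * (m2 * m4 * n1)
                    for j4 in range(m4):
                        for j2 in range(m2):
                            s = (j4 * m2 + j2) * n1
                            d = (j2 * m4 + j4) * n1
                            tout[d:d + n1] = tin[s:s + n1]
                    # Scatter: each write is a contiguous run of m4*n1 elements.
                    for j2 in range(m2):
                        base = (((i5 * n2 + b2 + j2) * n3 + i3) * n4 + b4) * n1
                        row = tout[j2 * m4 * n1:(j2 + 1) * m4 * n1]
                        out[base:base + m4 * n1] = row
    return out
```

With `nb = 1` this degenerates to element-by-element copies of `n1` items; larger tiles lengthen the contiguous runs on both the input and output side, which is the effect the buffered approach relies on.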
The larger , the better the performance. Figure 1 shows that even
a RAM-based system benefits from the buffered approach. Note how we can
gain a speedup of more than a factor of six with larger sizes.
Figure 1. The number of elements per second vs that can be read using a completely in-core solution. The problem size, using the generic template, is ( ). Note how we can achieve a factor of six improvement by better cache line use.
Figure 2 compares the performance of EcoRAM versus RAM. Note the similarity of the two curves. The '*' in the figure shows the comparable disk approach. The disk approach reaches only a fraction of the speed of the slowest EcoRAM result, and a smaller fraction still of the speed obtained with the optimal buffer choice.
Figure 2. The number of elements per second vs. . Note the '*' indicating the disk IO performance. The problem size, using the generic template, is ( ).
As a final test, we transposed a float dataset of size switching axes 1,2 with axis 5. Using the buffered approach of algorithm 5, a conventional disk took 1293 minutes (using an intermediate buffer size of 1 GB), while the same dataset took 22 minutes using EcoRAM, roughly a 59-fold performance improvement.