The package is multi-threaded across frequencies within each depth-continuation step. Each thread works independently on a different frequency slice of the data, performing both the FFTs and the application of the phase-shift operator. The summation over frequency is performed serially, as is the I/O for both the data and the velocity function. The velocity function is shared between threads rather than replicated.
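The threading pattern described above can be sketched as follows. This is only an illustration, not the package's code: the function names `phase_shift_slice` and `depth_step`, the constant velocity, the sign convention of the phase shift, and the zeroing of evanescent components are all assumptions made for the example. The data are taken to be already in the frequency-wavenumber domain, so the FFTs are omitted. Note also that Python threads do not give true parallelism for pure-Python arithmetic; the sketch shows only how the work is divided.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def phase_shift_slice(slice_kx, omega, kx_values, velocity, dz):
    """Continue one frequency slice by one depth step (hypothetical helper)."""
    out = []
    for value, kx in zip(slice_kx, kx_values):
        kz2 = (omega / velocity) ** 2 - kx ** 2
        if kz2 > 0.0:
            # Propagating component: apply the phase shift exp(i * kz * dz).
            out.append(value * cmath.exp(1j * (kz2 ** 0.5) * dz))
        else:
            # Evanescent component: simply dropped in this sketch (assumption).
            out.append(0j)
    return out


def depth_step(wavefield, omegas, kx_values, velocity, dz, n_threads=4):
    """One depth step: threaded over frequency, then a serial sum over frequency."""
    def work(pair):
        slice_kx, omega = pair
        # velocity and kx_values are shared read-only, not copied per thread.
        return phase_shift_slice(slice_kx, omega, kx_values, velocity, dz)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        shifted = list(pool.map(work, zip(wavefield, omegas)))

    # Serial summation over frequency (the imaging step).
    image = [0j] * len(kx_values)
    for slice_kx in shifted:
        for i, value in enumerate(slice_kx):
            image[i] += value
    return shifted, image
```

Because each frequency slice is read and written by exactly one thread, no locking is needed inside the loop over frequencies; only the final summation touches data from all slices, which is why it is kept serial here.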
A few changes were needed to make the code run correctly with the PGI compiler. The most significant, and most annoying, of them was related to F90 allocatable arrays. The PGI compiler requires that these arrays be allocated when they are passed as arguments to lower-level subroutines, even when they are never used in the computations. A simple workaround that avoids allocating more memory than necessary is to allocate the unused arrays with every axis of length one.
To analyze the effects of out-of-cache computations, we ran the benchmarks on problems of two different sizes: a ``small problem'', for which two frequency slices (one for the input and one for the output) fit into the L2 caches of all the computers tested, and a ``large problem'', for which two frequency slices did not fit into the L2 cache of any of the computers tested.
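The criterion separating the two problem sizes amounts to a simple working-set calculation, sketched below. The function name `slices_fit_in_cache`, the two-dimensional slice shape, and the 8 bytes per sample (single-precision complex) are assumptions for illustration; the actual slice dimensions and cache sizes are not stated in the text.

```python
def slices_fit_in_cache(nx, ny, l2_bytes, bytes_per_sample=8, n_slices=2):
    """Return (working_set_bytes, fits) for n_slices frequency slices.

    Assumes each slice is an nx-by-ny array of complex samples and that
    only the input and output slices need to be resident at once.
    """
    working_set = n_slices * nx * ny * bytes_per_sample
    return working_set, working_set <= l2_bytes
```

For example, under these assumptions a 128-by-128 slice pair occupies 262144 bytes and fits in a 512 KiB cache, whereas a 1024-by-1024 pair occupies 16 MiB and does not fit in a 4 MiB cache.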
On the SGI platforms, the package usually uses the FFTs provided by the SGI mathematical libraries. These libraries are not available (yet?) on Linux platforms. Therefore, we downloaded a freely available FFT library called FFTW (Frigo and Johnson, 1999). This library achieves good performance and is quite flexible. It supports multi-dimensional FFTs with arbitrary axis lengths, and a flag (FFTW_THREADSAFE) can be set during initialization to make it safe to share some precomputed data (e.g., the table of twiddle factors) between threads. We ran the benchmarks with both the native SGI FFTs and the FFTW library, and we show the results for both cases. It turns out that the FFTW library is somewhat slower than the native SGI library on the Power Challenge, but it still achieves respectable performance there. On the Origin 200 it is considerably slower than the native library, even on a single processor. This slowdown is more pronounced for out-of-cache problems than for in-cache ones.
To average out abnormal transient system behavior, the tests were run over many depth steps, for a total computation time of at least five minutes, even for the small problem running on several CPUs.
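One way to enforce such a minimum run time is to keep repeating depth steps until the elapsed wall-clock time exceeds a threshold, and then report the average time per step. The sketch below assumes this approach; the function name `time_per_step` and its parameters are hypothetical, and the tests reported here may instead have used a number of depth steps fixed in advance.

```python
import time


def time_per_step(step, min_seconds=300.0, min_steps=1, clock=time.perf_counter):
    """Call step() repeatedly for at least min_seconds of wall-clock time.

    Returns (average_seconds_per_step, number_of_steps). The clock argument
    can be replaced by any zero-argument function returning seconds.
    """
    start = clock()
    n_steps = 0
    while True:
        step()  # one depth-continuation step (supplied by the caller)
        n_steps += 1
        elapsed = clock() - start
        if n_steps >= min_steps and elapsed >= min_seconds:
            return elapsed / n_steps, n_steps
```

Averaging over the whole run, rather than timing individual steps, is what smooths out short-lived disturbances from other activity on the machine.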