Next: CONCLUSION Up: Kneib: Velocity Analysis on Previous: Programming on the CM


FD-schemes or simple matrix multiplications run very fast on the CM. For an implementation I chose Jon Claerbout's new pixel-precise velocity analysis program (Claerbout, 1990). The loop structure of the program ``Vspray'' is described below.

Besides calculating the RMS velocity from a seismogram, ``Vspray'' performs a smearing along the velocity axis. The use of four combined do-loops should make parallelization very effective, but accessing array elements through four different indices in the innermost loop should lead to extensive interprocessor communication. In my opinion, these features make ``Vspray'' an interesting choice for a test. ``Vspray'' is relatively costly on serial computers, because the computing time is roughly proportional to $N_t^2 N_x/2$.
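The loop structure described above can be sketched as follows. This is a hedged reconstruction in NumPy-style Python, not the actual CM Fortran code: the hyperbolic moveout formula, the array names (tx, vs), and the box smear along the slowness axis are all assumptions chosen to exhibit the four nested do-loops and the four-index innermost access.

```python
import numpy as np

def vspray_sketch(tx, dt, dx, ds, ns, nsmear=3):
    """Illustrative slowness-scan velocity analysis with smearing.

    tx: input seismogram of shape (nt, nx), time by offset.
    Returns a (nt, ns) tau-slowness panel. The moveout
    t = sqrt(tau^2 + (s*x)^2) and the box smear are assumptions,
    not Claerbout's actual ``Vspray'' code.
    """
    nt, nx = tx.shape
    vs = np.zeros((nt, ns))
    for it in range(nt):                          # loop over tau
        tau = it * dt
        for isl in range(ns):                     # loop over slowness
            s = isl * ds
            for ix in range(nx):                  # loop over offset
                x = ix * dx
                t = np.sqrt(tau**2 + (s * x) ** 2)
                jt = int(round(t / dt))
                if jt < nt:
                    # innermost loop: four different indices (it, isl, ix, jt)
                    # touch different array elements -- the communication
                    # hot spot on a distributed-memory machine
                    for k in range(-nsmear, nsmear + 1):
                        js = isl + k
                        if 0 <= js < ns:
                            vs[it, js] += tx[jt, ix]
    return vs
```

Note how the picked time index jt depends on tau, slowness, and offset together; on a serial machine this is merely an indexing cost, while on the CM each such access may cross processor boundaries.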
My first trial, ``Vspray1'', leaves the algorithm unchanged: it is simply translated to CM Fortran, with arrays used for the variables needed in the innermost loops. Via compiler directives, all arrays are allocated on the CM and are completely parallel, using the so-called NEWS geometry, which places subsequent array elements in nearest-neighbor processors. Only the offset dimension is serial. This exception means that each time slice is present in only one virtual processor; making the offsets parallel would increase the amount of interprocessor communication and slow down the program. Some table variables are aligned with the input tx() to save communication time. Aligned arrays have the same ``home''; i.e., they are assigned to the same virtual processor.

My second version, ``Vspray2'', reduces the communication considerably but adds some serialness. The input and output arrays are aligned with each other, and some table arrays are aligned to them as in the previous version. The offset and slowness dimensions are serial; i.e., they are stored within one virtual processor. Every time slice is copied to every processor's memory, so that for a given $\tau$ each slowness value can be computed independently within each processor. The number of virtual processors is therefore equal to the number of time samples of the input data.

My final version, ``Vspray3'', is probably not optimal, but it is faster than the second program. Even the offset dimension is now parallel. The amount of interprocessor communication is reduced not by making the program more serial, but by applying the following trick: after computing the slowness and smearing along the slowness axis for each $\tau$-slice, which is done in parallel in the time and offset domains, the whole seismogram is shifted along its $\tau$-axis. The result is a ``systolic'' data flow at the end of each loop over $\tau$. This avoids transmitting the time slices needed to compute the next $\tau$-slice and explains the performance improvement over the previous program.
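The effect of the systolic shift can be illustrated with a small NumPy sketch (an assumption-laden stand-in for the CM Fortran data movement, not the actual code). Instead of each $\tau$-step reaching out to an increasingly distant time sample, the whole array is shifted one sample after every step, so the data needed next is always at a fixed local index:

```python
import numpy as np

def systolic_scan(tx):
    """Illustrative systolic data flow along the tau-axis.

    tx: seismogram of shape (nt, nx). Each tau-step reads only the
    local slice at index 0, then the whole array is shifted by one
    sample, so no step needs a remote fetch. (np.roll wraps around,
    where the CM version would shift in the next data; this is an
    illustration of the access pattern only.)
    """
    nt, nx = tx.shape
    out = np.zeros(nt)
    work = tx.copy()
    for it in range(nt):                       # loop over tau
        out[it] = work[0, :].sum()             # always a *local* read
        work = np.roll(work, -1, axis=0)       # systolic shift along tau
    return out
```

On the CM, such a uniform nearest-neighbor shift maps onto the NEWS grid, which is far cheaper than general interprocessor communication.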

The performances of the three program versions differ considerably. The computing time is not a linear function of the length of the I/O arrays, as it approximately is on serial computers when swapping can be avoided. This is not surprising, given the nonlinear influence of the communication pattern. To provide a more objective comparison, a vectorized version, ``Vspray4'', was run on our Convex C1-XP. Computations made on that machine and on 8k processors of a 32k Connection Machine in Cambridge, Massachusetts, are compared in Table 1. The absolute CPU times should not be taken too seriously, because the actual run time depends partly on how many people are working on the front end.

Table 1. Performance Comparison

Parameters   Vspray1           Vspray2    Vspray3    Vspray4
nt=200       361 sec           86 sec     48 sec     9 sec       run time
nx=20        1                 1          1          1           x first-row run time
ns=20        40                10         5          1           x Convex run time
nt=1000      11541 sec         2199 sec   1479 sec   564 sec     run time
nx=64        32                26         31         63          x first-row run time
ns=64        20                4          2.6        1           x Convex run time
nt=2000      several hours     4118 sec   2930 sec   2237 sec    run time
nx=64        unknown           48         61         250         x first-row run time
ns=64        unknown           2          1.3        1           x Convex run time
nt=8000      several days(?)   8815 sec   7633 sec   18943 sec   run time
nx=32        unknown           102        159        2104        x first-row run time
ns=32        unknown           0.5        0.4        1           x Convex run time

(``x first-row run time'' is the run time relative to the same program's run time for the smallest data set; ``x Convex run time'' is the run time relative to Vspray4 on the Convex for the same data set.)
Nevertheless, the relative computing times are interesting. The table shows that the advantage of using the CM generally increases with the size of the task, but turns into a disadvantage if the data set is too small. Parallelization is effective only if the parallel array dimensions are at least as large as the number of physical processors.

It should be mentioned that, when processing many shotgathers (not only one, as I did), it is better to compute the slowness as a function of $\tau$, time, and offset in advance, store it in a 3-dimensional array, and shift it over the shotgathers. In that case the performance advantage of the CM increases rapidly.
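The idea of precomputing the table once and reusing it across shotgathers can be sketched as follows; this is a hedged NumPy illustration, with the moveout formula and all names (build_index_table, scan_gather, jt) being assumptions rather than the report's actual code:

```python
import numpy as np

def build_index_table(nt, nx, ns, dt, dx, ds):
    """Precompute, once, which input time sample feeds each
    (tau, slowness, offset) cell; assumed hyperbolic moveout."""
    tau = np.arange(nt) * dt
    s = np.arange(ns) * ds
    x = np.arange(nx) * dx
    t = np.sqrt(tau[:, None, None] ** 2
                + (s[None, :, None] * x[None, None, :]) ** 2)
    jt = np.rint(t / dt).astype(int)
    return np.clip(jt, 0, nt - 1)         # 3-D index table, shape (nt, ns, nx)

def scan_gather(tx, jt):
    """Apply the precomputed table to one shotgather of shape (nt, nx);
    returns a (nt, ns) tau-slowness panel."""
    nt, nx = tx.shape
    return tx[jt, np.arange(nx)].sum(axis=2)
```

Because the expensive index computation is paid once, each additional shotgather costs only the gather-and-sum step, which is exactly the kind of regular, data-parallel work the CM handles well.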
The huge discrepancies in computing time suggest that the programmer should think very carefully about the optimal solution before he or she implements an algorithm. This is much more important in parallel than in serial computing.

Stanford Exploration Project