The following tables describe the performance of the algorithm on problems of varying sizes. The elapsed time, the Connection Machine time, and the speed (in Mflops/sec) are listed for the loop over depths, for each of the key components of that loop, and for the program as a whole (time only). The speeds are computed by counting the number of floating-point operations performed in each part of the code, then dividing by the time taken. Probably the most useful number is the speed of the loop (the Loop total row). Also shown is the cpu time on our Convex C-1 for the equivalent well-vectorized problem. NX is the number of points on the x axis; NW is the number of frequencies.
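The bookkeeping behind the speed columns is just a division of the flop count by the time. A minimal sketch (the counts below are made-up illustrations, not the program's actual flop counts):

```python
def mflops(flop_count, seconds):
    """Speed in Mflops/sec: floating-point operation count divided by time."""
    return flop_count / seconds / 1.0e6

# Hypothetical piece of code: 1.0e9 floating-point operations
# performed in 2.0 seconds of CM time.
print(mflops(1.0e9, 2.0))  # 500.0
```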

NX=64  NW=64

                        Elapsed Time (sec)   CM Time (sec)   CM Mflops/sec
Compute RHS                   1.34                1.34             4.3
Tridiagonal solver            0.278               0.278           80
Lens term                     0.052               0.052           30
Imaging                       0.94                0.94             0.6
Loop total                    3.29                3.29             9
Program total                 6.23                5.01
Convex cpu time (sec)         8.2

NX=128  NW=128

                        Elapsed Time (sec)   CM Time (sec)   CM Mflops/sec
Compute RHS                   5.19                5.12             9
Tridiagonal solver            1.44                1.11           162
Lens term                     0.206               0.206           61
Imaging                       3.98                3.98             1.1
Loop total                   13.4                12.7             19
Program total                23                  19
Convex cpu time (sec)        52

NX=256  NW=256

                        Elapsed Time (sec)   CM Time (sec)   CM Mflops/sec
Compute RHS                  18.6                18.6             20
Tridiagonal solver            4.45                4.45           326
Lens term                     0.819               0.819          123
Imaging                      13.7                13.7              2.5
Loop total                   47.4                47.0             42
Program total                71.5                69.6
Convex cpu time (sec)       504 (8:24)

NX=128  NW=1024

                        Elapsed Time (sec)   CM Time (sec)   CM Mflops/sec
Compute RHS                  63.9                63.9             46
Tridiagonal solver           18.0                17.8            648
Lens term                     3.27                3.27           246
Imaging                      42.4                42.3              6.3
Loop total                  158                 158               98
Program total               239                 235
Convex cpu time (sec)      3284 (54:45)

Not surprisingly, the program performs best when the number of frequencies is large, so that all the processors can be kept busy. For the 1024x128 problem, the CM is roughly 14 times faster than the Convex.
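That factor can be checked directly against the last table; a quick sketch:

```python
# Speedup of the CM over the Convex on the 1024x128 problem,
# using the times from the table above: Convex cpu time
# versus the CM time for the whole program.
convex_cpu = 3284.0  # seconds (54:45)
cm_time = 235.0      # seconds, program total CM time
speedup = convex_cpu / cm_time
print(round(speedup))  # 14
```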

A (:NEWS,:NEWS) version of the program is marginally faster on smaller problems because it uses more of the available processors. It is slower on the 1024x128 problem because it requires a lot of inter-processor communication that the (:SERIAL,:NEWS) version avoids. For a problem such as this, where I know that most inter-processor communication takes place along one axis, Fortran 90 allows me to ``weight'' the axes based on the amount of communication expected along them. Giving a weight of five to the x axis gives speeds of 18, 32, 45, and 74 Mflops/sec for the depth loops on the four problems presented above.

The tridiagonal solver performs quite well. The poorest performer is imaging. This is not surprising: imaging is the one place in the code where the frequencies are not handled independently, since the data from the different frequencies must be summed together. The inter-processor communication that this sum requires is the cause of the poor performance.
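The distinction can be mimicked in serial code. Below is a minimal NumPy sketch (the array names and the serial Thomas sweep are illustrative only, not the program's actual solver): per-frequency steps such as the tridiagonal solve touch each frequency column independently, while imaging is a reduction across the frequency axis.

```python
import numpy as np

def solve_tridiagonal(a, b, c, d):
    """Thomas algorithm applied to NW independent systems at once.
    a, b, c: sub-, main-, and super-diagonals, shape (NX, NW); d: RHS.
    Each frequency (column) is solved independently, so no data
    moves along the frequency axis."""
    nx = b.shape[0]
    cp = np.empty_like(c)
    dp = np.empty_like(d)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, nx):            # forward elimination, serial in x
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty_like(d)
    x[-1] = dp[-1]
    for i in range(nx - 2, -1, -1):   # back substitution, serial in x
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def image(wavefield):
    """Imaging: sum the data over the frequency axis (axis 1).
    This is the one step that combines different frequencies."""
    return wavefield.sum(axis=1)
```

With the frequencies spread across processors, the column-independent solve needs no inter-processor communication, while the axis-1 sum in `image` maps to an inter-processor reduction.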

The speed of 98 Mflops/sec on the 1024x128 problem, though 14 times faster than the equivalent Convex program, was somewhat less than I had hoped to obtain. I tried various ways to speed up the program. These modifications are described in the following section.

Stanford Exploration Project