4|c| NX=64 NW=64 | |||
Elapsed Time (sec) | CM Time (sec) | CM Mflops/sec | |
Compute RHS | 1.34 | 1.34 | 4.3 |
Tridiagonal solver | 0.278 | 0.278 | 80 |
Lens term | 0.052 | 0.052 | 30 |
Imaging | 0.94 | 0.94 | 0.6 |
Loop total | 3.29 | 3.29 | 9 |
Program total | 6.23 | 5.01 | |
Convex cpu time (sec) | 3|c|8.2 |
4|c| NX=128 NW=128 | |||
Elapsed Time (sec) | CM Time (sec) | CM Mflops/sec | |
Compute RHS | 5.19 | 5.12 | 9 |
Tridiagonal solver | 1.44 | 1.11 | 162 |
Lens term | 0.206 | 0.206 | 61 |
Imaging | 3.98 | 3.98 | 1.1 |
Loop total | 13.4 | 12.7 | 19 |
Program total | 23 | 19 | |
Convex cpu time (sec) | 3|c|52 |
4|c| NX=256 NW=256 | |||
Elapsed Time (sec) | CM Time (sec) | CM Mflops/sec | |
Compute RHS | 18.6 | 18.6 | 20 |
Tridiagonal solver | 4.45 | 4.45 | 326 |
Lens term | 0.819 | 0.819 | 123 |
Imaging | 13.7 | 13.7 | 2.5 |
Loop total | 47.4 | 47.0 | 42 |
Program total | 71.5 | 69.6 | |
Convex cpu time (sec) | 3|c|504 (8:24) |
4|c| NX=128 NW=1024 | |||
Elapsed Time (sec) | CM Time (sec) | CM Mflops/sec | |
Compute RHS | 63.9 | 63.9 | 46 |
Tridiagonal solver | 18.0 | 17.8 | 648 |
Lens term | 3.27 | 3.27 | 246 |
Imaging | 42.4 | 42.3 | 6.3 |
Loop total | 158 | 158 | 98 |
Program total | 239 | 235 | |
Convex cpu time (sec) | 3|c|3284 (54:45) |
Not surprisingly, the program performs best when the number of frequencies is large, and we are able to keep all the processors busy. For the 1024x128 problem, the CM is roughly 14 time faster than the Convex.
A (:NEWS,:NEWS) version of the program is marginally faster on smaller problems because it uses more of the available processors. It is slower on the 1024x128 problem because it requires a lot of inter-processor communication that the (:SERIAL,:NEWS) version avoids. For a problem such as this, where I know that most inter-processor communication takes place along one axis, Fortran 90 allows me to ``weight'' the axes based on the amount of communication expected along them. Giving a weight of five to the x axis gives speeds of 18, 32, 45, and 74 Mflops/sec for the depth loops on the four problems presented above.
The tridiagonal solver performs quite well. The poorest performer is imaging. This is not too surprising, since imaging is the one place in the code where the frequencies are not dealt with independently. Inter-processor communication must take place in the imaging step, in order to sum the data from different frequencies together. This communication is the cause of the poor performance.
The speed of 98 Mflops/sec on the 1024x128 problem, though 14 times faster than the equivalent Convex program, was somewhat less than I had hoped to obtain. I tried various ways to speed up the program. These modifications are described in the following section.