Performance

Next: IMPROVEMENTS Up: FIRST ATTEMPT Previous: FIRST ATTEMPT

Performance

The following tables describe the performance of the algorithm on problems of varying sizes. The elapsed and Connection Machine time and the speed (in Megaflops/sec) are listed for the loop over depths, for each of the key components of that loop, and for the program as a whole (time only). The speeds are computed by counting the number of floating point operations performed in each part of the code, then dividing by the time taken. Probably the most useful number is the speed of the loop, which is printed in boldface. Also shown is the cpu time on our Convex C-1 for the equivalent well-vectorized problem. NX is the number of points on the x axis, NW is the number of frequencies.

4\|c\| NX=64 NW=64
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	1.34	1.34	4.3
Tridiagonal solver	0.278	0.278	80
Lens term	0.052	0.052	30
Imaging	0.94	0.94	0.6
Loop total	3.29	3.29	9
Program total	6.23	5.01
Convex cpu time (sec)	3\|c\|8.2

4\|c\| NX=128 NW=128
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	5.19	5.12	9
Tridiagonal solver	1.44	1.11	162
Lens term	0.206	0.206	61
Imaging	3.98	3.98	1.1
Loop total	13.4	12.7	19
Program total	23	19
Convex cpu time (sec)	3\|c\|52

4\|c\| NX=256 NW=256
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	18.6	18.6	20
Tridiagonal solver	4.45	4.45	326
Lens term	0.819	0.819	123
Imaging	13.7	13.7	2.5
Loop total	47.4	47.0	42
Program total	71.5	69.6
Convex cpu time (sec)	3\|c\|504 (8:24)

4\|c\| NX=128 NW=1024
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	63.9	63.9	46
Tridiagonal solver	18.0	17.8	648
Lens term	3.27	3.27	246
Imaging	42.4	42.3	6.3
Loop total	158	158	98
Program total	239	235
Convex cpu time (sec)	3\|c\|3284 (54:45)

Not surprisingly, the program performs best when the number of frequencies is large, and we are able to keep all the processors busy. For the 1024x128 problem, the CM is roughly 14 time faster than the Convex.

A (:NEWS,:NEWS) version of the program is marginally faster on smaller problems because it uses more of the available processors. It is slower on the 1024x128 problem because it requires a lot of inter-processor communication that the (:SERIAL,:NEWS) version avoids. For a problem such as this, where I know that most inter-processor communication takes place along one axis, Fortran 90 allows me to ``weight'' the axes based on the amount of communication expected along them. Giving a weight of five to the x axis gives speeds of 18, 32, 45, and 74 Mflops/sec for the depth loops on the four problems presented above.

The tridiagonal solver performs quite well. The poorest performer is imaging. This is not too surprising, since imaging is the one place in the code where the frequencies are not dealt with independently. Inter-processor communication must take place in the imaging step, in order to sum the data from different frequencies together. This communication is the cause of the poor performance.

The speed of 98 Mflops/sec on the 1024x128 problem, though 14 times faster than the equivalent Convex program, was somewhat less than I had hoped to obtain. I tried various ways to speed up the program. These modifications are described in the following section.

Next: IMPROVEMENTS Up: FIRST ATTEMPT Previous: FIRST ATTEMPT

Stanford Exploration Project
12/18/1997