IMPROVEMENTS

In this section, I describe the modifications that I made, and show the improvement obtained by each. I show the timings at each step for the 128x1024 problem.

First, notice in the original CM code that I re-compute the diagonal matrices for the tridiagonal solver at each depth step. The reason I did this was because the solver overwrites these matrices. A less costly alternative to re-computing is to save a copy of the matrices and restore from this saved copy each time. These copies are the matrices ACOPY, BCOPY, and CCOPY in the new program. Here is the performance after this modification:

4\|c\|NX=128 NW=1024
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	63.9	63.9	46
Tridiagonal solver	17.8	17.8	648
Lens term	3.27	3.27	246
Imaging	42.3	42.3	6.4
Loop total	132	132	117
Program total	212	209

The improvement is all in the ``Loop total'', since the computation of the diagonal matrices was not included in any of the parts of the loop timed individually.

The computation of the right-hand side in the old version, using FORALL statements, was not particularly efficient or easy to read. Fortran 90 provides functions CSHIFT and EOSHIFT to shift an array along a particular axis. CSHIFT performs a circular shift (with wraparound) while EOSHIFT does an end-off shift. EOSHIFT is what we want. It gives, in effect, zero-value boundary conditions. After trying it I noticed the following improvement:

4\|c\|NX=128 NW=1024
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	32.1	32.1	92
Tridiagonal solver	17.8	17.8	648
Lens term	3.27	3.27	246
Imaging	42.3	42.3	6.4
Loop total	101	101	154
Program total	180	177

Then, when Biondo Biondi suggested that CSHIFT is faster than EOSHIFT in the current CM compiler, I used it instead and noticed an even larger improvement:

4\|c\|NX=128 NW=1024
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	10.8	10.8	274
Tridiagonal solver	17.8	17.8	648
Lens term	3.27	3.27	246
Imaging	42.2	42.2	6.4
Loop total	79.3	79.3	196
Program total	158	156

Using CSHIFT gives circular boundary conditions. But I eliminated the unwanted wraparound in the shift by placing zeroes at the ends of the RA and RC right-hand side tridiagonal coefficient matrices, so the effect is the same as using EOSHIFT.

It may be that this code is still not optimal. Generally speaking EOSHIFT and CSHIFT operations should be used only on parallel axes. When used in the serial direction, as they are here, they may not be as efficient as a plain Fortran loop. For the problems tried thus far, however, I have found this method to be the fastest.

Finally the speed of imaging was improved by two more modifications suggested by Biondo. First, I had the imaging step always sum into the z=1 level of the output image rather than summing into the correct place. The fact that the destination of the sum is a constant location helps the compiler to optimize inter-processor communication. I then perform a CSHIFT to cycle the depth levels around to make sure that the z=1 level is free for the next imaging step. This operation is summarized in Figure

Along with this modification, I changed the size of one of the axes of the output image from the number of depths to the number of frequencies. This change forces the compiler to align the output image with the other arrays whose size depends on the number of frequencies (most of the arrays in the program). The resulting performance:

4\|c\|NX=128 NW=1024
	Elapsed Time (sec)	CM Time (sec)	CM Mflops/sec
Compute RHS	10.8	10.8	274
Tridiagonal solver	17.8	17.8	648
Lens term	3.28	3.28	246
Imaging	13.0	13.0	21
Loop total	50	50	311
Program total	129	127

The code is now running at more than 300 Mflops/sec, and is roughly 25 times faster than a well-vectorized Convex version. The imaging step is by far the slowest, and accounts for more than one fourth of the time spent in the depth loop.

With these fixes, there is now essentially no difference, for large problems, between the speed of the (:SERIAL,:NEWS) version discussed here and a (:NEWS,:NEWS) version where the axes are weighted to favor communication along the x axis (with a weighting factor of five). With this weight, the (:NEWS,:NEWS) achieves the same 311 Mflops/sec on the 1024x128 problem. On the 256x256 problem (as well as smaller problems) it is slightly faster. It runs at 158 Mflops/sec for the 256x256 problem, compared to 136 Mflops/sec for the (:SERIAL,:NEWS) version. On the smallest problem studied here, 64x64, the (:NEWS,:NEWS) version runs at 52 Mflops/sec, compared with 33 Mflops/sec for the (:SERIAL,:NEWS) program.

It may seem puzzling at first that the performance is poorer for smaller problems when using a (:NEWS,:NEWS) scheme. For a 64x64 problem, the product of NX and NW is 4096, far more than the number of available processors. Surely all the processors are being used, unlike with a (:SERIAL,:NEWS) ordering, where only 64 processors would be used. If so, why then are the speeds so similar? The answer is that even though all processors are being used in the (:NEWS,:NEWS) case, much time is being spent on inter-processor communication. In the 256x256 case, since there are 128 floating point processors, each one of these will contains 512 data points. Because of the axis weighting, these will likely be adjacent points in x for some particular frequency. The effect is much like a (:SERIAL,:NEWS) ordering, so the performance is similar. With a 64x64 problem, there are only 32 points per physical processor. The smaller problem is ``spread out'' over all the processors. This means that, compared to the larger problem, a greater fraction of the time is spent on inter-processor communication. Thus the resulting speed is slower, as is the speed we would get if we packed all the points into fewer processors in a (:SERIAL,:NEWS) ordering. For small problems, you just can't win.

My conclusion is that for this problem, a (:NEWS,:NEWS) specification with the proper weight on the x axis is the best general layout. It performs as well as a (:SERIAL,:NEWS) ordering for large problems, and slightly better for smaller problems.

My emphasis on the problem with 1024 frequencies and only 128 points in x may seem inappropriate. I would guess that a typical migration has as many or more points in x than $\omega$ .My reasons for choosing it had to do with memory limitations I have experienced on the CM, some of which are still mysterious. I have found myself limited to a total problem size of no more than 1024x128 or 128K points (though in theory the current program should have room for around 1024x1024 points on our CM). Within this limitation, the best speeds are obtained if we have more points along the frequency axis, since it entails less inter-processor communication. I believe that, given enough memory, speeds such as what I reported should be obtainable for 1024 frequencies and x axes in excess of 1024 points. The key is just the available memory. So I chose a somewhat skewed example in order to report what I think are speeds that can realistically be obtained on more reasonably sized problems, if only the memory is there.