I timed the kernels of the algorithms on a 4K processors (128 FPUs)
section of the machine and estimated the flop rates;
all the algorithms presented run efficiently on the CM.
When evaluating the performances of the tested programs it is
advisable to take into account that they are just the kernels of
simplified algorithms. They do not include
I/O times (SEP has no DataVault, the CM's disk system)
and other ``real life'' overheads.
On the other hand, the CM at SEP has small memory;
each processor has only 1/16th of the maximum possible
memory (4 Mbytes per FPU).
The small size of the processors' memory affects negatively
the speed of the algorithms that require
nearest neighbor communications (** CSHIFT**).
The smaller the processors' memory is,
the larger are the relative overheads of communications,
because it becomes less favorable the ratio between the amount
of data that is actually exchanged through
the communication network
and the amount of data that stays local in memory.

I run the Kirchhoff migration program on a constant offset section with 2048 traces and 1024 time samples per trace. I used a six point long interpolator, and a migration operator 200 traces wide in space. The total run time was 20.1 seconds, corresponding to 245 Mflop/s (Mflop = million of floating point instruction). Theoretically the speed of the same program on a 64K CM should scale up to 3.920 Mflop/s, but I did not tested it yet.

The finite-difference program takes .022 seconds for a each time step, when the dimensions of the computational domain are 512*512, this time corresponds to about 140 Mflop/s. For the Connection Machine there is also an experimental Fortran compiler, called the ``Stencil Compiler'', that generates faster code to perform convolutions of the kind required by finite-differences. When I compiled the five point differencing star with this new compiler each time steps took only .0084 seconds, equivalent to a computational rate of 373 Mflop/s (5.968 Mflop/s on 64K CM).

Finally, I timed the phase-shift migration. The data cube had 256 time samples, 256 shot locations, and 64 offsets. To perform the FFTs I used the complex-to-complex routine belonging to the Connection Machine Scientific Software Library (CMSSL). The transformation of a 128*256*64 data cube took 1.7 seconds on the SEP CM, correspondent to a flop rate of 129.5 Mflop/s. The downward continuation and imaging, not including the FFTs, took 47.9 seconds. In this case, the methodology used for converting the timing measures to flop rates must be more clearly explained because the inner loop of the phase shift migration includes square roots and trigonometric functions. The relative speed of these functions is hardware dependent; on the CM FPU the cycles taken to perform a square root are about about 8.4 times the cycles taken for a floating point operation (hardware square root). The trigonometric functions, that are computed in software, are equivalent to about 46 floating point operations. Using this conversion factors, the phase-shift migration runs at a flops rate of 330 Mflop/s (equivalent to 5.280 Mflop/s on a 64 CM).

12/18/1997