
TIMINGS ON THE SEP CONNECTION MACHINE

I implemented the algorithms presented in the previous sections on the SEP Connection Machine. The SEP CM has 8K serial bit processors, divided into two sections of 4K each (the largest CMs have 64K processors). For every 32 serial bit processors there is a Floating Point Unit (FPU), for a total of 256 FPUs (TMC, 1991). The most recent version of the CM Fortran compiler (CMF 1.0) generates code that is executed directly by the FPUs; to the Fortran programmer, therefore, the SEP CM is a 256-processor computer with 256 Kbytes of local memory per processor.

I timed the kernels of the algorithms on a 4K-processor (128 FPU) section of the machine and estimated the flop rates; all the algorithms presented run efficiently on the CM. When evaluating the performance of the tested programs, keep in mind that they are just the kernels of simplified algorithms. They do not include I/O times (SEP has no DataVault, the CM's disk system) or other ``real life'' overheads. On the other hand, the CM at SEP has little memory; each processor has only 1/16th of the maximum possible memory (4 Mbytes per FPU). The small processor memory negatively affects the speed of the algorithms that require nearest-neighbor communications (CSHIFT): the smaller the processors' memory, the larger the relative communication overhead, because the ratio between the amount of data actually exchanged through the communication network and the amount of data that stays local in memory becomes less favorable.
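
As an illustration, the following is a minimal sketch of CSHIFT, the Fortran 90 circular-shift intrinsic that the CM Fortran compiler maps onto nearest-neighbor communication. The array size and the final check are illustrative, not part of the timed kernels; on the CM only the elements on each processor's local-block boundary cross the network, which is why smaller local blocks mean a larger communicated fraction.

    program cshift_demo
      implicit none
      real :: u(512,512), up(512,512)       ! illustrative array size
      call random_number(u)
      ! shift by one along dimension 1: element (i,j) receives u(i+1,j);
      ! only boundary elements of each local block travel over the network
      up = cshift(u, shift=1, dim=1)
      print *, up(1,1) == u(2,1)            ! prints T
    end program cshift_demo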

I ran the Kirchhoff migration program on a constant-offset section with 2048 traces and 1024 time samples per trace. I used a six-point interpolator and a migration operator 200 traces wide in space. The total run time was 20.1 seconds, corresponding to 245 Mflop/s (1 Mflop/s = one million floating-point operations per second). Theoretically the speed of the same program should scale up to 3,920 Mflop/s on a 64K CM, but I have not yet tested this.
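
For concreteness, here is a hedged sketch of a constant-offset Kirchhoff summation kernel of this general shape. It assumes constant velocity, a zero-offset (exploding-reflector) traveltime, and nearest-sample interpolation in place of the six-point interpolator of the timed program; all names, samplings, and the half-aperture na are illustrative.

    program kirchhoff_sketch
      implicit none
      integer, parameter :: nt = 1024, nx = 2048, na = 100     ! na: half-aperture in traces
      real, parameter    :: dt = 0.004, dx = 12.5, v = 2000.0  ! illustrative sampling and velocity
      real    :: indat(nt,nx), image(nt,nx)
      real    :: t, h
      integer :: ih, ix, it, jt
      call random_number(indat)       ! stand-in for the input constant-offset section
      image = 0.0
      do ih = -na, na                 ! loop over the migration operator aperture
         h = ih*dx
         do ix = max(1,1-ih), min(nx,nx-ih)
            do it = 1, nt             ! on the CM the inner loops run data-parallel
               t  = it*dt             ! vertical (migrated) two-way time
               jt = nint(sqrt(t*t + (2.0*h/v)**2)/dt)   ! diffraction traveltime in samples
               if (jt <= nt) image(it,ix) = image(it,ix) + indat(jt,ix+ih)
            end do
         end do
      end do
      print *, 'image sum =', sum(image)
    end program kirchhoff_sketch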

The finite-difference program takes .022 seconds for each time step when the dimensions of the computational domain are 512*512; this time corresponds to about 140 Mflop/s. For the Connection Machine there is also an experimental Fortran compiler, called the ``Stencil Compiler'', that generates faster code for convolutions of the kind required by finite differences. When I compiled the five-point differencing star with this new compiler, each time step took only .0084 seconds, equivalent to a computational rate of 373 Mflop/s (5,968 Mflop/s on a 64K CM).
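
A minimal sketch of one time step with the five-point star, written with CSHIFT in the data-parallel style that CM Fortran favors, is shown below. The boundary treatment is periodic only for brevity, and the grid, velocity, and time-step values are assumptions for illustration, not those of the timed run.

    program fd_step
      implicit none
      integer, parameter   :: n = 512
      real, parameter      :: dt = 0.001, dx = 10.0, v = 2000.0  ! illustrative values
      real, dimension(n,n) :: p, pold, pnew, lap
      real    :: c
      integer :: it
      c = (v*dt/dx)**2                 ! squared Courant number (0.04 here: stable)
      p = 0.0; pold = 0.0
      p(n/2,n/2) = 1.0                 ! point source as a stand-in initial condition
      do it = 1, 10                    ! a few steps, for illustration
         ! five-point Laplacian star: four CSHIFTs, i.e., four
         ! nearest-neighbor communications per step (periodic edges here)
         lap  = cshift(p,1,1) + cshift(p,-1,1) + cshift(p,1,2) + cshift(p,-1,2) - 4.0*p
         pnew = 2.0*p - pold + c*lap   ! second-order-in-time wave-equation update
         pold = p
         p    = pnew
      end do
      print *, 'final max amplitude =', maxval(abs(p))
    end program fd_step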

Finally, I timed the phase-shift migration. The data cube had 256 time samples, 256 shot locations, and 64 offsets. To perform the FFTs I used the complex-to-complex routine from the Connection Machine Scientific Software Library (CMSSL). The transformation of a 128*256*64 data cube took 1.7 seconds on the SEP CM, corresponding to a flop rate of 129.5 Mflop/s. The downward continuation and imaging, not including the FFTs, took 47.9 seconds. In this case, the methodology used for converting the timings to flop rates must be explained more clearly, because the inner loop of the phase-shift migration includes square roots and trigonometric functions. The relative speed of these functions is hardware dependent; on the CM FPU, a square root takes about 8.4 times the cycles of a floating-point operation (the square root is computed in hardware). The trigonometric functions, which are computed in software, are equivalent to about 46 floating-point operations each. Using these conversion factors, the phase-shift migration runs at a flop rate of 330 Mflop/s (equivalent to 5,280 Mflop/s on a 64K CM).
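
The sketch below shows where those square roots and trigonometric functions arise: one Gazdag-style downward-continuation step in the (frequency, wavenumber) domain, reduced to 2-D and constant velocity for clarity. The sizes, samplings, and the uniform input wavefield are assumptions for illustration, not those of the timed program.

    program phase_shift_step
      implicit none
      integer, parameter :: nw = 128, nkx = 256          ! frequencies, wavenumbers
      real, parameter    :: v = 2000.0, dz = 10.0        ! velocity (m/s), depth step (m)
      real, parameter    :: dw = 6.28, dkx = 0.002       ! illustrative samplings (rad/s, rad/m)
      complex :: p(nw,nkx)             ! wavefield in (frequency, wavenumber)
      real    :: w, kx, arg, phase
      integer :: iw, ikx
      p = (1.0,0.0)                    ! stand-in for the Fourier-transformed data
      do ikx = 1, nkx
         kx = (ikx-1)*dkx
         do iw = 1, nw                 ! on the CM this runs data-parallel over the cube
            w   = iw*dw
            arg = (w/v)**2 - kx**2     ! squared vertical wavenumber
            if (arg > 0.0) then        ! propagating region: one sqrt, two trig calls
               phase = dz*sqrt(arg)
               p(iw,ikx) = p(iw,ikx)*cmplx(cos(phase), sin(phase))
            else
               p(iw,ikx) = (0.0,0.0)   ! evanescent energy is zeroed
            end if
         end do
      end do
      print *, 'continued sample p(1,1) =', p(1,1)
    end program phase_shift_step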

