Next: Performance Up: Cole: Porting a simple Previous: Introduction

# FIRST ATTEMPT

Because the program is small, it took very little time to implement a first version on the Connection Machine. Here is a piece of my first program, equivalent to the two nested loops over frequency and depth in the program of Figure :
```  DO IZ=1,NZ
A = SPREAD( -AA, DIM = 1, NCOPIES = NX )
B = SPREAD( 1.+2.*AA, DIM = 1, NCOPIES = NX )
C = SPREAD( -AA, DIM = 1, NCOPIES = NX )
C Compute right-hand side
FORALL(J=1:NW) CD(1,J) = AA(J)*CP(2,J) + (1.-2.*AA(J))*CP(1,J)
FORALL(J=1:NW) CD(NX,J) = AA(J)*CP(NX-1,J) + (1.-2.*AA(J))*CP(NX,J)
FORALL(I=2:NX-1,J=1:NW) CD(I,J) = AA(J)*CP(I+1,J) +
&   (1.-2.*AA(J))*CP(I,J) + AA(J)*CP(I-1,J)
C Tridiagonal solver
CALL GEN_TRIDIAG_SOLVE( CD,A,B,C,ONE,TOL,IER)
C Lens term
CP = CD * SHIFTS
C Imaging
P(:,IZ) = SUM(REAL(CP), DIM = 2)
ENDDO
```

Both programs have the four key ingredients of a 2-D implicit finite-difference migration: computing the right-hand side, solving the tridiagonal system, applying the thin lens term, and imaging.

A major difference between the two programs is that on the CM, there is no loop over frequencies. All frequencies are solved in parallel. Each CM processor sets up and solves the tridiagonal system of equations for one frequency. This mapping of the problem onto the processors of the machine is attractive because three of the four components in the algorithm (all except imaging) will then require no inter-processor communication, which should be avoided where possible on the Connection Machine. Figure  illustrates this mapping of the problem onto the hardware of the Connection Machine.

ordering
Figure 1
One mapping of the migration problem onto the Connection Machine. The downward-continued data for a given frequency and all x are stored in a single processor. Computing the right-hand side of the tridiagonal system, solving, and applying the thin lens term are fast - no inter-processor communication is required. Imaging requires a sum over all frequencies and is slower.

For best performance, we need to have enough frequencies to keep all processors busy. On a 4K (single bit) processor machine, there are 128 floating point processors. For hardware reasons, we need to have at least 4 times 128 or 512 frequencies to achieve best performance.

An alternative mapping of the problem is possible. We could make both the frequency and x axes parallel. This is a good thing to do for small problems, where we do not have enough frequencies to keep all processors busy with a (parallel, serial) ordering. A drawback to a (parallel, parallel) ordering is that all parts of the program will then require inter-processor communication. So if there are enough frequencies, it seems that the (parallel, serial) ordering is the way to go.

The mapping is specified by LAYOUT compiler directives. Here are the array declarations for the Fortran 90 code fragment shown above:

```REAL P( NX,NZ )
COMPLEX CP( NX,NW )
COMPLEX A( NX,NW )
COMPLEX B( NX,NW )
COMPLEX C( NX,NW )
COMPLEX CD( NX,NW )
COMPLEX AA( NW )
COMPLEX SHIFTS( NX,NW )
```

and here are the associated layout directives:

```CMF\$LAYOUT P(:SERIAL,:NEWS)
CMF\$LAYOUT CP(:SERIAL,:NEWS)
CMF\$LAYOUT A(:SERIAL,:NEWS)
CMF\$LAYOUT B(:SERIAL,:NEWS)
CMF\$LAYOUT C(:SERIAL,:NEWS)
CMF\$LAYOUT CD(:SERIAL,:NEWS)
CMF\$LAYOUT SHIFTS(:SERIAL,:NEWS)
CMF\$LAYOUT AA(:NEWS)
```

Changing all instances of SERIAL in the layout directives to NEWS would effect a (parallel, parallel) ordering.

The tridiagonal solver routine `GEN_TRIDIAG_SOLVE` solves (in this case) NW tridiagonal systems, each of length NX, in parallel. We supply NW copies of the three diagonal matrices. The SPREAD intrinsic function creates the necessary copies. The routine `GEN_TRIDIAG_SOLVE` is a parallel tridiagonal solver from the Connection Machine Scientific Software Library (CMSSL). On the Convex, I did not use the routine ctris shown in the Ratfor code example, but rather a well-vectorized tridiagonal solver from the Convex VECLIB vector processing library. This insures as fair a comparison between the two machines as possible.

Next: Performance Up: Cole: Porting a simple Previous: Introduction
Stanford Exploration Project
12/18/1997