The coordinate and boundary information is no longer needed, once the operator has been constructed and boundary conditions have been applied. All that remains is to solve the system of linear equations. This can be efficiently done by the CG method. The CM code for the iterations of the CG algorithm is almost a one-to-one match with the algorithm outline given in Equation 1. This is one of the more attractive features of CM-Fortran, there is none of the extra clutter of do loops to obscure the meaning of the code. Figure shows the conjugate gradient loop in its entirety.

DO WHILE( NORMRATIO > ERRLIM )

C Distribute the step information to all the nodes. CALL DISTRIBUTE( STEP, LOCSTEP, NNODEX, NNODEY )

C apply the operator CONJSTEP = 0. DO I =1,9 CONJSTEP = CONJSTEP + OPERATOR(I,:,:) * LOCSTEP(I,:,:) END DO

C calculate alpha ALPHA = PREVNORM / SUM( STEP * CONJSTEP )

C update soln SOLN = SOLN + ALPHA * STEP

C update residuals RESID = RESID - ALPHA * CONJSTEP

C calculate norm of residuals and beta RESIDNORM = SUM( RESID*RESID ) BETA = RESIDNORM/PREVNORM

C update the step STEP = RESID + BETA * STEP

C calculate relative error in residuals. NORMRATIO = RESIDNORM/RHSNORM

PREVNORM = RESIDNORM

END DO

CM-Fortran code for the CG loop

The most important point about this code is that there is only one distribution step per iteration. This is the minimum possible amount of communication for the CG algorithm. The only other interprocessor communication is in the ``SUM'' operation which is used to calculate the inner products.

The layout of the arrays and the coding of the distribution routine depend on the connectivity of the nodes. In the model problem the nodes can be mapped to a 2-D array and the distribution can be performed as a series of shift operations. Figure shows the CM code to define the array layout and Figure the routine ``DISTRIBUTE''.

REAL OPERATOR(9,NNODEX,NNODEY) REAL LOCSTEP(9,NNODEX,NNODEY) CMF$LAYOUT OPERATOR(:SERIAL,:NEWS,:NEWS) CMF$LAYOUT LOCSTEP(:SERIAL,:NEWS,:NEWS)

REAL SOLN(NNODEX,NNODEY) REAL STEP(NNODEX,NNODEY) REAL RESID(NNODEX,NNODEY) REAL CONJSTEP(NNODEX,NNODEY) CMF$LAYOUT SOLN(:NEWS,:NEWS) CMF$LAYOUT RESID(:NEWS,:NEWS) CMF$LAYOUT STEP(:NEWS,:NEWS) CMF$LAYOUT CONJSTEP(:NEWS,:NEWS)

Definition of the arrays for a regular 2-D problem. The operator and the work vector for step information are serial vectors of nine points on each processor. All the other arrays have one element per processor.

SUBROUTINE DISTRIBUTE( INPUT, OUTPUT, NNODEX, NNODEY )

C Gather information from all connected nodes. C This can be done with shift operations. C INTEGER NNODEX, NNODEY

REAL OUTPUT(9,NNODEX,NNODEY) CMF$LAYOUT OUTPUT(:SERIAL,:NEWS,:NEWS)

REAL INPUT(NNODEX,NNODEY) CMF$LAYOUT INPUT(:NEWS,:NEWS)

OUTPUT(5,:,:) = INPUT

OUTPUT(4,:,:) = CSHIFT( OUTPUT(5,:,:), DIM=1, SHIFT=-1 ) OUTPUT(1,:,:) = CSHIFT( OUTPUT(4,:,:), DIM=2, SHIFT=-1 ) OUTPUT(7,:,:) = CSHIFT( OUTPUT(4,:,:), DIM=2, SHIFT=1 )

OUTPUT(6,:,:) = CSHIFT( OUTPUT(5,:,:), DIM=1, SHIFT=1 ) OUTPUT(3,:,:) = CSHIFT( OUTPUT(6,:,:), DIM=2, SHIFT=-1 ) OUTPUT(9,:,:) = CSHIFT( OUTPUT(6,:,:), DIM=2, SHIFT=1 )

OUTPUT(2,:,:) = CSHIFT( OUTPUT(5,:,:), DIM=2, SHIFT=-1 ) OUTPUT(8,:,:) = CSHIFT( OUTPUT(5,:,:), DIM=2, SHIFT=1 )

RETURN END

Routine to distribute step information to all processors for a 2-D mesh.

Table 7 shows the timings for the model problem. The first set of figures is for the code given in Figure . The second set of figures is for a version of the code in which the distribution and application steps were replaced by a call to a routine compiled with the stencil compiler. This is a special purpose compiler for compiling algorithms that can be coded as a stencil operation. Biondo Biondi used a pre-release version of this compiler to create the routine that I used.

There are several points to note:

- The cost of the construction step is negligible compared to the solution step.
- It is very important to make the array dimensions a power of two when using the ``CSHIFT'' operator. The first case ()is a worst case example with each dimension being a power of two plus one.
- As the problem size increases, the number of iterations required to converge also increases. Theory predicts that the number of iterations is . The observations are a good match to the theory.
- The stencil compiler is much more efficient than the normal Fortran compiler for stencil operations. The routine produced by the stencil compiler was separately tested and found to run at 500MFlops. I believe that the part of the algorithm that controls the speed of the solution is now the ``SUM'' operation.

12/18/1997