The first issue to address when implementing algorithms on a massively parallel computer is minimizing the cost of interprocessor communication. Massively parallel computers distribute memory among many processors, which are linked by a communication network of varying topology (hypercube, 2-D mesh, etc.). Because computations can be performed only when all the operands are local to a processor, interprocessor communication is needed whenever the required data do not reside in a single processor's memory. If the ratio of time spent performing floating-point operations to time spent moving data among processors is too low, the performance of a parallel computer can be disappointing. To minimize the time spent in communication, the user can choose an ``optimal'' mapping of the data into the processors' memories. The implementation of Fortran 90 on the Connection Machine lets the user specify the data layout with a simple compiler directive. The figure shows an example of how seismic data can be mapped into processor memory. The layout shown is the one used for the Kirchhoff migration algorithm described in the final section of this paper: the traces in a common-offset section are stored locally in the memory of each processor, while the midpoint direction is the parallel axis, spread across processors. With this data mapping, the generalized moveout operation in the innermost loop of Kirchhoff migration can be performed without interprocessor communication, and the algorithm performs very well.
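The layout idea can be sketched in NumPy, standing in for a distributed Fortran 90 array (the array sizes and moveout shifts below are illustrative assumptions, not values from the migration algorithm):

```python
import numpy as np

# Hypothetical common-offset section: axis 0 is the midpoint direction
# (the parallel axis, spread across processors), axis 1 is the time axis
# of each trace (stored locally in one processor's memory).
n_mid, n_t = 6, 8
section = np.arange(n_mid * n_t, dtype=float).reshape(n_mid, n_t)

# Illustrative per-midpoint moveout shifts (a stand-in for a real
# traveltime computation).
shift = np.arange(n_mid) % 3

# Generalized moveout: each trace is shifted only along its own local
# time axis, so no data cross the midpoint axis -- on the distributed
# layout this inner loop would need no inter-processor communication.
moved = np.empty_like(section)
for m in range(n_mid):
    moved[m] = np.roll(section[m], shift[m])
```

Because every operation touches a single row, the same computation runs independently on every processor once the traces are laid out this way.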
Most algorithms need interprocessor communication even when the data have been mapped well. When communication is needed, it is important to use a fast type. On the Connection Machine there are three types of communication between processors: nearest-neighbor, global, and general. Nearest-neighbor and global communication are the fastest, and thus the most desirable. In Fortran 90, communication is expressed at a high level as operations on data arrays, or on elements of data arrays; the compiler translates these statements into the appropriate lower-level communication functions. The type of communication performed when executing a specific Fortran 90 instruction depends on the data mapping as well as on the instruction itself. The user can therefore control the type of communication performed during execution by selecting the ``optimal'' data layout and using the ``correct'' array operations.
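A minimal NumPy sketch of the three kinds of array operation, with each element imagined as living on its own processor (the array values are arbitrary illustrations):

```python
import numpy as np

a = np.arange(6.0)  # a small distributed array, one element per processor

# Nearest-neighbor communication: a circular shift, the analogue of
# Fortran 90's CSHIFT intrinsic -- each processor exchanges data only
# with its immediate neighbor.
neighbor = np.roll(a, -1)

# Global communication: a reduction such as SUM, which the compiler can
# map onto a fast tree-structured combine across all processors.
total = a.sum()

# General communication: a gather through an arbitrary index vector
# (vector-subscript indexing in Fortran 90); data may move between any
# pair of processors, so this is typically the slowest form.
perm = np.array([4, 0, 5, 2, 1, 3])
gathered = a[perm]
```

Writing the same computation with a shift instead of an arbitrary gather, when the algorithm allows it, is exactly the kind of ``correct'' array operation the text recommends.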
The second basic principle for writing efficient parallel programs is to keep the maximum number of processors busy at all times. The first way to avoid idle processors is to have a problem large enough for the size of the computer at hand; in general, massively parallel computers are not good at solving small problems. Recursion along one of the parallel axes can cause severe load-balancing problems. Performing special computations in the boundary regions of a finite-difference grid is another situation where processor time can be wasted if the boundary conditions are not carefully designed. There is no general recipe for avoiding load imbalance, but most of the time the problem can be solved by changing the data mapping and/or modifying the algorithm. In extreme cases, the data can be redistributed among the processors to balance the computational load, at the expense of extra communication.
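The effect of changing the data mapping on load balance can be illustrated with a toy cost model (the task costs and processor count below are assumptions chosen to mimic expensive boundary computations on a finite-difference grid):

```python
# Hypothetical per-task costs: the tasks at both ends of the grid are
# much more expensive, as with special boundary conditions.
costs = [9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9]
n_proc = 4
chunk = len(costs) // n_proc

# Block mapping: contiguous chunks of tasks per processor; the two
# processors that own the boundaries do almost all the work.
block_load = [sum(costs[p * chunk:(p + 1) * chunk]) for p in range(n_proc)]

# Cyclic mapping: tasks dealt out round-robin, spreading the expensive
# boundary tasks evenly at the price of extra communication.
cyclic_load = [sum(costs[p::n_proc]) for p in range(n_proc)]

# The slowest processor sets the running time of the whole step.
print(max(block_load), max(cyclic_load))
```

Here the block mapping leaves the slowest processor with 19 units of work while the cyclic mapping balances every processor at 11, which is why remapping, despite its communication cost, can pay off in extreme cases.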
In the next section I begin my review of wave-equation algorithms with Fourier domain methods.