The basic outline of our CM migration algorithm follows.
First, a large swath of the input trace data is loaded into the
array `DATA(nt,nx*ny)`, and the trace coordinate headers are loaded
into their respective vectors of the form `Xr(nx*ny)`. To enhance I/O
speed, we transpose the trace data on the front-end using the SEPLIB
routine `Transp`. Then we load the trace data in parallel from the
front-end to the CM using the block data transfer subroutine
`cmf_fe_array_to_cm` and the temporary front-end vector
`fe_temp(nx*ny)`:

```fortran
c LOAD TRACES INTO DATA...
      do 300 it=1,nt
        irec= (ishot-1)*nt+it
        read(11,rec=irec) (fe_temp(ixy),ixy=1,nxny)
        call cmf_fe_array_to_cm(data(it,:),fe_temp)
 300  continue
```

Next, the required migration surface geometry distances are calculated
in parallel, given
the source and receiver surface coordinates `(xs,ys)`, `(xr,yr)`,
and the image volume surface coordinates `(xi,yi)`:

```fortran
c CALC DISTANCES...
      ps2(:)= (xs(:)-xi(:))**2+(ys(:)-yi(:))**2
      pr2(:)= (xr(:)-xi(:))**2+(yr(:)-yi(:))**2
```
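The same elementwise distance computation can be sketched in NumPy, where whole-array expressions play the role of the CM Fortran array syntax. The coordinate values below are made up purely for illustration; only the array names mirror the Fortran.

```python
import numpy as np

# Hypothetical surface coordinates for nx*ny image points (illustration only).
nxny = 4
xs, ys = np.array([0.0, 1.0, 2.0, 3.0]), np.zeros(nxny)      # source coords
xr, yr = np.array([1.0, 2.0, 3.0, 4.0]), np.zeros(nxny)      # receiver coords
xi, yi = np.array([0.5, 1.5, 2.5, 3.5]), np.zeros(nxny)      # image coords

# Squared source-to-image and receiver-to-image surface distances,
# computed elementwise over the nx*ny parallel dimension.
ps2 = (xs - xi)**2 + (ys - yi)**2
pr2 = (xr - xi)**2 + (yr - yi)**2
```

Every image point evaluates its own pair of squared distances independently, which is exactly what makes this step embarrassingly parallel on the CM.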

The heart of the migration operation appears in the next serial loop
over `ntau`:

```fortran
c MAIN LOOP...
      do 2000 itau=1,ntau

c Calculate migration imaging time...
        tau= 0.5*(itau-1)*dt+t0
        tau2= tau*tau
        timg(:)= sqrt(ps2(:)/vs2(:)+tau2)+sqrt(pr2(:)/vr2(:)+tau2)

c Calculate cos(theta) and amplitude correction weights...
        cosr(:)= tau/(timg(:)+1.0e-06)
        amp(:)= sqrt(ps2(:)/vs2(:)+tau2)/(sqrt(pr2(:)/vr2(:)+tau2)+1.0e-06)

c Calculate image time pointers...
        itimg1(:)= int((timg(:)-t0)/dt)+1
        where(itimg1(:).ge.nt) itimg1(:)= nt-1
        timg1(:)= (itimg1(:)-1)*dt+t0
        itimg2(:)= itimg1(:)+1

c Use indirect memory addressing to linearly interpolate trace amps...
        call cmf_aref_1d(value1(:),data(:,:),itimg1(:),.true.)
        call cmf_aref_1d(value2(:),data(:,:),itimg2(:),.true.)
        trcamp(:)= value1(:)+((value2(:)-value1(:))/dt)*(timg(:)-timg1(:))

c Sum weighted trace value into migrated image...
        img(itau,:)= cosr(:)*amp(:)*trcamp(:) + img(itau,:)

 2000 continue
```

This main loop is extremely compact because the CM Fortran array syntax
expresses the parallelism directly. We have retained the explicit
notation `data(:,:)` to emphasize that all of the calculations are
performed in parallel over the `nx*ny` dimension. First, the vertical
traveltime `tau` and the migration imaging times `timg(:)` are computed
using the heterogeneous source and receiver time-migration (squared)
velocities `vs2(:)` and `vr2(:)`. Next, the obliquity and divergence
compensation amplitudes, `cosr(:)` and `amp(:)`, are calculated,
followed by the imaging time pointers `itimg1(:)` and `itimg2(:)`.
The trace values are then linearly interpolated in parallel using the
fast in-processor addressing subroutine `cmf_aref_1d`. Finally, all
migrated values for one pseudodepth step `tau` are summed in parallel
into the migrated image `img(:,:)`.
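The body of this loop can be sketched in NumPy to show the data flow of one `itau` step. Fancy indexing stands in for the indirect addressing of `cmf_aref_1d`, array sizes and the random inputs are invented for illustration, and indices are 0-based here (the Fortran is 1-based); this is a sketch of the technique, not the CM code.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nxny, ntau = 64, 8, 16                  # illustrative sizes (assumptions)
dt, t0 = 0.004, 0.0
data = rng.standard_normal((nt, nxny))      # one trace per "processor"
img = np.zeros((ntau, nxny))                # migrated image
ps2 = rng.uniform(1e4, 1e5, nxny)           # squared distances (made up)
pr2 = rng.uniform(1e4, 1e5, nxny)
vs2 = np.full(nxny, 2000.0**2)              # squared migration velocities
vr2 = np.full(nxny, 2000.0**2)

for itau in range(ntau):
    # Migration imaging time (double-square-root expression).
    tau = 0.5 * itau * dt + t0
    timg = np.sqrt(ps2 / vs2 + tau**2) + np.sqrt(pr2 / vr2 + tau**2)

    # Obliquity and divergence-compensation weights.
    cosr = tau / (timg + 1.0e-6)
    amp = np.sqrt(ps2 / vs2 + tau**2) / (np.sqrt(pr2 / vr2 + tau**2) + 1.0e-6)

    # Bracketing sample indices, clamped so it1+1 stays in range.
    it1 = np.minimum(((timg - t0) / dt).astype(int), nt - 2)
    t1 = it1 * dt + t0

    # Fancy indexing plays the role of cmf_aref_1d: each image point
    # gathers its own time sample from its own trace, in parallel.
    cols = np.arange(nxny)
    v1, v2 = data[it1, cols], data[it1 + 1, cols]
    trcamp = v1 + (v2 - v1) / dt * (timg - t1)

    # Weighted sum into the migrated image for this pseudodepth step.
    img[itau] += cosr * amp * trcamp
```

Note that at `itau = 0` (i.e., `tau = 0`) the obliquity weight `cosr` vanishes, so the first image row stays zero.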

At this point, each input trace has been migrated into a single
processor (i.e., one fixed offset per processor). This operation must be
repeated, migrating the traces into the other processors (offsets),
until every trace has been migrated into every processor. This is
accomplished efficiently with the CM Fortran `cshift` function, which
shifts each trace and its surface coordinates to the adjacent processor
in parallel:

```fortran
      data(:,:)= cshift(data(:,:),2,-1)
      xs(:)= cshift(xs(:),1,-1)
      ys(:)= cshift(ys(:),1,-1)
      xr(:)= cshift(xr(:),1,-1)
      yr(:)= cshift(yr(:),1,-1)
```

The number of `cshift`s required equals the size of the parallel
dimension, `nx*ny`. Depending on the relative sizes of `data(:,:)`
and `img(:,:)`, it may be more efficient to `cshift` `img(:,:)`
and the image coordinate vectors `xi(:)` and `yi(:)`, rather than
`data`, `xs`, `ys`, `xr`, and `yr` as we have done here.

The entire process described above is repeated for each generic trace
gather that must be reloaded into core memory, until all of the
front-end trace data have been migrated. Finally, the migrated image is
written in parallel from the CM to the front-end using the block data
transfer subroutine `cmf_fe_array_from_cm` as follows:

```fortran
      do 3000 itau=1,ntau
        call cmf_fe_array_from_cm(fe_temp,img(itau,:))
        write(21,rec=itau) (fe_temp(ixy),ixy=1,nxny)
 3000 continue
```

For efficiency, the image is written out in transposed format; a final
front-end call to `Transp` maps it back to trace-sequential format.

12/18/1997