All the code needed for the SVD computation was written as a single CM FORTRAN subroutine. The interface to a calling routine and the code listing are shown in the appendix. Although the goal was to handle a general matrix, all performance evaluations leading to this report were made on matrices where *n* is an integral power of 2. This restriction has simplified fine tuning on the CM, and only minor modifications should be needed to handle general matrices once performance has been optimized for the restricted set.

All the testing was done on a CM with 128 physical processors (amounting to 4*K* virtual processors). Single precision was used throughout so that the largest possible matrices could be processed. Even so, the largest matrices that can be processed are matrices. With the current data layout, performance improves with increasing matrix size. The best performance achieved so far has been 14 Mflops with matrices. The main reason for this less than optimal performance is the time needed to shift the columns of the matrix after every orthogonalization step. However, the code optimization phase is not complete, and the following remains to be done:

- Modify the data layout so that more than one processor is used to access a column pair. The immediate impact is an increase in the flop rate, because more processors take part in the computations. In making this modification, special attention has to be given to column shifting, with a view to optimizing that operation.
- Alter the column shifting procedure described in the previous section in order to minimize processor idle time. As mentioned before, this modification may have to be coordinated with the changes in data layout.

The flop rate should improve by a factor of three to five once the changes noted above are made, which would raise the current best of 14 Mflops to roughly 42 to 70 Mflops.
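
To illustrate why the columns have to be shifted after every orthogonalization step, the following is a minimal Python sketch of a one-sided Jacobi sweep. It is not the CM FORTRAN subroutine listed in the appendix. It assumes that the orthogonalization step is a Hestenes-type plane rotation applied to a column pair, and that pairs are regrouped with a simple round-robin ordering, which may differ from the shifting procedure described in the previous section. The names `rotate_pair`, `round_robin_shift`, `jacobi_sweep` and `singular_values` are hypothetical and chosen only for this illustration.

```python
import math


def rotate_pair(x, y, eps=1e-12):
    """Orthogonalize columns x and y in place with one plane rotation.

    Returns True if a rotation was applied, False if the pair was already
    orthogonal to within the relative tolerance eps (an assumed value).
    """
    alpha = sum(a * a for a in x)             # squared norm of x
    beta = sum(b * b for b in y)              # squared norm of y
    gamma = sum(a * b for a, b in zip(x, y))  # inner product of x and y
    if abs(gamma) <= eps * math.sqrt(alpha * beta):
        return False
    zeta = (beta - alpha) / (2.0 * gamma)
    # Smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0 (tangent of the angle).
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    for i in range(len(x)):
        xi, yi = x[i], y[i]
        x[i] = c * xi - s * yi
        y[i] = s * xi + c * yi
    return True


def round_robin_shift(order):
    """Keep the first column index fixed and cycle the others by one place."""
    return [order[0], order[-1]] + order[1:-1]


def jacobi_sweep(cols):
    """One sweep: every column pair meets exactly once (len(cols) must be even)."""
    n = len(cols)
    order = list(range(n))
    rotations = 0
    for _ in range(n - 1):
        # The n/2 pairs below are independent; on a parallel machine each
        # pair could be handled by its own processor.
        for k in range(n // 2):
            i, j = order[k], order[n - 1 - k]
            if rotate_pair(cols[i], cols[j]):
                rotations += 1
        # Regroup the columns so that new pairs are formed for the next step.
        order = round_robin_shift(order)
    return rotations


def singular_values(cols, max_sweeps=30):
    """Sweep until no rotation is needed; the column norms are the singular values."""
    for _ in range(max_sweeps):
        if jacobi_sweep(cols) == 0:
            break
    return sorted((math.sqrt(sum(v * v for v in c)) for c in cols), reverse=True)
```

In this serial sketch the shift only permutes column indices. On the CM the columns themselves move between processors after each step, which is the cost identified above as the main limit on the flop rate. As a usage example, `singular_values([[3.0, 4.0], [0.0, 5.0]])` takes a matrix given as a list of columns and should return values close to the square roots of 45 and 5.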

12/18/1997