All the code needed for the SVD computation was written as a CM FORTRAN subroutine. Its interface to a calling routine and the code listing are given in the appendix. Although the goal is to handle a general matrix, all performance evaluations leading to this report were made on matrices where n is an integral power of 2. This restriction has facilitated fine tuning on the CM, and only minor modifications will be needed to handle general matrices once performance has been optimized for the restricted set.
All testing was done on a CM with 128 physical processors (amounting to 4K virtual processors). Single precision was used throughout so that the largest possible matrices could be processed. Even so, the largest matrices that can be processed are matrices. With the current data layout, performance improves with increasing matrix size. The best performance achieved so far is 14 Mflops with matrices. The main reason for this less-than-optimal performance is the time needed to shift the columns of the matrix after every orthogonalization step. However, the code optimization phase is not complete, and the following remains to be done:
The flop rate should improve by a factor of three to five once the changes noted above are made.
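To make the cost structure described above concrete, the following is a minimal serial sketch of an orthogonalization step followed by a column shift. It assumes a one-sided (Hestenes) Jacobi iteration with a round-robin column exchange, which this section does not confirm as the method used. It is written in Python rather than CM FORTRAN, and the names `orthogonalize_pair` and `jacobi_singular_values` are hypothetical. The reordering of `order` only stands in for the shift; on a parallel machine that step would involve moving column data between processors.

```python
import math


def orthogonalize_pair(x, y, tol=1e-12):
    """Apply a plane rotation in place so columns x and y become orthogonal.

    Returns True if a rotation was applied, False if the pair was already
    orthogonal to within the (assumed) relative tolerance.
    """
    alpha = sum(v * v for v in x)             # squared norm of x
    beta = sum(v * v for v in y)              # squared norm of y
    gamma = sum(u * v for u, v in zip(x, y))  # inner product of x and y
    if abs(gamma) <= tol * math.sqrt(alpha * beta):
        return False
    zeta = (beta - alpha) / (2.0 * gamma)
    # Smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0.
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    for k in range(len(x)):
        xk, yk = x[k], y[k]
        x[k] = c * xk - s * yk
        y[k] = s * xk + c * yk
    return True


def jacobi_singular_values(columns, max_sweeps=30):
    """Orthogonalize the columns in place and return their norms, largest first.

    `columns` is a list of equal-length lists; the column count is assumed even.
    """
    n = len(columns)
    order = list(range(n))
    for _ in range(max_sweeps):
        rotated = False
        for _ in range(n - 1):
            # Orthogonalization step: n/2 disjoint pairs, independent of
            # one another and therefore candidates for parallel execution.
            for k in range(n // 2):
                i, j = order[k], order[n - 1 - k]
                if orthogonalize_pair(columns[i], columns[j]):
                    rotated = True
            # Shift step: rotate all positions but the first so that new
            # pairs meet on the next step (round-robin tournament ordering).
            order = [order[0], order[-1]] + order[1:-1]
        if not rotated:
            break
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    return sorted(norms, reverse=True)
```

In this sketch one sweep consists of n - 1 orthogonalization steps, each followed by one shift, so the shift is paid as often as the arithmetic. That is consistent with the column shift being named above as the main cost.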