A singular value decomposition (SVD) of a real m x n matrix A is its factorization into the product of three matrices:

    A = U Σ V^T,

where U is an m x n matrix with orthonormal columns, Σ is an n x n diagonal matrix whose entries are the (nonnegative) singular values of A, and V is an n x n orthogonal matrix.
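The factorization can be checked numerically; a small NumPy sketch (note that np.linalg.svd returns V^T rather than V):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: U is 5x3, s holds the 3 singular values, Vt is V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The three factors reproduce A.
assert np.allclose(U @ np.diag(s) @ Vt, A)
```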
There are several ways of computing the SVD. With serial processing, variants of the QR algorithm (used to compute the eigenvalues of a matrix) can be used to compute the SVD. The basic approach involves the following (Golub and Van Loan, 1989):

    1. Form C = A^T A.
    2. Use the symmetric QR algorithm to compute a matrix V1 such that V1^T C V1 = diag(σ1^2, ..., σn^2).
    3. Apply QR factorization with column pivoting to A V1, obtaining U^T (A V1) Π = R.
With some manipulations, it can be shown that R is diagonal, so we get Σ = R and V = V1 Π, completing the SVD computation A = U Σ V^T.
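The eigendecomposition route above can be sketched in a few lines of NumPy. For simplicity, this sketch recovers U by normalizing the columns of A V (assuming A has full column rank) rather than via the pivoted QR step; the function name is illustrative:

```python
import numpy as np

def svd_via_normal_eq(A):
    """Illustrative SVD via the eigendecomposition of A^T A.

    Forming A^T A squares the condition number, which is exactly the
    loss of information the text warns about; this is for exposition,
    not production use.
    """
    # A^T A = V diag(w) V^T, with eigenvalues w = sigma^2.
    w, V = np.linalg.eigh(A.T @ A)
    idx = np.argsort(w)[::-1]                 # descending order
    w, V = w[idx], V[:, idx]
    sigma = np.sqrt(np.clip(w, 0.0, None))    # singular values
    U = (A @ V) / sigma                       # u_i = A v_i / sigma_i
    return U, sigma, V
```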
The drawback of this approach is that the formation of A^T A can lead to a loss of information. A preferable method for computing the SVD is described in Golub and Kahan (1965). The symmetric QR algorithm is implicitly applied to A^T A (in other words, the product A^T A is never explicitly formed) in order to find U and V simultaneously.
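The Golub-Kahan route begins by reducing A to bidiagonal form with Householder reflections, never forming A^T A explicitly. A minimal NumPy sketch of this first phase (assuming m >= n; the subsequent implicit-QR iteration on the bidiagonal matrix is omitted, and the function names are illustrative):

```python
import numpy as np

def house(x):
    """Householder reflector H = I - 2 v v^T / (v^T v) with H x = +-||x|| e1."""
    v = x.astype(float).copy()
    s = 1.0 if v[0] >= 0 else -1.0
    v[0] += s * np.linalg.norm(x)
    H = np.eye(len(x))
    nv = v @ v
    if nv > 0:
        H -= 2.0 * np.outer(v, v) / nv
    return H

def bidiagonalize(A):
    """Compute orthogonal U, V and upper-bidiagonal B with A = U B V^T."""
    m, n = A.shape                      # assumes m >= n
    U, V, B = np.eye(m), np.eye(n), A.astype(float)
    for k in range(n):
        # Zero the entries below the diagonal in column k.
        H = house(B[k:, k])
        B[k:, :] = H @ B[k:, :]
        U[:, k:] = U[:, k:] @ H         # H is symmetric, so H^T = H
        if k < n - 2:
            # Zero the entries to the right of the superdiagonal in row k.
            G = house(B[k, k + 1:])
            B[:, k + 1:] = B[:, k + 1:] @ G
            V[:, k + 1:] = V[:, k + 1:] @ G
    return U, B, V
```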
The approaches described above are very effective on serial processors, but they are not readily amenable to parallel implementation. With parallel processing, the preferred methods are variations of the Jacobi algorithm used to compute the eigenvalues of a symmetric matrix. The classical Jacobi algorithm (Jacobi, 1846) involves pre-multiplication and post-multiplication of A by plane rotations of the form (Jacobi rotations):

    J(p, q, θ) = the identity matrix, except for the four entries

        J(p, p) = c,    J(p, q) = s,
        J(q, p) = -s,   J(q, q) = c.
c and s are the cosine and sine of an angle θ determined in accordance with the following objective:

    [ c  -s ] [ a_pp  a_pq ] [  c  s ]
    [ s   c ] [ a_qp  a_qq ] [ -s  c ]  =  D,

where D is a diagonal matrix and a_pp, a_pq, a_qp, a_qq are elements of A. With the resulting J, B = J^T A J will agree with A except in rows and columns p and q, and it is closer to diagonal form than A. Continuing this pre-multiplication and post-multiplication by matrices J(p, q, θ), using different p and q, we can obtain an orthogonal diagonalization of A. This algorithm can be modified to obtain the SVD of an arbitrary matrix A. A parallel implementation involves the storage of two columns of the matrix per processor. The determination of the matrices J(p, q, θ) and the post-multiplication can be done in parallel. The problem with this approach is that the transformation matrices J have to move between processors in order to perform the pre-multiplication that affects the rows.
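The classical two-sided Jacobi iteration for a symmetric matrix can be sketched as follows, using the standard rotation-angle formulas; the function name, sweep count, and threshold are illustrative, and for clarity each rotation is applied as a full matrix product rather than touching only rows and columns p and q:

```python
import numpy as np

def jacobi_eig(A, sweeps=10):
    """Diagonalize a symmetric A by cyclic Jacobi sweeps: V^T A V -> diag."""
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                # Angle that zeroes the (p, q) entry of J^T A J.
                tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
                if tau != 0.0:
                    t = np.sign(tau) / (abs(tau) + np.sqrt(1.0 + tau * tau))
                else:
                    t = 1.0                      # 45-degree rotation
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = t * c
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q] = s
                J[q, p] = -s
                A = J.T @ A @ J                  # annihilates A[p, q]
                V = V @ J                        # accumulate rotations
    return np.diag(A), V
```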
A parallel implementation using mesh-connected processors is described in Brent et al. (1985). In order to compute the SVD of an n x n matrix, O(n^2) processors are needed. This requirement seriously limits the size of the matrices that can be decomposed on the Connection Machine (CM).
The method that is most suitable for implementation on the CM is described in Brent and Luk (1985). It requires O(n) processors and O(mnS) time, where S is the number of sweeps required for convergence (typically S = O(log n)). With this method there is no successive pre-multiplication of the matrix A, so the transformation matrices do not have to move between processors. This approach to the SVD computation is an implementation of Hestenes' method (Hestenes, 1958).
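In Hestenes' one-sided variant, the rotations are applied only from the right, chosen to orthogonalize pairs of columns of the working matrix; each rotation touches only two columns, which is why no transformations need to move between processors. A minimal serial NumPy sketch (function name, sweep limit, and tolerance are illustrative):

```python
import numpy as np

def hestenes_svd(A, sweeps=30, tol=1e-12):
    """One-sided Jacobi SVD: rotate column pairs of A until orthogonal."""
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.eye(n)
    for _ in range(sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]        # want this to become 0
                off = max(off, abs(gamma))
                if abs(gamma) < tol:
                    continue
                # Angle that makes the rotated columns p and q orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta != 0.0:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                else:
                    t = 1.0
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = t * c
                Ap, Aq = A[:, p].copy(), A[:, q].copy()
                A[:, p], A[:, q] = c * Ap - s * Aq, s * Ap + c * Aq
                Vp, Vq = V[:, p].copy(), V[:, q].copy()
                V[:, p], V[:, q] = c * Vp - s * Vq, s * Vp + c * Vq
        if off < tol:
            break
    # Columns of A are now orthogonal; their norms are the singular values.
    sigma = np.linalg.norm(A, axis=0)
    U = A / np.where(sigma > 0, sigma, 1.0)
    return U, sigma, V
```

Because A is only ever post-multiplied (A <- A J), the original matrix satisfies A = U Σ V^T with V the accumulated product of rotations.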