In the discussion to follow we assume that A has an even number of columns. If the number of columns is odd, we pad A with a column of zeroes. We use processors to store an matrix Ak, where each processor stores two columns of Ak. Following the explanation given in Golub and Van Loan (1989), for purposes of illustration we assume that A has 8 columns. Then we can store A in the following way :
where U1,U2 are 2-dimensional arrays, Pn is processor n and the numbers indicate columns of Ak. Then we can orthogonalize the four sets of two columns in parallel, using the method described in the previous section. After the columns are orthogonalized, we keep column 1 fixed and rotate the ring defined by (2 3 4 8 7 6 5), obtaining the following configuration :
The rotation can be achieved in less than six steps. We chose this approach because the only data movement between processors is done efficiently with cshift instructions.
The resulting column pairs can then be orthogonalized in parallel. After seven rotations we have processed all the column pairs in A. In general, for an n column matrix, it takes n-1 steps to process all the column pairs. Each set of n-1 steps is referred as a sweep. Throughout the sweep we check the values of all ,with and having been defined in the previous section. If for a sufficiently small holds for each set of in a sweep, then we can say that we have orthogonalized A and obtained W. We can also obtain V if we begin with an identity matrix and at every step of the algorithm we process each Vk in a manner similar to each Ak.
In Brent and Luk (1985), it is stated that although the authors know of no example for which the algorithm fails to give convergence, convergence is not guaranteed. In order to enforce convergence, a threshold approach (Wilkinson, 1965) can be chosen. A threshold value can be associated with each sweep. Any pair of columns in that sweep, with below the threshold, is not processed.