
Understanding CG magic and advanced methods

This section includes Sergey Fomel's explanation of the ``magic'' convergence properties of the conjugate-direction methods. It also presents a classic version of conjugate gradients, which can be found in numerous books on least-squares optimization.

The key idea for constructing an optimal iteration is to update the solution at each step along a direction composed of a linear combination of the current direction and all the previous solution steps. To see why this is a helpful idea, let us first consider the method of random directions. Substituting expression (47) into formula (45), we see that the residual power decreases at each step by
\begin{displaymath}
\bold r \cdot \bold r -
\bold r_{\rm new} \cdot \bold r_{\rm new} \eq
\frac{( \bold r \cdot \Delta \bold r )^2}
{( \Delta \bold r \cdot \Delta \bold r )}\;.
\end{displaymath} (90)
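As a quick numerical check of identity (90), the following NumPy sketch takes one step along a random direction, with $\alpha$ computed from formula (47); the operator F, data d, and starting model x are arbitrary placeholders.

import numpy as np

# One random-direction step: verify the residual-power decrease (90).
rng = np.random.default_rng(0)
F = rng.normal(size=(20, 10))     # placeholder linear operator
d = rng.normal(size=20)           # placeholder data
x = rng.normal(size=10)           # placeholder starting model

r = F @ x - d                     # residual
dx = rng.normal(size=10)          # random direction Delta x
dr = F @ dx                       # Delta r = F Delta x
alpha = -(r @ dr) / (dr @ dr)     # formula (47)
r_new = r + alpha * dr

drop = r @ r - r_new @ r_new
print(np.isclose(drop, (r @ dr) ** 2 / (dr @ dr)))   # True: identity (90)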
To achieve better convergence, we need to maximize the right-hand side of (90). Let us define a new solution step $\bold s_{\rm new}$ as a combination of the current direction $\Delta \bold x$ and the previous step $\bold s$, as follows:
 \begin{displaymath}
 \bold s_{\rm new} \eq \Delta \bold x + \beta \bold s\;.\end{displaymath} (91)
The solution update is then defined as  
 \begin{displaymath}
\bold x_{\rm new} \eq \bold x+\alpha \bold s_{\rm new}\;.\end{displaymath} (92)
The formula for $\alpha$ (47) still holds, because we have preserved in (92) the form of equation (41) and simply replaced $\Delta \bold x$ with $\bold s_{\rm new}$. In fact, formula (47) can be simplified a little. From (46), we know that $\bold r_{\rm new}$ is orthogonal to $\Delta \bold r = \bold F \bold s_{\rm new}$. Likewise, $\bold r$ should be orthogonal to $\bold F \bold s$ (recall that $\bold r$ was $\bold r_{\rm new}$ and $\bold s$ was $\bold s_{\rm new}$ at the previous iteration). We can conclude that
\begin{displaymath}
(\bold r \cdot \Delta \bold r ) \eq
(\bold r \cdot \bold F \bold s_{\rm new}) \eq
(\bold r \cdot \bold F \Delta \bold x) + \beta\,(\bold r \cdot \bold F \bold s) \eq
(\bold r \cdot \bold F \Delta \bold x)\;.
\end{displaymath} (93)
Comparing (93) with (90), we can see that adding a portion of the previous step to the current direction does not change the numerator in expression (90). However, the denominator can change. Minimizing the denominator maximizes the decrease of the residual power at each step and therefore leads to faster convergence. It is this minimization of the denominator that constrains the value of the adjustable coefficient $\beta$ in (91).

The procedure for finding $\beta$ is completely analogous to the derivation of formula (47). We start by expanding the dot product $(\Delta \bold r \cdot \Delta \bold r)$:
\begin{displaymath}
(\bold F \bold s_{\rm new} \cdot \bold F \bold s_{\rm new}) \eq
\bold F \Delta \bold x \cdot \bold F \Delta \bold x +
2\,\beta\,(\bold F \Delta \bold x \cdot \bold F \bold s) +
\beta^2\,\bold F \bold s \cdot \bold F \bold s\;.
\end{displaymath} (94)
Differentiating with respect to $\beta$ and setting the derivative to zero, we find that  
 \begin{displaymath}
 0 \eq 2 (\bold F \Delta \bold x + \beta \bold F \bold s) 
 \cdot \bold F \bold s\;.\end{displaymath} (95)
Equation (95) states that the conjugate direction $\bold F \bold s_{\rm new}$ is orthogonal (perpendicular) to the previous conjugate direction $\bold F \bold s$. It also defines the value of $\beta$ as  
 \begin{displaymath}
 \beta \eq - \frac{ (\bold F \Delta \bold x \cdot \bold F \bold s )}
 {(\bold F \bold s \cdot \bold F \bold s )}\;.\end{displaymath} (96)

Can we do even better? The positive quantity that we minimized in (94) decreased by  
\begin{displaymath}
\bold F \Delta \bold x \cdot \bold F \Delta \bold x -
\bold F \bold s_{\rm new} \cdot \bold F \bold s_{\rm new} \eq
\frac{( \bold F \Delta \bold x \cdot \bold F \bold s )^2}
{(\bold F \bold s \cdot \bold F \bold s )}\;.
\end{displaymath} (97)
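A similar sketch confirms that the $\beta$ of (96) makes the new conjugate direction orthogonal to the previous one, as stated by (95), and that the quantity expanded in (94) drops by exactly the amount (97); all arrays below are illustrative placeholders.

import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(20, 10))     # placeholder linear operator
dx = rng.normal(size=10)          # current direction Delta x
s = rng.normal(size=10)           # previous solution step s

beta = -((F @ dx) @ (F @ s)) / ((F @ s) @ (F @ s))   # formula (96)
s_new = dx + beta * s                                # formula (91)

print(np.isclose((F @ s_new) @ (F @ s), 0.0))        # orthogonality (95)
drop = (F @ dx) @ (F @ dx) - (F @ s_new) @ (F @ s_new)
print(np.isclose(drop, ((F @ dx) @ (F @ s)) ** 2 / ((F @ s) @ (F @ s))))  # (97)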
Can we decrease it further by adding another previous step? In general, the answer is yes, and this defines the method of conjugate directions. I will state this result without a formal proof (which uses the method of mathematical induction).

To see why this is an optimally convergent method, it is sufficient to notice that the vectors $\bold F \bold s_i$ form an orthogonal basis in the data space. The vector from the current residual to the smallest residual also belongs to that space. If the data size is n, then at most n basis components are required to represent this vector; hence, no more than n conjugate-direction steps are required to find the solution.

The computation template for the method of conjugate directions is  


		 $\bold r \quad\longleftarrow\quad\bold F \bold x - \bold d$ 
		 iterate { 
		 		  $\Delta \bold x \quad\longleftarrow\quad{\rm random\ numbers}$ 
		 		  $\bold s \quad\longleftarrow\quad\Delta \bold x + \sum_{i < n} \beta_i \bold s_i\;, \quad \beta_i \eq - \frac{ (\bold F \Delta \bold x \cdot \bold F \bold s_i )}{(\bold F \bold s_i \cdot \bold F \bold s_i )}$ 
		 		  $\Delta \bold r \quad\longleftarrow\quad\bold F \bold s$ 
		 		  $\alpha \quad\longleftarrow\quad - ( \bold r \cdot \Delta\bold r )/(\Delta \bold r \cdot \Delta\bold r )$ 
		 		  $\bold x \quad\longleftarrow\quad\bold x + \alpha\bold s$ 
		 		  $\bold r \quad\longleftarrow\quad\bold r + \alpha\Delta \bold r$ 
		 		  } 
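The template translates almost line for line into the following NumPy sketch; it is only an illustration, with a dense random matrix standing in for the operator $\bold F$ and with every previous step $\bold s_i$ and its image $\bold F \bold s_i$ kept in memory, as the method requires.

import numpy as np

def conjugate_directions(F, d, x, niter, rng):
    # Minimize |F x - d|^2 by conjugate directions fed with random directions.
    r = F @ x - d
    steps = []                                  # all previous (s_i, F s_i)
    for _ in range(niter):
        dx = rng.normal(size=x.size)            # random direction
        # s = dx + sum_i beta_i s_i, with beta_i from (99)
        s = dx - sum(((F @ dx) @ Fsi) / (Fsi @ Fsi) * si for si, Fsi in steps)
        dr = F @ s                              # conjugate direction Delta r = F s
        alpha = -(r @ dr) / (dr @ dr)
        x = x + alpha * s
        r = r + alpha * dr
        steps.append((s, dr))
    return x, r

rng = np.random.default_rng(2)
F = rng.normal(size=(30, 8))
d = rng.normal(size=30)
x, r = conjugate_directions(F, d, np.zeros(8), niter=8, rng=rng)
print(np.linalg.norm(F.T @ r))                  # ~0: the normal equations are satisfied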

What happens if we ``feed'' the method with gradient directions instead of just random directions? It turns out that in this case we need to remember, of all the previous steps $\bold s_i$, only the one that immediately precedes the current iteration. Let us derive a formal proof of that fact, as well as some other useful formulas related to the method of conjugate gradients.

According to formula (46), the new residual $\bold r_{\rm new}$ is orthogonal to the conjugate direction $\Delta \bold r = \bold F \bold s_{\rm new}$. According to the orthogonality condition (100), it is also orthogonal to all the previous conjugate directions. Defining $\Delta \bold x$ to be the gradient $\bold F' \bold r$ and applying the definition of the adjoint operator, we can rewrite the orthogonality condition in the form
\begin{displaymath}
0 \eq (\bold r_n \cdot \bold F \bold s_i) \eq
(\bold F' \bold r_n \cdot \bold s_i) \eq
(\Delta \bold x_{n+1} \cdot \bold s_i)
\quad \mbox{for all} \quad i \leq n\;.
\end{displaymath} (101)
According to formula (98), each solution step $\bold s_i$ is just a linear combination of the gradient $\Delta \bold x_i$ and the previous solution steps. We deduce from formula (101) that  
 \begin{displaymath}
 0 \eq (\Delta \bold x_n \cdot \bold s_i) \eq 
 (\Delta \bold x_n \cdot \Delta \bold x_i)
 \quad \mbox{for all} \quad i < n
 \end{displaymath} (102)
In other words, in the method of conjugate gradients, the current gradient direction is always orthogonal to all the previous directions. The iteration process constructs not only an orthogonal basis in the data space but also an orthogonal basis in the model space, composed of the gradient directions.

Now let us take a closer look at formula (99). Note that $\bold F \bold s_i$ is simply related to the residual step at the i-th iteration:
 \begin{displaymath}
\bold F \bold s_i = \frac{\bold r_i - \bold
 r_{i-1}}{\alpha_i}\;.
 \end{displaymath} (103)
Substituting relationship (103) into formula (99) and applying again the definition of the adjoint operator, we obtain  
\begin{displaymath}
\beta_i =
- \frac{ \bold F \Delta \bold x_n \cdot (\bold r_i - \bold r_{i-1})}
{\alpha_i (\bold F \bold s_i \cdot \bold F \bold s_i )} =
- \frac{ \Delta \bold x_n \cdot (\Delta \bold x_{i+1} - \Delta \bold x_i)}
{\alpha_i (\bold F \bold s_i \cdot \bold F \bold s_i )}\;.
\end{displaymath} (104)
Since the gradients $\Delta \bold x_i$ are orthogonal to each other, the dot product in the numerator is equal to zero unless i = n-1. This means that only the immediately preceding step $\bold s_{n-1}$ contributes to the definition of the new solution direction $\bold s_n$ in (98). This is precisely the property of the conjugate-gradient method that we wanted to prove.

To simplify formula (104), rewrite formula (47) as  
\begin{displaymath}
\alpha_i \eq - \frac
{ (\bold r_{i-1} \cdot \bold F \Delta \bold x_i )}
{( \bold F \bold s_i \cdot \bold F \bold s_i ) } \eq
- \frac{ (\Delta \bold x_i \cdot \Delta \bold x_i )}
{( \bold F \bold s_i \cdot \bold F \bold s_i ) }\;.
\end{displaymath} (105)
Substituting (105) into (104), we obtain  
\begin{displaymath}
\beta =
- \frac{( \Delta \bold x_n \cdot \Delta \bold x_n)}
{(\Delta \bold x_{n-1} \cdot \Delta \bold x_{n-1})}\;.
\end{displaymath} (106)

The computation template for the method of conjugate gradients is then  


		 $\bold r \quad\longleftarrow\quad\bold F \bold x - \bold d$ 
		 $\beta \quad\longleftarrow\quad 0$ 
		 iterate { 
		 		  $\Delta \bold x \quad\longleftarrow\quad\bold F' \bold r$ 
		 		  if not the first iteration: $\beta \quad\longleftarrow\quad - \frac{ (\Delta \bold x \cdot \Delta \bold x )}{ \gamma}$ 
		 		  $\gamma \quad\longleftarrow\quad(\Delta \bold x \cdot \Delta \bold x )$ 
		 		  $\bold s \quad\longleftarrow\quad\Delta \bold x + \beta \bold s$ 
		 		  $\Delta \bold r \quad\longleftarrow\quad\bold F \bold s$ 
		 		  $\alpha \quad\longleftarrow\quad - \gamma/(\Delta \bold r \cdot \Delta\bold r )$ 
		 		  $\bold x \quad\longleftarrow\quad\bold x + \alpha\bold s$ 
		 		  $\bold r \quad\longleftarrow\quad\bold r + \alpha\Delta \bold r$ 
		 		  } 
Module conjgrad provides an implementation of this method. Its interface is identical to that of cgstep, so conjgrad can be passed as an argument to solver.
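For illustration, here is a short NumPy sketch of a conjugate-gradient loop with the same structure as the template above; the operator F and data d are placeholders, and the sign of $\beta$ follows the conventional Fletcher-Reeves formula $\beta = (\Delta \bold x \cdot \Delta \bold x)/\gamma$.

import numpy as np

def conjugate_gradients(F, d, x, niter):
    # CG loop for minimizing |F x - d|^2, following the structure of the
    # template above; beta uses the conventional Fletcher-Reeves sign.
    r = F @ x - d
    gamma_prev, s = None, None
    for _ in range(niter):
        dx = F.T @ r                    # gradient, Delta x = F' r
        gamma = dx @ dx
        s = dx if s is None else dx + (gamma / gamma_prev) * s
        dr = F @ s                      # Delta r = F s
        alpha = -gamma / (dr @ dr)      # exact step along s
        x = x + alpha * s
        r = r + alpha * dr
        gamma_prev = gamma
    return x, r

rng = np.random.default_rng(3)
F = rng.normal(size=(30, 8))
d = rng.normal(size=30)
x, r = conjugate_gradients(F, d, np.zeros(8), niter=8)
print(np.linalg.norm(F.T @ r))          # ~0 after as many iterations as model parameters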

When the orthogonality of the gradients implied by the classical conjugate-gradient method is not numerically assured, the conjgrad algorithm may lose its convergence properties. This problem does not exist in the algebraic derivation but appears in practice because of numerical errors. A proper remedy is to orthogonalize each new gradient against the previous ones. Naturally, this increases the cost and memory requirements of the method.
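A minimal sketch of that remedy, under the assumption that all earlier gradients are kept in memory: apply modified Gram-Schmidt to each new gradient before it enters the recursion (the function name and storage scheme are illustrative only).

import numpy as np

def reorthogonalize(dx, previous):
    # Orthogonalize the new gradient dx against all stored previous gradients
    # (modified Gram-Schmidt), then store the result as well.
    for q in previous:
        dx = dx - ((dx @ q) / (q @ q)) * q
    previous.append(dx.copy())
    return dx

In the sketch above, replacing the plain gradient with dx = reorthogonalize(F.T @ r, previous) restores the orthogonality, at the price of storing every previous gradient.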

 

module conjgrad_mod {
  # saved state: s is the current solution step, ss = F s is the conjugate direction
  real, dimension (:), allocatable, private 	:: s, ss
contains
  subroutine conjgrad_close ()  {
        if( allocated( s))  deallocate( s, ss)
        }
  # one conjugate-gradient step; the interface matches cgstep:
  # x is the model, g the gradient, rr the residual, gg = F g
  function conjgrad( forget, x, g, rr, gg)  result( stat) {
    integer              :: stat
    real, dimension (:)  :: x, g, rr, gg
    logical              :: forget
    real, save           :: rnp              # gradient norm from the previous call
    double precision     :: rn, alpha, beta
    rn = sum( dprod( g, g))                  # current gradient norm
    if( .not. allocated( s)) {   forget = .true.
           allocate( s  (size (x )));  s  = 0. 
           allocate( ss (size (rr)));  ss = 0.
           }
    if( forget .or. rnp < epsilon (rnp))     # first step: no previous direction
           alpha = 0.d0
    else
           alpha = - rn / rnp                # the beta of the template above
    s  =  g + alpha *  s                     # new solution step
    ss = gg + alpha * ss                     # its image F s in data space
    beta = sum( dprod( ss, ss))
    if( beta > epsilon( beta)) {
           alpha = rn / beta                 # step length along s
           x =  x  + alpha *  s
           rr = rr + alpha * ss
           }
    rnp = rn;    forget = .false.;   stat = 0
    }
}

