** Next:** REFERENCES
** Up:** THE WORLD OF CONJUGATE
** Previous:** Standard methods

This section includes Sergey Fomel's explanation of the ``magic''
convergence properties of the conjugate-direction methods. It also
presents a classic version of conjugate gradients, which can be found
in numerous books on least-squares optimization.
The key idea in constructing an optimal iteration is to update the
solution at each step along a direction composed of a linear
combination of the current direction and all the previous solution steps.
To see why this is a helpful idea, let us first consider the method of
random directions. In the notation used below, $\mathbf{F}$ is the
modeling operator, $\mathbf{r}_n$ is the data residual after iteration
$n$, $\mathbf{c}_n$ is the current direction, $\mathbf{s}_n$ is the
solution step, and $\alpha_n$ is the step scale. Substituting
expression (47) into formula (45), we see that the residual power
decreases at each step by
$$
\mathbf{r}_{n-1} \cdot \mathbf{r}_{n-1} -
\mathbf{r}_{n} \cdot \mathbf{r}_{n} =
\frac{\left(\mathbf{r}_{n-1} \cdot \mathbf{F}\,\mathbf{c}_{n}\right)^{2}}
     {\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{c}_{n}}\;.
\tag{96}
$$

To achieve a better convergence, we need to maximize the right-hand
side of (96). Let us define a new solution step $\mathbf{s}_{n}$ as a combination of the current direction $\mathbf{c}_{n}$ and the previous step $\mathbf{s}_{n-1}$, as follows:
$$
\mathbf{s}_{n} = \mathbf{c}_{n} + \beta_{n}\,\mathbf{s}_{n-1}\;.
\tag{97}
$$

The solution update is then defined as
$$
\mathbf{m}_{n} = \mathbf{m}_{n-1} + \alpha_{n}\,\mathbf{s}_{n}\;.
\tag{98}
$$

Formula (47) for $\alpha_{n}$ still holds, because we have
preserved in (98) the form of equation (41)
and simply replaced $\mathbf{c}_{n}$ with $\mathbf{s}_{n}$. In fact,
formula (47) can be simplified a little bit. From
(46), we know that $\mathbf{r}_{n}$ is orthogonal
to $\mathbf{F}\,\mathbf{s}_{n}$. Likewise, $\mathbf{r}_{n-1}$ should be orthogonal to $\mathbf{F}\,\mathbf{s}_{n-1}$ (recall that $\mathbf{r}_{n-1}$ was $\mathbf{r}_{n}$
and $\mathbf{s}_{n-1}$ was $\mathbf{s}_{n}$ at the
previous iteration). We can conclude that
$$
\mathbf{r}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n} =
\mathbf{r}_{n-1} \cdot \mathbf{F}\,\mathbf{c}_{n} +
\beta_{n}\,\mathbf{r}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n-1} =
\mathbf{r}_{n-1} \cdot \mathbf{F}\,\mathbf{c}_{n}\;.
\tag{99}
$$

Comparing (99) with (96), we can see that
adding a portion of the previous step to the current direction does
not change the value of the numerator in expression
(96). However, the value of the denominator can be
changed. Minimizing the denominator maximizes the decrease of the
residual power at each step and thus leads to faster convergence. It is
this denominator minimization that constrains the value of the
adjustable coefficient $\beta_{n}$ in (97).
The procedure for finding $\beta_{n}$ is completely analogous to the
derivation of formula (47). We start by expanding the
dot product $\mathbf{F}\,\mathbf{s}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n}$:

$$
\mathbf{F}\,\mathbf{s}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n} =
\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{c}_{n} +
2\,\beta_{n}\,\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n-1} +
\beta_{n}^{2}\,\mathbf{F}\,\mathbf{s}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n-1}\;.
\tag{100}
$$

Differentiating with respect to $\beta_{n}$ and setting the derivative to
zero,
we find that
$$
\mathbf{F}\,\mathbf{s}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n-1} =
\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n-1} +
\beta_{n}\,\mathbf{F}\,\mathbf{s}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n-1} = 0\;.
\tag{101}
$$

Equation (101) states that the new *conjugate direction*
$\mathbf{F}\,\mathbf{s}_{n}$ is orthogonal (perpendicular) to the
previous conjugate direction $\mathbf{F}\,\mathbf{s}_{n-1}$. It also defines the
value of $\beta_{n}$ as
$$
\beta_{n} = -\,\frac{\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n-1}}
                    {\mathbf{F}\,\mathbf{s}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n-1}}\;.
\tag{102}
$$
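As a quick numeric illustration (not part of the original text), the following pure-Python sketch checks that the coefficient of (102) both minimizes the quadratic expanded in (100) and enforces the orthogonality (101). The vectors `Fc` and `Fs_prev`, standing for the operator applied to the current direction and to the previous step, are arbitrary made-up examples:

```python
# Numeric check of (100)-(102): beta minimizes |F c_n + beta F s_{n-1}|^2
# and makes the new conjugate direction orthogonal to the previous one.
# Fc and Fs_prev are arbitrary example vectors (hypothetical data).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

Fc = [3.0, 1.0, -2.0]       # F applied to the current direction c_n
Fs_prev = [1.0, 2.0, 2.0]   # F applied to the previous step s_{n-1}

beta = -dot(Fc, Fs_prev) / dot(Fs_prev, Fs_prev)     # formula (102)
Fs_new = [a + beta * b for a, b in zip(Fc, Fs_prev)]

def q(b):
    """The positive quantity expanded in (100), as a function of beta."""
    v = [a + b * c for a, c in zip(Fc, Fs_prev)]
    return dot(v, v)

print(abs(dot(Fs_new, Fs_prev)) < 1e-12)              # orthogonality (101)
print(q(beta) <= min(q(beta + 0.1), q(beta - 0.1)))   # minimality of (102)
```

Perturbing `beta` in either direction can only increase the denominator, confirming that (102) picks its minimizer.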

Can we do even better? The positive quantity that we minimized in
(100) has decreased by

$$
\frac{\left(\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n-1}\right)^{2}}
     {\mathbf{F}\,\mathbf{s}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n-1}}\;.
\tag{103}
$$

Can we decrease it further by adding another previous step? In
general, the answer is yes, and this observation defines the method of
conjugate directions. I will state this result without a formal proof
(which uses the method of mathematical induction).
- If the new step is
composed of the current direction and a linear combination of all the
previous steps,
$$
\mathbf{s}_{n} = \mathbf{c}_{n} + \sum_{i=1}^{n-1} \beta_{i}^{(n)}\,\mathbf{s}_{i}\;,
\tag{104}
$$

then the optimal convergence is achieved when
$$
\beta_{i}^{(n)} = -\,\frac{\mathbf{F}\,\mathbf{c}_{n} \cdot \mathbf{F}\,\mathbf{s}_{i}}
                          {\mathbf{F}\,\mathbf{s}_{i} \cdot \mathbf{F}\,\mathbf{s}_{i}}\;.
\tag{105}
$$

- The new conjugate direction is orthogonal to the previous ones:
$$
\mathbf{F}\,\mathbf{s}_{n} \cdot \mathbf{F}\,\mathbf{s}_{i} = 0\;,
\quad 1 \le i \le n-1\;.
\tag{106}
$$

To see why this is an optimally convergent method, it is sufficient to
notice that the vectors $\mathbf{F}\,\mathbf{s}_{i}$ form an orthogonal basis in
the data space. The vector from the current residual to the smallest
residual also belongs to that space. If the data size is *n*, then at most *n*
basis components are required to represent this vector, hence
no more than *n* conjugate-direction steps are required to find the
solution.

The computation template for the method of conjugate directions is

`
iterate {
    choose a direction c_n
    s_n = c_n + sum_{i<n} beta_i^(n) s_i,  with beta_i^(n) from (105)
    alpha_n = (r_{n-1} . F s_n) / (F s_n . F s_n)
    m_n = m_{n-1} + alpha_n s_n
    r_n = r_{n-1} - alpha_n F s_n
}
`
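The template can be sketched in plain Python. The operator `F`, the data `d`, and the helper names (`dot`, `matvec`, `steps`) are illustrative assumptions, not from the original text; random numbers supply the directions `c_n`:

```python
# A pure-Python sketch of the conjugate-direction template, fed with
# random directions. The operator F, data d, and all names are
# illustrative assumptions, not taken from the text.
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

F = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
d = [1.0, 2.0, 3.0]          # data to fit: find m with F m = d

random.seed(1)
m = [0.0, 0.0, 0.0]          # initial model
r = d[:]                     # residual r = d - F m
steps = []                   # saved pairs (s_i, F s_i)

for n in range(3):
    c = [random.uniform(-1.0, 1.0) for _ in m]   # random direction c_n
    s, Fs = c[:], matvec(F, c)
    for s_i, Fs_i in steps:                      # conjugation, (104)-(105)
        beta = -dot(Fs, Fs_i) / dot(Fs_i, Fs_i)
        s = [a + beta * b for a, b in zip(s, s_i)]
        Fs = [a + beta * b for a, b in zip(Fs, Fs_i)]
    alpha = dot(r, Fs) / dot(Fs, Fs)             # optimal step scale
    m = [a + alpha * b for a, b in zip(m, s)]
    r = [a - alpha * b for a, b in zip(r, Fs)]
    steps.append((s, Fs))

# The vectors F s_i form an orthogonal basis of the 3-dimensional data
# space, so three steps suffice: the residual is (numerically) zero.
print(max(abs(x) for x in r) < 1e-8)
```

Even with purely random input directions, conjugation makes the residual vanish after (at most) as many steps as the data size, as the final check confirms.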

What happens if we ``feed'' the method with gradient directions
instead of just random directions? It turns out that in this case we
need to remember, of all the previous steps, only the one
that immediately precedes the current iteration. Let us derive a
formal proof of that fact as well as some other useful formulas
related to the method of *conjugate gradients*.

According to formula (46), the new residual $\mathbf{r}_{n}$ is orthogonal to the conjugate direction $\mathbf{F}\,\mathbf{s}_{n}$. According to the orthogonality condition
(106), it is also orthogonal to all the previous
conjugate directions. Defining $\mathbf{c}_{n+1}$ equal to the gradient
$\mathbf{F}^{T}\,\mathbf{r}_{n}$ and applying the definition of the adjoint
operator, it is convenient to rewrite the orthogonality condition in
the form

$$
\mathbf{c}_{n+1} \cdot \mathbf{s}_{i} =
\mathbf{F}^{T}\,\mathbf{r}_{n} \cdot \mathbf{s}_{i} =
\mathbf{r}_{n} \cdot \mathbf{F}\,\mathbf{s}_{i} = 0\;,
\quad 1 \le i \le n\;.
\tag{107}
$$

According to formula (104), each solution step $\mathbf{s}_{i}$ is just a linear combination of the gradient $\mathbf{c}_{i}$ and
the previous solution steps. We deduce from formula
(107) that
$$
\mathbf{c}_{n} \cdot \mathbf{c}_{i} = 0\;,
\quad 1 \le i \le n-1\;.
\tag{108}
$$

In other words, in the method of conjugate gradients, the current
gradient direction is always orthogonal to all the previous
gradient directions. The iteration process constructs not only an orthogonal
basis in the data space but also an orthogonal basis in the model
space, composed of the gradient directions.
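This double orthogonality is easy to verify numerically. The short sketch below is an illustration, not the author's code; the operator and data are made up. It runs three conjugate-gradient iterations and checks that the gradient directions come out mutually orthogonal, as equation (108) states:

```python
# Numeric illustration (made-up operator and data) that the gradient
# directions c_n = F' r_{n-1} of conjugate gradients are mutually
# orthogonal, as stated by (108).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

F = [[3.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 4.0]]
FT = [list(col) for col in zip(*F)]   # adjoint (transpose) of F
d = [1.0, -2.0, 3.0]

m = [0.0, 0.0, 0.0]
r = d[:]                 # residual r = d - F m
s = c_prev = None
grads = []

for n in range(3):
    c = matvec(FT, r)    # gradient direction c_n = F' r_{n-1}
    grads.append(c)
    if c_prev is None:   # first iteration: s_1 = c_1
        s = c[:]
    else:
        beta = dot(c, c) / dot(c_prev, c_prev)   # formula (112)
        s = [a + beta * b for a, b in zip(c, s)]
    Fs = matvec(F, s)
    alpha = dot(c, c) / dot(Fs, Fs)              # formula (111)
    m = [a + alpha * b for a, b in zip(m, s)]
    r = [a - alpha * b for a, b in zip(r, Fs)]
    c_prev = c

# All pairwise dot products of distinct gradients vanish:
print(all(abs(dot(grads[i], grads[j])) < 1e-8
          for i in range(3) for j in range(i)))
```

Note that only the immediately preceding step enters each update, yet the whole family of gradients remains orthogonal.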
Now let us take a closer look at formula (105). Note
that $\mathbf{F}\,\mathbf{s}_{i}$ is simply related to the residual step at the
*i*-th iteration:

$$
\mathbf{F}\,\mathbf{s}_{i} = \frac{\mathbf{r}_{i-1} - \mathbf{r}_{i}}{\alpha_{i}}\;.
\tag{109}
$$

Substituting relationship (109) into formula
(105) and again applying the definition of the adjoint
operator, we obtain
$$
\beta_{i}^{(n)} = -\,\frac{\mathbf{c}_{n} \cdot \left(\mathbf{c}_{i} - \mathbf{c}_{i+1}\right)}
                          {\alpha_{i}\,\mathbf{F}\,\mathbf{s}_{i} \cdot \mathbf{F}\,\mathbf{s}_{i}}\;.
\tag{110}
$$

Since the gradients are orthogonal to each other,
the dot product in the numerator is equal to zero unless *i* = *n*-1. This
means that only the step $\mathbf{s}_{n-1}$, which immediately precedes the current iteration, contributes to the definition of the new solution direction in (104). This is precisely the property of the
conjugate-gradient method that we wanted to prove.
To simplify formula (110), rewrite formula (47) as

$$
\alpha_{n} =
\frac{\mathbf{r}_{n-1} \cdot \mathbf{F}\,\mathbf{s}_{n}}
     {\mathbf{F}\,\mathbf{s}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n}} =
\frac{\mathbf{c}_{n} \cdot \mathbf{c}_{n}}
     {\mathbf{F}\,\mathbf{s}_{n} \cdot \mathbf{F}\,\mathbf{s}_{n}}\;.
\tag{111}
$$

Substituting (111) into (110), we obtain
$$
\beta_{n} \;\equiv\; \beta_{n-1}^{(n)} =
\frac{\mathbf{c}_{n} \cdot \mathbf{c}_{n}}
     {\mathbf{c}_{n-1} \cdot \mathbf{c}_{n-1}}\;.
\tag{112}
$$

The computation template for the method of conjugate gradients is then

`
iterate {
    c_n = F' r_{n-1}
    if not the first iteration: beta_n = (c_n . c_n) / (c_{n-1} . c_{n-1})
    s_n = c_n + beta_n s_{n-1}        (beta_1 = 0)
    alpha_n = (c_n . c_n) / (F s_n . F s_n)
    m_n = m_{n-1} + alpha_n s_n
    r_n = r_{n-1} - alpha_n F s_n
}
`
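A minimal pure-Python rendering of this template might look as follows; the function name, the small matrix `F`, and the data `d` are assumptions for illustration. For a data size of *n*, at most *n* iterations are needed:

```python
# A minimal pure-Python rendering of the conjugate-gradient template.
# The function name, matrix F, and data d are illustrative assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

def conjugate_gradients(F, d, niter):
    FT = [list(col) for col in zip(*F)]     # adjoint (transpose) of F
    m = [0.0] * len(F[0])                   # initial model
    r = d[:]                                # residual r = d - F m
    s = c_prev = None
    for n in range(niter):
        c = matvec(FT, r)                   # gradient c_n = F' r_{n-1}
        if c_prev is None:                  # first iteration: s_1 = c_1
            s = c[:]
        else:                               # beta_n from formula (112)
            beta = dot(c, c) / dot(c_prev, c_prev)
            s = [a + beta * b for a, b in zip(c, s)]
        Fs = matvec(F, s)
        alpha = dot(c, c) / dot(Fs, Fs)     # alpha_n from formula (111)
        m = [a + alpha * b for a, b in zip(m, s)]
        r = [a - alpha * b for a, b in zip(r, Fs)]
        c_prev = c
    return m, r

F = [[2.0, 1.0],
     [1.0, 3.0]]
d = [3.0, 5.0]                   # exact solution: m = (0.8, 1.4)
m, r = conjugate_gradients(F, d, 2)
print([round(x, 6) for x in m])  # → [0.8, 1.4]
```

On this 2-by-2 example the exact solution is recovered in exactly two iterations, matching the data size, and the final residual is numerically zero.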

Stanford Exploration Project

4/27/2004