previous up next print clean
Next: REGULARIZATION Up: Berryman: Nonlinear least squares Previous: INTRODUCTION


In an abstract setting, let $\{d_i\}$ for $i=1,\ldots,m$ be a set of m measurements. This data may be organized into a data m-vector ${\bf d}$. Let ${\bf s}$be an n-vector containing the n parameters for the model/solution space to be constructed. Then, let $N_i({\bf s})$ be a (possibly nonlinear) function of the model parameter vector ${\bf s}$ such that $N_i({\bf s}) = d_i$in infinite precision forward modeling of the physical problem of interest. If the data di and the functions Ni are known, then the inversion problem is to find an approximate ${\bf s}$ that satisfies the data (i. e., has as small a data misfit as possible) in some sense.

If life were simple, the operators Ni would be analytically invertible, so one might think it could be possible to write the solution to the inverse problem as ${\bf s} = N_i^{-1}(d_i)$.However, the result is certainly a strange looking formula: While it is perfectly sensible that the domain of each function Ni is a set of n model parameters, and that the range is a single datum, the postulated inverse function with multiple model outputs from single datum inputs is clearly not sensible. In most cases, there will simply not be sufficient information contained in a single datum to determine multiple model parameters.

For strictly linear problems, there might exist an appropriate operator X acting on the full data m-vector such that $X({\bf d}) = {\bf s}$ produces the model n-vector. However, even for linear problems, the existence of such an operator is not necessarily guaranteed since the sizes m and n of the two vector spaces will generally differ. This difficult situation leads to the introduction of pseudoinverses in linear problems or to the methods to be described now in nonlinear problems.

So, instead of using some hypothetical analytical approach, the desired solution to the inverse problem will make simultaneous use of all (or at least most) of the data while determining some sort of optimal fit to a chosen set of model parameters. This situation is common in data analysis and often leads to formulation of least-squares methods for this type of inversion problem.

An approach to this problem based on nonlinear least-squares inversion considers the nonlinear function

F(s) = _i=1^m [d_i-N_i(s)]^2,   which is simply the sum of the squares of the data residuals ri = di - Ni. I formulate an ``evolution'' equation for the model vector ${\bf s} = {\bf s}(\eta)$, where $\eta$ is an evolution parameter - treated as a continuous generalization of an iteration number. In particular, I must ultimately discretize this parameter in my final numerical method, in which case the particular values of $\eta= 0, 1, 2, \ldots$ will be exactly the iteration numbers. But for purposes of explaining the method, it will prove useful to treat $\eta$ initially as a continuous parameter.

The thrust of the nonlinear least-squares method is to produce a model ${\bf s}$ that minimizes the least-squares error function $F({\bf s})$. One way to guarantee [see, for example, Jeffrey and Rosner (1986) and Lu and Berryman (1991)] that a local minimum is achieved is to pose the evolution equation for ${\bf s}(\eta)$ in a way that guarantees the value of $F({\bf s})$ decreases monotonically as $\eta$ increases (or at each iteration step for the discretized problem). Taking the derivative of $F({\bf s})$ with respect to the evolution parameter, the chain rule gives

F(s) = - 2 _j=1^n _i=1^m [d_i-N_i(s)]N_is_j s_j.   It is desired that the evolution equation for ${\bf s}$ be chosen to guarantee that (derivativeofF) is $\le 0$. It is easy to see that (derivativeofF) will always be negative or zero if the evolution equations for the model parameters sj are chosen to be

s_j = () _i=1^m[d_i-N_i(s)]N_is_j,   where $\gamma(\eta)$ is a positive parameter to be determined. Then,

F(s) = - 2 ()_j=1^n [_i=1^m [d_i-N_i(s)]N_is_j]^2,   so each term in the sum over j in (derivativeofF2) is a square quantity and must be positive or vanish identically. The choice (evolutioneqn) of evolution then clearly guarantees the desired monotonically decreasing behavior of the total squared error functional as $\eta\to \infty$.

To implement this procedure, a choice of discretization must be made together with a choice for $\gamma(\eta)$. The obvious choice of discretization for the evolution equation (evolutioneqn) is

s_j(k+1) - s_j(k) s_j = () [d_i-N_i(s(k))]N_i(s)s_j|_s= s(k),   where I have taken the finite step size for the evolution to be $\Delta\eta= 1$. In the continuous evolution problem, the infinitesimal changes in the evolution parameter guarantee similarly infinitesimal changes in the model vector ${\bf s}(\eta)$ and therefore this makes the choice of $\gamma(\eta)$ largely arbitrary. In contrast, for the discretized problem, the evolution of ${\bf s}$ is finite at each step and care must be taken not to violate the desired condition that the least-squares functional should decrease at each step. Such violations may occur if the step size is taken too large.

Reconsidering (Fofs), I find that, by keeping only those terms proportional to the first and second powers of the components of $\Delta{\bf s}$, I have

F(s(k+1)) = _i=1^m [d_i - N_i(s(k)+s)]^2 _i=1^m [d_i-N_i(s(k))]^2 - 2 _j=1^n _i=1^m [d_i-N_i(s)]N_is_j s_j + _i=1^m [_j=1^nN_is_j s_j]^2.   After substituting (discreteevolution), the parameter $\gamma(\eta)$ can now be chosen so that the right-hand side of (discreteFofs) decreases as much as possible at each step of the iteration scheme. The optimum choice is easily shown to be

(k+1) =                                  _j=1^n [_i=1^m [d_i-N_i(s(k))]N_i s_j]^2 _j,j'[ _i'=1^m [d_i'-N_i'(s(k))]N_i's_j _i=1^mN_is_j N_is_j' _i''=1^m [d_i''-N_i''(s(k))]N_i'' s_j'],   since this minimizes the right-hand side of (discreteFofs).

It will prove enlightening to compare the procedure just presented with the well-known method of conjugate gradients (Hestenes and Stiefel, 1952; Fomel, 1996) for a linear operator such that $N_i({\bf s}) = ({\bf L}{\bf s})_i$. Then, model updates are obtained using

s^(k+1) = s^(k) + ^(k+1)u^(k+1),   where, for the discrete iterative problem, I put the iteration numbers in superscripts. The new vector ${\bf u}^{(k+1)}$ is the (somehow) known update direction in the model space and $\gamma^{(k+1)}$ is a parameter used to optimize the step size. The updated residual vector is again given by

r^(k+1) = d - Ls^(k+1) = r^(k) - ^(k+1)Lu^(k+1).   The magnitude of the residual vector is easily shown to decrease most at each step of the iteration sequence if the optimization parameter satisfies

^(k+1) = (r^(k),Lu^(k+1)) ||Lu^(k+1)||^2.   Equation (cgparameter) is exactly analogous to the formula (lambda) obtained in the nonlinear least-squares problem if the components of the residual are given by

r_i^(k) = d_i - N_i(s^(k)),   the matrix elements of the linear operator are

(L)_ij = L_ij = N_is_j|_ s= s^(k),   and the update direction vector satisfies

^(k+1)u^(k+1) = s,   where the components of the right-hand side of (update) are given by (evolutioneqn) evaluated at ${\bf s}^{(k)}$.With this identification, it becomes clear that the nonlinear least-squares method outlined above is one natural generalization of the well-known conjugate gradients technique.

previous up next print clean
Next: REGULARIZATION Up: Berryman: Nonlinear least squares Previous: INTRODUCTION
Stanford Exploration Project