If we let ${\bf m}$ be the model vector and ${\bf d}$ be the data vector, Bayes's theorem states
\begin{displaymath}
p({\bf m}\vert{\bf d}) = \frac{p({\bf m})\,p({\bf d}\vert{\bf m})}{p({\bf d})},
\end{displaymath}
(1)
where $p({\bf m}\vert{\bf d})$ is the distribution of the model parameters posterior to the data, expressing the likelihood of the model ${\bf m}$ given the data ${\bf d}$; $p({\bf m})$ is the probability distribution of the model ${\bf m}$, representing the prior information about the model; and $p({\bf d}\vert{\bf m})$ describes how knowledge of the data modifies the prior knowledge. The quantity $p({\bf d})$ is the probability distribution of the data ${\bf d}$, which is a constant for a given data set; thus $p({\bf d})$ can be seen as a scaling factor (Ulrych et al., 2001).
Therefore, equation (1) can be simplified to
\begin{displaymath}
p({\bf m}\vert{\bf d}) \propto p({\bf m})\,p({\bf d}\vert{\bf m}).
\end{displaymath}
(2)
In the presence of noise, the recorded data ${\bf d}$ can be expressed as follows:
\begin{displaymath}
{\bf d} = {\bf L m} + {\bf n},
\end{displaymath}
(3)
where ${\bf n}$ is the noise vector and ${\bf L}$ is the linear modeling operator. The likelihood function $p({\bf d}\vert{\bf m})$ is constructed by taking the difference between the observed data and the modeled data; thus
\begin{displaymath}
p({\bf d}\vert{\bf m}) = p({\bf n}).
\end{displaymath}
(4)
If we assume the noise has a Gaussian distribution with zero mean and variance $\sigma^2$, its probability function can be written as follows:
\begin{displaymath}
p(n_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{n_i^2}{2\sigma^2}}.
\end{displaymath}
(5)
If each component of the noise vector is independent and the variance of each noise sample remains constant, the total probability of the noise vector is
\begin{eqnarray}
p({\bf n}) &=& p(n_1)\,p(n_2)\cdots p(n_N) \\
&=& \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{\frac{-1}{2\sigma^2}\sum_{i=1}^{N} n_i^2} \\
&=& \alpha_1\, e^{\frac{-1}{2\sigma^2}{\bf n}'{\bf n}},
\end{eqnarray}
(6)-(8)
where ${\bf n}'$ is the adjoint of ${\bf n}$ and $\alpha_1 = (2\pi\sigma^2)^{-N/2}$. If we further assume that the model parameters have a Gaussian distribution with zero mean and variance $\sigma_m^2$ and are independent, the probability of ${\bf m}$ is then
\begin{eqnarray}
p({\bf m}) &=& \frac{1}{(2\pi\sigma_m^2)^{M/2}}\, e^{\frac{-1}{2\sigma_m^2}{\bf m}'{\bf m}} \\
&=& \alpha_2\, e^{\frac{-1}{2\sigma_m^2}{\bf m}'{\bf m}},
\end{eqnarray}
(9)-(10)

where $\alpha_2 = (2\pi\sigma_m^2)^{-M/2}$ and $M$ is the number of model parameters.
Plugging equations (8) and (10) into equation (2) yields the following posterior distribution:
\begin{eqnarray}
p({\bf m}\vert{\bf d}) &\propto& \alpha_1\alpha_2\, e^{-\frac{1}{2\sigma_m^2}{\bf m}'{\bf m} - \frac{1}{2\sigma^2}{\bf n}'{\bf n}} \\
&\propto& \alpha_1\alpha_2\, e^{-\frac{1}{2\sigma_m^2}{\bf m}'{\bf m} - \frac{1}{2\sigma^2}({\bf Lm-d})'({\bf Lm-d})}.
\end{eqnarray}
(11)-(12)
Since the posterior probability $p({\bf m}\vert{\bf d})$ is a quantity describing the probability that the model is correct given a certain set of observations, we would like it to be maximized. Maximizing the posterior function is equivalent to finding the minimum of the following objective function:
\begin{displaymath}
J({\bf m}) = ({\bf Lm-d})'({\bf Lm-d}) + \epsilon\, {\bf m}'{\bf m},
\end{displaymath}
(13)
where $\epsilon = \sigma^2/\sigma_m^2$; this follows from taking the negative logarithm of equation (12) and multiplying through by $2\sigma^2$. We can see that if the model parameters are assumed to have a Gaussian distribution, the maximum-posterior solution under Bayes's theorem is equivalent to the damped least-squares solution. Thus, Bayesian inversion gives another perspective on the same problem, where minimizing the model residuals in the L2 norm corresponds to a Gaussian prior probability distribution.
Because the Gaussian distribution is a short-tailed function, tightly centered around its mean, it results in a smooth solution. Therefore, when a sparse solution is required, the L2 norm is no longer appropriate, and long-tailed distribution functions such as the exponential and Cauchy distributions should be chosen for the prior probability distribution of ${\bf m}$.
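As a concrete illustration of the damped least-squares solution implied by equation (13), the following minimal sketch (in Python with NumPy; the operator ${\bf L}$, the noise level, and the variances are illustrative assumptions, not values from the text) forms and solves the normal equations $({\bf L}'{\bf L} + \epsilon{\bf I}){\bf m} = {\bf L}'{\bf d}$:

```python
import numpy as np

# Toy forward problem d = Lm + n (equation 3); the operator, model,
# and noise level here are illustrative choices, not from the text.
rng = np.random.default_rng(0)
N, M = 50, 20
L = rng.standard_normal((N, M))
m_true = rng.standard_normal(M)
sigma = 0.1                              # noise standard deviation
d = L @ m_true + sigma * rng.standard_normal(N)

# Damped least-squares solution of equation (13):
#   m = (L'L + eps I)^{-1} L'd,  with  eps = sigma^2 / sigma_m^2.
sigma_m = 1.0                            # assumed prior model variance
eps = sigma**2 / sigma_m**2
m_est = np.linalg.solve(L.T @ L + eps * np.eye(M), L.T @ d)
print("data residual norm:", np.linalg.norm(L @ m_est - d))
```

Note that increasing $\sigma_m$ weakens the prior and reduces the damping, recovering the ordinary least-squares solution in the limit.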
If we keep equation (8) unchanged and assume ${\bf m}$ satisfies the exponential probability distribution, with a mean of zero:
\begin{eqnarray}
p({\bf m}) &=& \frac{1}{(2\sigma_m)^M}\, e^{\frac{-1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert} \\
&=& \alpha_2\, e^{\frac{-1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert}.
\end{eqnarray}
(14)-(15)
Then the a posteriori probability becomes
\begin{displaymath}
p({\bf m}\vert{\bf d}) \propto \alpha_1\alpha_2\, e^{-\frac{1}{2\sigma^2}({\bf Lm-d})'({\bf Lm-d}) - \frac{1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert}.
\end{displaymath}
(16)
Finding the maximum of the above function is equivalent to finding the minimum of the following objective function:
\begin{displaymath}
J({\bf m}) = ({\bf Lm-d})'({\bf Lm-d}) + \epsilon f({\bf m}),
\end{displaymath}
(17)

where $\epsilon = 2\sigma^2/\sigma_m$, and the regularization term $f({\bf m})$ is defined by
\begin{displaymath}
f({\bf m}) = \sum_{i=1}^{M}\vert m_i\vert.
\end{displaymath}
(18)
Therefore, by adding an L1 norm regularization term to the least-squares problem, we get a sparse solution
of the model parameters.
Though minimizing the objective function (17) results in a non-linear problem, it can be solved efficiently by Iteratively Reweighted Least Squares (IRLS). The gradient of the regularization term is
\begin{displaymath}
\nabla f({\bf m}) = \frac{\partial f}{\partial {\bf m}} = \left( \begin{array}{c} \frac{m_1}{\vert m_1\vert} \\ \vdots \\ \frac{m_M}{\vert m_M\vert} \end{array} \right).
\end{displaymath}
(19)
Therefore the regularization term $f({\bf m})$, which measures the model residual in the L1 norm, can be handled in the L2 norm by introducing a diagonal weighting matrix ${\bf W_m}$:
\begin{displaymath}
{\bf W_m} = {\rm diag}\left(\sqrt{\frac{1}{\vert m_1\vert}},\ \sqrt{\frac{1}{\vert m_2\vert}},\ \cdots,\ \sqrt{\frac{1}{\vert m_M\vert}}\right).
\end{displaymath}
(20)
Then the objective function (17) can be rewritten as
\begin{eqnarray}
J({\bf m}) &=& ({\bf Lm-d})'({\bf Lm-d}) + \epsilon\, ({\bf W_m m})'({\bf W_m m}) \\
&=& \Vert {\bf Lm-d} \Vert_2^2 + \epsilon\, \Vert {\bf W_m m} \Vert_2^2.
\end{eqnarray}
(21)-(22)
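To see explicitly why the weighted L2 penalty reproduces the L1 term, note that with the weights of equation (20),
\begin{displaymath}
({\bf W_m m})'({\bf W_m m}) = \sum_{i=1}^{M}\left(\frac{m_i}{\sqrt{\vert m_i\vert}}\right)^2 = \sum_{i=1}^{M}\frac{m_i^2}{\vert m_i\vert} = \sum_{i=1}^{M}\vert m_i\vert = f({\bf m}).
\end{displaymath}
In practice the weights are held fixed at the previous model estimate, so each IRLS pass is an ordinary least-squares problem, and the identity holds exactly only at convergence.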
The derivation of the objective function for the case of a Cauchy prior distribution is similar, except that the diagonal weighting matrix changes to the following:
\begin{displaymath}
{\bf W_m} = {\rm diag}\left(\sqrt{\frac{1}{1+(m_1/\sigma_m)^2}},\ \sqrt{\frac{1}{1+(m_2/\sigma_m)^2}},\ \cdots,\ \sqrt{\frac{1}{1+(m_M/\sigma_m)^2}}\right).
\end{displaymath}
(23)
Since ${\bf W_m}$ is a function of the model ${\bf m}$, if we use the gradient-based IRLS method to minimize objective function (22), we have to recompute the weight ${\bf W_m}$ at each iteration. The algorithm can be summarized as follows, with a code sketch after the list:
1. At the first iteration, ${\bf W_m}$ is set to the identity matrix:
\begin{displaymath}
{\bf W_m^0} = {\bf I}.
\end{displaymath}
(24)

2. At the $k$-th iteration, we solve the following fitting goals:
\begin{eqnarray}
0 &\approx& {\bf Lm}^k - {\bf d} \\
0 &\approx& \epsilon\, {\bf W_m}^{k-1}{\bf m}^k,
\end{eqnarray}
(25)-(26)
where $W_{m,i}^{k-1} = \left(\vert m_i^{k-1}\vert\right)^{-1/2}$ in the case of the L1 norm,
or $W_{m,i}^{k-1} = \left(1+(m_i^{k-1}/\sigma_m)^2\right)^{-1/2}$ in the case of the Cauchy norm.
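The following minimal sketch implements these fitting goals in Python with NumPy (the solver choice, the damping value, the iteration count, and the small constant guarding the L1 weights against division by zero are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

def irls(L, d, eps, niter=10, prior="l1", sigma_m=1.0, delta=1e-8):
    """IRLS for the fitting goals (25)-(26).  The damping eps, the
    iteration count, and delta (which guards the L1 weights against
    division by zero) are illustrative choices."""
    N, M = L.shape
    W = np.ones(M)                       # W_m^0 = I (equation 24)
    m = np.zeros(M)
    for _ in range(niter):
        # Stack the two fitting goals into one least-squares system:
        #   [ L         ]        [ d ]
        #   [ eps * W_m ] m  ~=  [ 0 ]
        A = np.vstack([L, eps * np.diag(W)])
        b = np.concatenate([d, np.zeros(M)])
        m, *_ = np.linalg.lstsq(A, b, rcond=None)
        # Recompute the diagonal weights from the current model,
        # to be used as W_m^{k-1} in the next iteration.
        if prior == "l1":                # W_mi = |m_i|^{-1/2}
            W = 1.0 / np.sqrt(np.abs(m) + delta)
        else:                            # Cauchy: (1+(m_i/sigma_m)^2)^{-1/2}
            W = 1.0 / np.sqrt(1.0 + (m / sigma_m) ** 2)
    return m

# Example call (hypothetical L and d): m_est = irls(L, d, eps=0.1)
```

Each pass solves a single linear least-squares problem with the weights frozen at the previous model, so any standard least-squares or conjugate-gradient solver could be used in place of `np.linalg.lstsq`.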