next up previous print clean
Next: regularized least-squares inversion Up: Tang: Regularized inversion Previous: introduction

bayes inversion

If we let $\bf m$ be the model vector and $\bf d$ be the data vector, Bayes's theorem would state  
p({\bf m, d}) = \frac{p({\bf m})p({\bf d, m})}{p({\bf d})},\end{displaymath} (1)
where $p({\bf m, d})$ is the distribution of the model parameters posterior to the data $\bf d$,expressing the likelihood of the model $\bf m$ for a given data $\bf d$;$p({\bf m})$ is the probability distribution of the model $\bf m$, representing the prior information about the model; and $p(\bf d, m)$ describes how knowledge of the data modifies the prior knowledge. The quantity $p({\bf d})$ is the probability distribution of the data $\bf d$, which is a constant for a given data $\bf d$; thus $p({\bf d})$ can be seen as a scaling factor Ulrych et al. (2001). Therefore, equation (1) can be simplified as  
p({\bf m, d}) \propto p({\bf m})p({\bf d, m}).\end{displaymath} (2)

In the presence of noise, the recorded data $\bf d$ can be expressed as follows:
{\bf d = Lm + n },\end{displaymath} (3)
where $\bf n$ is the noise vector. The likelihood function $p(\bf d, m)$ is constructed by taking the difference between the observed data and the modeled data, thus
p({\bf d,m}) = p({\bf n}). \end{displaymath} (4)
If we assume the noise has a Gaussian distribution with a zero mean and variance $\sigma^2$, its probability function can be written as follows:
p(n_i) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{n_i^2}{2\sigma^2}}\end{displaymath} (5)
If each component of the noise vector is independent and the variance of each noise sample remains constant, the total probability of the noise vector is
p({\bf n}) &=& p(n_1)p(n_2)\cdots p(n_N) \  &=& \frac{1}{(2\pi...
 ...=1}^{N} n_i^2}\  &=& \alpha_1 e^{\frac{-1}{2 \sigma^2}{\bf n'n}},\end{eqnarray} (6)
where $\bf n'$ is the adjoint of $\bf n$. If we further assume that the model parameters $m_i, i=1,2,\ldots,M$have a Gaussian distribution with a zero mean and a variance $\sigma_m^2$ and are independent, the probability of $\bf m$ then is
p({\bf m}) &=& \frac{1}{(2\pi \sigma_m^2)^{M/2}} e^{\frac{-1}{2...
 ...^2}{\bf m'm}} \  &=& \alpha_2 e^{\frac{-1}{2\sigma_m^2}{\bf m'm}}\end{eqnarray} (9)
Plugging equations (8) and (10) into equation (2) yields the following posterior distribution:
p({\bf m, d}) & \propto & \alpha_1 \alpha_2 e^{-\frac{1}{2\sigm...
 ...{1}{2\sigma_m^2}{\bf m'm}-\frac{1}{2\sigma^2}{\bf (Lm-d)'(Lm-d)}}.\end{eqnarray} (11)
Since the posterior probability $p({\bf m, d})$ is a quantity describing the probability that the model is correct given a certain set of observations, we would like it to be maximized. Maximizing the posterior function is equivalent to finding the minimum of the following objective function:
J({\bf m}) = {\bf (Lm-d)'(Lm-d)} + \epsilon {\bf m'm},\end{displaymath} (13)
where $\epsilon = \frac{\sigma^2}{\sigma_m^2}$. We can see that if the model parameters are assumed to have a Gaussian distribution, the solution of maximum posterior probability under Bayes's theorem is equivalent to the damped least-squares solution. Thus, Bayesian inversion gives another perspective on the same problem, where minimizing the model residuals in the L2 norm corresponds to a Gaussian prior probability distribution. As the Gaussian distribution is a short-tailed function that is tightly centered around the mean, it will result in a smooth solution. Therefore, when a sparse solution is required, the L2 norm is no longer appropriate, and long-tailed distribution functions such as the exponential and Cauchy distributions should be chosen for the prior probability distribution of $\bf m$.

If we keep equation (8) unchanged and assume $p({\bf m})$ satisfies the exponential probability distribution, with a mean of zero:
p({\bf m}) & = & \frac{1}{2\sigma_m}e^{\frac{-1}{\sigma_m}\sum_...
 ... & = & \alpha_2 e^{\frac{-1}{\sigma_m}\sum_{i=1}^M\vert m_i\vert}.\end{eqnarray} (14)
Then the a posteriori probability becomes
p({\bf m, d}) \propto \alpha_1 \alpha_2 e^{-\frac{1}{2\sigma...
 ...(Lm-d)'(Lm-d)}-\frac{1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert}.\end{displaymath} (16)

Finding the maximum of the above function is equivalent to finding the minimum of the following objective function:  
J({\bf m}) = {\bf (Lm-d)'(Lm-d)} + \epsilon f(\bf m),\end{displaymath} (17)
where $\epsilon = \frac{2\sigma^2}{\sigma_m}$, and the regularization term, $f(\bf m)$, is defined by
f({\bf m}) = \sum_{i=1}^M \vert m_i\vert.\end{displaymath} (18)
Therefore, by adding an L1 norm regularization term to the least-squares problem, we get a sparse solution of the model parameters.

Though minimizing the objective function (17) results in a non-linear problem, it can be solved efficiently by Iteratively Re-weighted Least-Squares (IRLS). The gradient of the regularization term is
\bigtriangledown f({\bf m}) = \frac{\partial}{\partial {\bf ...
 ...\  \vdots \  \frac{m_M}{\vert m_M\vert}
 \end{array} \right). \end{displaymath} (19)
Therefore the regularization term $f(\bf m)$, which minimizes the residual in the L1 norm, can be solved in the L2 norm by introducing a diagonal weighting function $\bf W_m$:
{\bf W_m} = {\bf diag}(\sqrt{\frac{1}{\vert m_1\vert}}, \sqr...
 ...{1}{\vert m_2\vert}}, \cdots, \sqrt{\frac{1}{\vert m_M\vert}}).\end{displaymath} (20)
Then the objective function (17) can be changed to
J({\bf m}) &=& {\bf (Lm-d)'(Lm-d)} + \epsilon {\bf (W_m m)'(W_m...
 ...\  &=& {\bf \Vert Lm-d \Vert _2 + \epsilon \Vert W_m m \Vert _2}.\end{eqnarray} (21)

The derivation of the objective function for the a priori case with a Cauchy distribution is similar, except the diagonal weighting function changes to the following:
{\bf W_m} = {\bf diag}(\sqrt{\frac{1}{1+(m_1/\sigma_m)^2}}, ...
 ..._2/\sigma_m)^2}}, \cdots, \sqrt{\frac{1}{1+(m_M/\sigma_m)^2}}).\end{displaymath} (23)

Since $\bf W_m$ is a function of the model $\bf m$, if we use the gradient-based IRLS method to solve the objective function (22), we have to recompute the weight $\bf W_m$ at each iteration and the algorithm can be summarized as follows:

At the first iteration, $\bf W_m$ is set to be the identity matrix:
{\bf W_m^0=I }
 \end{displaymath} (24)
At the kth iteration, we solve the following fitting goals:
0 & \approx & {\bf Lm^k-d } \  0 & \approx & \epsilon {\bf W_m^{k-1}m^k},
 \end{eqnarray} (25)
where Wmik-1 = (|mik-1|)-1/2 in the case of the L1 norm, or $W_{mi}^{k-1} = \left[ 1+\left(\frac{m_i^k}{\sigma_m}\right)^2\right]^{-1/2}$ in the case of the Cauchy norm.

next up previous print clean
Next: regularized least-squares inversion Up: Tang: Regularized inversion Previous: introduction
Stanford Exploration Project