If we let ${\bf m}$ be the model vector and ${\bf d}$ be the data vector, Bayes's theorem states
\begin{displaymath}
p({\bf m}\vert{\bf d}) = \frac{p({\bf m})\,p({\bf d}\vert{\bf m})}{p({\bf d})},
\end{displaymath}
(1)
where $p({\bf m}\vert{\bf d})$ is the distribution of the model parameters posterior to the data, expressing the likelihood of the model ${\bf m}$ given the data ${\bf d}$; $p({\bf m})$ is the probability distribution of the model ${\bf m}$, representing the prior information about the model; and $p({\bf d}\vert{\bf m})$ describes how knowledge of the data modifies the prior knowledge. The quantity $p({\bf d})$ is the probability distribution of the data ${\bf d}$, which is a constant for a given data set; thus $p({\bf d})$ can be seen as a scaling factor (Ulrych et al., 2001).
Therefore, equation (1) can be simplified to
\begin{displaymath}
p({\bf m}\vert{\bf d}) \propto p({\bf m})\,p({\bf d}\vert{\bf m}).
\end{displaymath}
(2)
In the presence of noise, the recorded data ${\bf d}$ can be expressed as follows:
\begin{displaymath}
{\bf d} = {\bf L m} + {\bf n},
\end{displaymath}
(3)
where ${\bf n}$ is the noise vector and ${\bf L}$ is the linear modeling operator. The likelihood function $p({\bf d}\vert{\bf m})$ is constructed by taking the difference between the observed data and the modeled data; thus
\begin{displaymath}
p({\bf d}\vert{\bf m}) = p({\bf n}).
\end{displaymath}
(4)
If we assume the noise has a Gaussian distribution with zero mean and variance $\sigma^2$, its probability function can be written as follows:
\begin{displaymath}
p(n_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{n_i^2}{2\sigma^2}}.
\end{displaymath}
(5)
If each component of the noise vector is independent and the variance of each noise sample remains constant, the total probability of the noise vector is
\begin{eqnarray}
p({\bf n}) &=& p(n_1)\,p(n_2)\cdots p(n_N) \\
&=& \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{\frac{-1}{2\sigma^2}\sum_{i=1}^{N} n_i^2} \\
&=& \alpha_1\, e^{\frac{-1}{2\sigma^2}{\bf n}'{\bf n}},
\end{eqnarray}
(6)-(8)
where ${\bf n}'$ is the adjoint of ${\bf n}$ and $\alpha_1 = (2\pi\sigma^2)^{-N/2}$. If we further assume that the model parameters have a Gaussian distribution with zero mean and variance $\sigma_m^2$ and are independent, the probability of ${\bf m}$ is then
\begin{eqnarray}
p({\bf m}) &=& \frac{1}{(2\pi\sigma_m^2)^{M/2}}\, e^{\frac{-1}{2\sigma_m^2}{\bf m}'{\bf m}} \\
&=& \alpha_2\, e^{\frac{-1}{2\sigma_m^2}{\bf m}'{\bf m}},
\end{eqnarray}
(9)-(10)

where $\alpha_2 = (2\pi\sigma_m^2)^{-M/2}$ and $M$ is the number of model parameters.
Plugging equations (8) and (10) into equation (2) yields the following posterior distribution:
\begin{eqnarray}
p({\bf m}\vert{\bf d}) &\propto& \alpha_1\alpha_2\, e^{-\frac{1}{2\sigma_m^2}{\bf m}'{\bf m} - \frac{1}{2\sigma^2}{\bf n}'{\bf n}} \\
&\propto& \alpha_1\alpha_2\, e^{-\frac{1}{2\sigma_m^2}{\bf m}'{\bf m} - \frac{1}{2\sigma^2}({\bf Lm-d})'({\bf Lm-d})}.
\end{eqnarray}
(11)-(12)
Since the posterior probability $p({\bf m}\vert{\bf d})$ is a quantity describing the probability that the model is correct given a certain set of observations, we would like it to be maximized. Maximizing the posterior function is equivalent to finding the minimum of the following objective function:
\begin{displaymath}
J({\bf m}) = ({\bf Lm-d})'({\bf Lm-d}) + \epsilon\, {\bf m}'{\bf m},
\end{displaymath}
(13)
where $\epsilon = \sigma^2/\sigma_m^2$; this follows from taking the negative logarithm of equation (12) and multiplying through by $2\sigma^2$. We can see that if the model parameters are assumed to have a Gaussian distribution, the maximum-posterior solution under Bayes's theorem is equivalent to the damped least-squares solution. Thus, Bayesian inversion gives another perspective on the same problem, where minimizing the model residuals in the L2 norm corresponds to a Gaussian prior probability distribution.
Because the Gaussian distribution is a short-tailed function, tightly centered around its mean, it results in a smooth solution. Therefore, when a sparse solution is required, the L2 norm is no longer appropriate, and long-tailed distribution functions such as the exponential and Cauchy distributions should be chosen for the prior probability distribution of ${\bf m}$.
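As a concrete illustration of the damped least-squares solution implied by equation (13), the following minimal sketch (in Python with NumPy; the operator ${\bf L}$, the noise level, and the variances are illustrative assumptions, not values from the text) forms and solves the normal equations $({\bf L}'{\bf L} + \epsilon{\bf I}){\bf m} = {\bf L}'{\bf d}$:

```python
import numpy as np

# Toy forward problem d = Lm + n (equation 3); the operator, model,
# and noise level here are illustrative choices, not from the text.
rng = np.random.default_rng(0)
N, M = 50, 20
L = rng.standard_normal((N, M))
m_true = rng.standard_normal(M)
sigma = 0.1                              # noise standard deviation
d = L @ m_true + sigma * rng.standard_normal(N)

# Damped least-squares solution of equation (13):
#   m = (L'L + eps I)^{-1} L'd,  with  eps = sigma^2 / sigma_m^2.
sigma_m = 1.0                            # assumed prior model variance
eps = sigma**2 / sigma_m**2
m_est = np.linalg.solve(L.T @ L + eps * np.eye(M), L.T @ d)
print("data residual norm:", np.linalg.norm(L @ m_est - d))
```

Note that increasing $\sigma_m$ weakens the prior and reduces the damping, recovering the ordinary least-squares solution in the limit.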
If we keep equation (8) unchanged and assume ${\bf m}$ satisfies the exponential probability distribution, with a mean of zero:
\begin{eqnarray}
p({\bf m}) &=& \frac{1}{(2\sigma_m)^M}\, e^{\frac{-1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert} \\
&=& \alpha_2\, e^{\frac{-1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert}.
\end{eqnarray}
(14)-(15)
Then the a posteriori probability becomes
\begin{displaymath}
p({\bf m}\vert{\bf d}) \propto \alpha_1\alpha_2\, e^{-\frac{1}{2\sigma^2}({\bf Lm-d})'({\bf Lm-d}) - \frac{1}{\sigma_m}\sum_{i=1}^{M}\vert m_i\vert}.
\end{displaymath}
(16)
Finding the maximum of the above function is equivalent to finding the minimum of the following objective function:
\begin{displaymath}
J({\bf m}) = ({\bf Lm-d})'({\bf Lm-d}) + \epsilon f({\bf m}),
\end{displaymath}
(17)

where $\epsilon = 2\sigma^2/\sigma_m$, and the regularization term $f({\bf m})$ is defined by
\begin{displaymath}
f({\bf m}) = \sum_{i=1}^{M}\vert m_i\vert.
\end{displaymath}
(18)
Therefore, by adding an L1 norm regularization term to the least-squares problem, we get a sparse solution
of the model parameters.
Though minimizing the objective function (17) results in a non-linear problem, it can be solved efficiently by Iteratively Reweighted Least Squares (IRLS). The gradient of the regularization term is
\begin{displaymath}
\nabla f({\bf m}) = \frac{\partial f}{\partial {\bf m}} = \left( \begin{array}{c} \frac{m_1}{\vert m_1\vert} \\ \vdots \\ \frac{m_M}{\vert m_M\vert} \end{array} \right).
\end{displaymath}
(19)
Therefore the regularization term $f({\bf m})$, which measures the model residual in the L1 norm, can be handled in the L2 norm by introducing a diagonal weighting matrix ${\bf W_m}$:
\begin{displaymath}
{\bf W_m} = {\rm diag}\left(\sqrt{\frac{1}{\vert m_1\vert}},\ \sqrt{\frac{1}{\vert m_2\vert}},\ \cdots,\ \sqrt{\frac{1}{\vert m_M\vert}}\right).
\end{displaymath}
(20)
Then the objective function (17) can be rewritten as
\begin{eqnarray}
J({\bf m}) &=& ({\bf Lm-d})'({\bf Lm-d}) + \epsilon\, ({\bf W_m m})'({\bf W_m m}) \\
&=& \Vert {\bf Lm-d} \Vert_2^2 + \epsilon\, \Vert {\bf W_m m} \Vert_2^2.
\end{eqnarray}
(21)-(22)
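To see explicitly why the weighted L2 penalty reproduces the L1 term, note that with the weights of equation (20),
\begin{displaymath}
({\bf W_m m})'({\bf W_m m}) = \sum_{i=1}^{M}\left(\frac{m_i}{\sqrt{\vert m_i\vert}}\right)^2 = \sum_{i=1}^{M}\frac{m_i^2}{\vert m_i\vert} = \sum_{i=1}^{M}\vert m_i\vert = f({\bf m}).
\end{displaymath}
In practice the weights are held fixed at the previous model estimate, so each IRLS pass is an ordinary least-squares problem, and the identity holds exactly only at convergence.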
The derivation of the objective function for the case of a Cauchy prior distribution is similar, except that the diagonal weighting matrix changes to the following:
\begin{displaymath}
{\bf W_m} = {\rm diag}\left(\sqrt{\frac{1}{1+(m_1/\sigma_m)^2}},\ \sqrt{\frac{1}{1+(m_2/\sigma_m)^2}},\ \cdots,\ \sqrt{\frac{1}{1+(m_M/\sigma_m)^2}}\right).
\end{displaymath}
(23)
Since ${\bf W_m}$ is a function of the model ${\bf m}$, if we use the gradient-based IRLS method to minimize objective function (22), we have to recompute the weight ${\bf W_m}$ at each iteration. The algorithm can be summarized as follows, with a code sketch after the list:
1. At the first iteration, ${\bf W_m}$ is set to the identity matrix:
\begin{displaymath}
{\bf W_m^0} = {\bf I}.
\end{displaymath}
(24)

2. At the $k$-th iteration, we solve the following fitting goals:
\begin{eqnarray}
0 &\approx& {\bf Lm}^k - {\bf d} \\
0 &\approx& \epsilon\, {\bf W_m}^{k-1}{\bf m}^k,
\end{eqnarray}
(25)-(26)
where $W_{m,i}^{k-1} = \left(\vert m_i^{k-1}\vert\right)^{-1/2}$ in the case of the L1 norm,
or $W_{m,i}^{k-1} = \left(1+(m_i^{k-1}/\sigma_m)^2\right)^{-1/2}$ in the case of the Cauchy norm.
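The following minimal sketch implements these fitting goals in Python with NumPy (the solver choice, the damping value, the iteration count, and the small constant guarding the L1 weights against division by zero are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

def irls(L, d, eps, niter=10, prior="l1", sigma_m=1.0, delta=1e-8):
    """IRLS for the fitting goals (25)-(26).  The damping eps, the
    iteration count, and delta (which guards the L1 weights against
    division by zero) are illustrative choices."""
    N, M = L.shape
    W = np.ones(M)                       # W_m^0 = I (equation 24)
    m = np.zeros(M)
    for _ in range(niter):
        # Stack the two fitting goals into one least-squares system:
        #   [ L         ]        [ d ]
        #   [ eps * W_m ] m  ~=  [ 0 ]
        A = np.vstack([L, eps * np.diag(W)])
        b = np.concatenate([d, np.zeros(M)])
        m, *_ = np.linalg.lstsq(A, b, rcond=None)
        # Recompute the diagonal weights from the current model,
        # to be used as W_m^{k-1} in the next iteration.
        if prior == "l1":                # W_mi = |m_i|^{-1/2}
            W = 1.0 / np.sqrt(np.abs(m) + delta)
        else:                            # Cauchy: (1+(m_i/sigma_m)^2)^{-1/2}
            W = 1.0 / np.sqrt(1.0 + (m / sigma_m) ** 2)
    return m

# Example call (hypothetical L and d): m_est = irls(L, d, eps=0.1)
```

Each pass solves a single linear least-squares problem with the weights frozen at the previous model, so any standard least-squares or conjugate-gradient solver could be used in place of `np.linalg.lstsq`.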