
Short Note
Spectral preconditioning

Jon Claerbout and Dave Nichols


Industry uses many scaled adjoint operators. We would like to define the best scaling functions. It is not clear how to define the ``best'' scaling, but one way is to define it as ``that which causes CG inversion to go the fastest''. The rationale for this choice is that inversion is presumably the goal, and the scaled adjoint is the first step.

Here I will briefly describe conventional wisdom about scaling and then go on to show how spectral scaling of Kirchhoff and other operators can be done using multidimensional prediction-error filters.


Iterative methods like conjugate gradients (CG) can sometimes be accelerated by a change of variables. Say we are solving $\bold B \bold m \approx \bold d$. We implicitly define new variables $\bold x$ by a ``trial solution'' $\bold m = \bold C \bold x$ (where $\bold C$ is any matrix of rank equal to the dimension of $\bold m$), getting $(\bold B \bold C) \bold x \approx \bold d$. After solving this system for $\bold x$, we merely substitute into $\bold m = \bold C \bold x$ to get the solution to the original problem. The question is whether this change of variables has saved any effort. On the surface, things seem worse. Instead of iterative applications of $\bold B$ and $\bold B'$ we have introduced iterative applications of $\bold B\bold C$ and $\bold C'\bold B'$. This is not a problem if the operator $\bold C$ is quicker than $\bold B$. Our big hope is that we have chosen $\bold C$ so that the number of iterations decreases. We have little experience to guide the choice of $\bold C$ other than that ``columns of $(\bold B \bold C)$ should have equal scales,'' so $\bold C$ could be a diagonal matrix scaling the columns of $\bold B$. The preconditioning matrix $\bold C$ need not even be square.
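A minimal numerical sketch of this change of variables, assuming a toy dense matrix $\bold B$ with badly scaled columns and a diagonal $\bold C$ that equalizes the column norms (the matrices, sizes, and iteration counts below are illustrative, not from the text):

```python
import numpy as np

def cgls(op, opT, d, nx, niter):
    """Conjugate gradients on the normal equations for op(x) ~ d."""
    x = np.zeros(nx)
    r = d - op(x)
    g = opT(r)           # gradient
    s = g.copy()         # search direction
    for _ in range(niter):
        q = op(s)
        alpha = (g @ g) / (q @ q + 1e-30)
        x += alpha * s
        r -= alpha * q
        gnew = opT(r)
        beta = (gnew @ gnew) / (g @ g + 1e-30)
        g = gnew
        s = g + beta * s
    return x, np.linalg.norm(r)

rng = np.random.default_rng(0)
B = rng.standard_normal((80, 40)) * (10.0 ** rng.uniform(-3, 0, 40))  # badly scaled columns
m_true = rng.standard_normal(40)
d = B @ m_true

# Plain CG on  B m ~ d.
_, res_plain = cgls(lambda v: B @ v, lambda v: B.T @ v, d, 40, 15)

# Preconditioned: C scales columns of B to unit norm; solve (B C) x ~ d, then m = C x.
C = 1.0 / np.linalg.norm(B, axis=0)
BC = B * C
x, res_pre = cgls(lambda v: BC @ v, lambda v: BC.T @ v, d, 40, 15)
m = C * x

print(res_plain, res_pre)
```

In this sketch the column-scaled system typically reaches a much smaller residual in the same number of iterations, which is the ``equal scales'' heuristic at work.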

The use of a preconditioner does not change the final solution unless the operator $\bold B$ has a null space. In that case, iterating with $\bold B$ leads to a solution with no component in the null space of $\bold B$. On the other hand, iterating with $\bold B\bold C$ leads to a solution with no component in the null space of $\bold B\bold C$. We will not pause for a proof since no application comes directly to mind.

Scaling the adjoint  Given the usual linearized regression between data space and model space, $\bold d \approx \bold B \bold m$, the simplest image of the model space results from application of the adjoint operator, $\hat \bold m = \bold B' \bold d$. Unless $\bold B$ is dimensionless, however, the physical units of $\hat \bold m$ do not match those of $\bold m$, so we need a scaling factor. The theoretical solution $\bold m_{\rm theor} = (\bold B'\bold B)^{-1}\bold B'\bold d$ suggests the scaling units should be those of $(\bold B'\bold B)^{-1}$. We could probe the operator $\bold B$ or its adjoint with white noise or a zero-frequency input. Bill Symes suggests we probe with the data $\bold d$ because it has the spectrum of interest. He proposes we make our image with $\hat \bold m = \bold W^2 \bold B'\bold d$, where we choose the weighting function to be:
\begin{displaymath}
\bold W^2 \quad =\quad {\bf diag} \left( { \bold B' \bold d \over \bold B' \bold B \bold B' \bold d } \right)
\EQNLABEL{adjointwt}\end{displaymath} (1)
which obviously has the correct physical units. The weight $\bold W^2$ can be thought of as a diagonal matrix containing the ratio of two images. A problem with the choice (1) is that the denominator might vanish or might even be negative. The way to stabilize any such ratio is to revise it by changing $\bold W^2 = {\bf diag}(a/b)$ to
\begin{displaymath}
\bold W^2 \quad =\quad {\bf diag} \left( { \langle a b \rangle \over \langle b^2 + \epsilon^2 \rangle } \right)
\EQNLABEL{smoothwt}\end{displaymath} (2)
where $\epsilon$ is a parameter to be chosen, and the angle braces indicate the possible need for local smoothing.
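A sketch of equations (1) and (2), assuming a toy random operator for $\bold B$, a short running average standing in for the angle-brace smoothing, and an arbitrary choice of $\epsilon$ (all sizes and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((60, 30))
d = rng.standard_normal(60)

a = B.T @ d            # numerator image, B' d
b = B.T @ (B @ a)      # denominator image, B'B B' d

def smooth(x, half=2):
    """Angle braces < . > : a short running average."""
    w = np.ones(2 * half + 1)
    return np.convolve(x, w / w.sum(), mode="same")

eps = 0.1 * np.sqrt(np.mean(b * b))             # an arbitrary stabilizer
W2 = smooth(a * b) / (smooth(b * b) + eps**2)   # stabilized diag(a/b), eq. (2)

m_hat = W2 * a         # scaled-adjoint image  W^2 B' d
```

The smoothing and $\epsilon$ keep the ratio finite where the raw denominator $\bold B'\bold B\bold B'\bold d$ would vanish or change sign.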

Since a scaled adjoint is a guess at the solution to the regression, it is logical to choose values for $\epsilon$ and the smoothing parameters that give fastest convergence of the conjugate-gradient regression. To go beyond the scaled adjoint we can use $\bold W$ as a preconditioner.

To use $\bold W$ as a preconditioner we implicitly define a new set of variables $\bold x$ by the substitution $\bold m = \bold W\bold x$. Then $\bold d \approx \bold B\bold m = \bold B\bold W\bold x$. To find $\bold x$ instead of $\bold m$ we do CG iteration with the operator $\bold B\bold W$ instead of with $\bold B$. At the end we convert from $\bold x$ back to $\bold m$ with $\bold m = \bold W\bold x$.

By equation (1), $\bold W$ has physical units inverse to those of $\bold B$. Thus the transformation $\bold B\bold W$ has no units, so the $\bold x$ variables have the physical units of data space. Note that after one iteration, $\hat\bold m = \bold W (\bold B\bold W)'\bold d = \bold W^2 \bold B'\bold d$, we have Symes' scaling. Sometimes it might be more natural to view the solution with data units $\bold x$ than with proper model units $\bold m$.
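A quick algebraic check of that one-iteration claim, using a stand-in diagonal $\bold W$ (the real $\bold W$ would come from equation (1)): the first steepest-descent/CG step on $(\bold B\bold W)\bold x \approx \bold d$, mapped back by $\bold m = \bold W\bold x$, is proportional to $\bold W^2\bold B'\bold d$.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 20))
d = rng.standard_normal(50)
W = np.abs(rng.standard_normal(20)) + 0.5   # stand-in diagonal weight

# One steepest-descent step on (B W) x ~ d, starting from x = 0:
g = W * (B.T @ d)                 # gradient (B W)' d
q = B @ (W * g)
alpha = (g @ g) / (q @ q)
m1 = W * (alpha * g)              # back to model space: m = W x

# The first step is a scaled version of Symes' image W^2 B' d:
symes = W**2 * (B.T @ d)
print(np.allclose(m1, alpha * symes))   # True
```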


In the regression $\bold d \approx \bold B \bold m$ we often think of the model as white, meaning that it is defined up to the Nyquist frequency on the computational mesh, and we think of the data as red, meaning that it is sampled significantly more densely than the Nyquist frequency. Fitting $\bold d \approx \bold B \bold m$, we can probe the operator $\bold B$ with random numbers $\bold r_m$ in model space and we can probe the operator $\bold B'$ with random numbers $\bold r_d$ in data space, getting a red model image and red synthetic data
\begin{eqnarray}
\bold m_r &=& \bold B' \bold r_d \\
\bold d_r &=& \bold B \bold r_m
\end{eqnarray} (3)
since both $\bold B$ and $\bold B'$ turn white into red. Given $\bold m_r$ we can define a whitening operator $\bold A_m$ in model space, and given $\bold d_r$ we can define a whitening operator $\bold A_d$ in data space.
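A 1-D stand-in for this probing-and-whitening step (the text calls for multidimensional prediction-error filters; here a toy low-pass plays the red-making role of $\bold B'$, and a short 1-D PEF fit by least squares plays $\bold A_m$ — all choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4096
r_d = rng.standard_normal(n)                  # white probe in data space

# Toy red-making operator B': a running average of the white probe.
kernel = np.ones(8) / 8.0
m_r = np.convolve(r_d, kernel, mode="same")   # red model image

def pef(x, p=10):
    """Fit a length-(p+1) prediction-error filter by least squares."""
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.concatenate(([1.0], -a))        # filter [1, -a1, ..., -ap]

A_m = pef(m_r)                                # whitening operator for model space
white = np.convolve(m_r, A_m, mode="valid")   # whitened model

def lag1(x):
    return abs(np.corrcoef(x[:-1], x[1:])[0, 1])

print(lag1(m_r), lag1(white))   # the PEF should kill most of the correlation
```

The red image $\bold m_r$ has strong sample-to-sample correlation; after the prediction-error filter the correlation largely disappears, which is what ``whitening'' means here.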

Red-space iteration  First we use $\bold A_m$ to implicitly define a new model space $\tilde \bold m$ (which I will later justify calling the ``data-colored'' model) by the substitution $\bold m = \bold A_m \tilde \bold m$. Substituting into $\bold d \approx \bold B \bold m$ gives $\bold d \approx \bold B \bold A_m \tilde \bold m$. We could solve this by CG iteration of $\bold B\bold A_m$. The first step is the adjoint. After this step, the data-colored model estimate $\hat{\tilde \bold m}$ and model estimate $\hat \bold m$ are

\begin{displaymath}
\begin{array}{rcl}
\hat{\tilde \bold m} &=& \bold A_m' \bold B' \bold d
\;=\; {\rm red}^{-1} \times {\rm red} \times {\rm red}
\;=\; {\rm red}
\end{array}\end{displaymath} (4)
\begin{displaymath}
\begin{array}{rcl}
\hat \bold m &=& \bold A_m \bold A_m' \bold B' \bold d
\;=\; {\rm red}^{-1} \times {\rm red}^{-1} \times {\rm red} \times {\rm red}
\;=\; {\rm white}
\end{array}\end{displaymath} (5)
which suggests naming $\hat{\tilde \bold m}$ the data-colored model.

White-space iteration  Next we try the data-space whitener $\bold A_d$, which we obtained by pouring a random-number model into $\bold B$ and designing a decon filter $\bold A_d$ on the data out. The regression $\bold 0 \approx \bold d - \bold B \bold m$ is red, so it needs a leveling weighting function to whiten it. Using $\bold A_d$ for the weighting function gives
\begin{displaymath}
\bold A_d \bold d \;\approx\; \bold A_d \bold B \bold m \quad =\quad {\rm white}
\end{displaymath} (6)
so CG iterations are done with the operator $\bold A_d \bold B$. The first iteration is the adjoint, namely,
\begin{displaymath}
\hat \bold m \quad =\quad \bold B' \bold A'_d \bold A_d \bold d \quad =\quad {\rm white}
\end{displaymath} (7)

Residual-whitening iteration  Gauss says the noise $\bold d - \bold B\bold m$ should be whitened. The data whitener $\bold A_d$ might do it, but really we should be using the residual whitener. Thus, after some number of iterations by any method we have a residual, and we can find a whitener for it, say $\bold A_n$. The fitting we should do is then
\begin{displaymath}
\bold A_n \bold d \quad\approx\quad \bold A_n \bold B \bold m
\end{displaymath} (8)

As previously, pouring random numbers in data space $\bold r_d$ into the operator $(\bold A_n \bold B)'$ gives a model space $\bold m_r = \bold B' \bold A_n' \bold r_d$, from which we can derive a whitener $\bold A_m$ to implicitly define a new model space $\tilde \bold m$ by $\bold m = \bold A_m \tilde \bold m$. Converting the regression from $\bold m$ to $\tilde \bold m$ gives

\begin{displaymath}
\bold A_n \bold d \quad\approx\quad \bold A_n \bold B \bold A_m \tilde \bold m
\end{displaymath} (9)
The first iteration is now
\begin{displaymath}
\begin{array}{rcl}
\hat{\tilde \bold m} &=& \bold A_m' \bold B' \bold A_n' \bold A_n \bold d \\
\hat \bold m &=& \bold A_m \bold A_m' \bold B' \bold A_n' \bold A_n \bold d
\end{array}\end{displaymath} (10)
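A sanity check of the operator bookkeeping in equations (9) and (10), with stand-in diagonal matrices for the whiteners (any full-rank matrices would do for this algebra; the sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((30, 12))
d = rng.standard_normal(30)
# Stand-ins for the whiteners:
A_n = np.diag(np.abs(rng.standard_normal(30)) + 0.5)   # residual whitener
A_m = np.diag(np.abs(rng.standard_normal(12)) + 0.5)   # model whitener

# The weighted, preconditioned operator of eq. (9):
G = A_n @ B @ A_m

# Its first (adjoint) step, then the map back to model space, eq. (10):
x1 = G.T @ (A_n @ d)              # \hat{\tilde m} = A_m' B' A_n' A_n d
m1 = A_m @ x1                     # \hat m = A_m A_m' B' A_n' A_n d

check = A_m @ A_m.T @ B.T @ A_n.T @ A_n @ d
print(np.allclose(m1, check))     # True
```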
This proliferation of interesting methodologies is why we need C++!


Where might spectral scaling be useful? First I should describe a similar example with a surprising outcome. With Kirchhoff operators in PVI and BEI I proposed integrating the velocity transform with the rho filter $\sqrt{-i\omega}$. Dave Nichols tested this idea and found that the rho filter accelerates the early steps but retards all the others! This suggests that deconvolution might suffer the same fate. On the other hand, the slowed convergence with $\sqrt{-i\omega}$ might simply be a consequence of Fourier methods spreading signals long distances with wraparounds that would not be part of a decon approach. In any event, improving the first iteration is a worthwhile goal.

The practical goal is not just to handle the simple rho-filter effect, but to cope with truncations and irregular data sampling. The model space is a regular mesh (although the data might be on an irregular one) so deconvolution there always makes sense. To manufacture a striking example, I would consider velocity transformation with a short, wide-offset marine cable, say extending from 1.5 to 2.0 km. The velocity space would be strongly ``dipped'' so a simple $5\times 2$ filter should be able to flatten its dip spectrum.

Stanford Exploration Project