
MULTIVARIATE SPECTRUM

A common spectrum is the Fourier spectrum. More fundamentally, a spectrum is a decomposition of a model space or data space into components. The components are in some sense independent; more specifically, the components are orthogonal to one another. Another well-known spectrum is provided by eigenvectors and eigenvalues. In statistical signal processing we handle a third type of spectrum, the multivariate spectrum.

When working on an optimization problem, we begin from the residuals between theory and practice. These residuals can be scaled to make new optimization residuals before we start minimizing their energy. What scaling should we use? The scaling can be a simple weighting function or a filter. A filter is simply a weighting function in Fourier space.
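Here is a small numerical sketch of that last statement (in Python, assuming numpy is available; the residuals and the two-term filter below are arbitrary illustrative choices): convolving residuals with a filter gives the same result as weighting their Fourier transform by the filter's transfer function.

\begin{verbatim}
# Sketch: a filter acts as a weighting function in Fourier space.
import numpy as np

rng = np.random.default_rng(0)
r = rng.standard_normal(128)      # some residuals in physical space
f = np.array([1.0, -0.5])         # a short filter (illustrative choice)

n = len(r) + len(f) - 1           # transform length that avoids wrap-around
filtered_time    = np.convolve(r, f)
filtered_fourier = np.fft.ifft(np.fft.fft(r, n) * np.fft.fft(f, n)).real

print(np.allclose(filtered_time, filtered_fourier))   # True
\end{verbatim}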

The basic idea of common sense, which also comes to us as a result proven by Gauss and from the theory of statistical signal processing, is this: the optimization residuals should be roughly of equal scale. This makes sense because squaring magnifies scale: anything small will be ignored, while anything large will dominate. Scaling the optimization residuals to a common range makes them all equally influential on the final solution. Not only should optimization residuals be of like scale in physical space, they should be of like scale in Fourier space, eigenvector space, or any other space we might use to represent them. This implies that the optimization residuals should be uncorrelated. If they were correlated, their spectrum would not be white; a spectrum that is not white means components of differing sizes in Fourier space. Residuals should be the same size as one another in physical space, and likewise in Fourier space. Thus the optimization residuals should be orthogonal and of unit scale, much as Fourier components and eigenvectors are orthonormal.

Let us approach the problem backwards. Suppose we have two random variables that we take to be the ideal optimization residuals, $x_1$ and $x_2$. In reality we may have a few of them or trillions. In the language of statistics, the optimization residuals are expected to have zero mean, an idea that is formalized by writing $E(x_1)=0$ and $E(x_2)=0$. Likewise these ideal optimization residuals have equal energy, $E(x_1^2)=1$ and $E(x_2^2)=1$. Finally, these two optimization residuals are uncorrelated, a condition which is written as $E(x_1 x_2)=0$. The expectation symbol $E()$ is like a summation over many instances of the random variable.
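A quick numerical sketch (Python with numpy; the random draws below merely stand in for ideal residuals) shows these three expectations being approximated by sample averages over many trials:

\begin{verbatim}
# Sketch: sample averages standing in for E() on ideal residuals.
import numpy as np

rng = np.random.default_rng(0)
ntrials = 100000
x1 = rng.standard_normal(ntrials)
x2 = rng.standard_normal(ntrials)

print(x1.mean(), x2.mean())              # near 0:  E(x1) = E(x2) = 0
print((x1**2).mean(), (x2**2).mean())    # near 1:  E(x1^2) = E(x2^2) = 1
print((x1 * x2).mean())                  # near 0:  E(x1 x2) = 0
\end{verbatim}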

Now suppose there exists a transformation $\bold B$ from these ideal optimization residuals to two experimental residuals $y_1$ and $y_2$, say $\bold y = \bold B \bold x$ where
\begin{displaymath}
\left[
 \begin{array}{l}
 y_1 \\ y_2
 \end{array} \right]
 =
\left[
 \begin{array}{ll}
 b_{11} & b_{12} \\
 b_{21} & b_{22}
 \end{array} \right]
\left[
 \begin{array}{l}
 x_1 \\ x_2
 \end{array} \right]
\end{displaymath} (51)
The experimental residuals $y_1$ and $y_2$ are likely to be neither orthogonal nor equal in energy. From the column vector $\bold y$, the experimenter can form a square matrix. Let us also allow the experimenter to write the symbol $E()$ to denote summation over many trials or over many sections of data: ranges of time or space, soundings, or receiver locations. The experimenter writes
\begin{eqnarray}
\bold R &=& E ( \bold y \bold y' )  \qquad (52) \\
\bold R &=& E ( \bold B \bold x \bold x' \bold B' )  \qquad (53)
\end{eqnarray}
Given a random variable r, the expectation of 2r is simply E(2r)=2E(r). The E() symbol is a summation on random variables, but constants like the coefficients of $\bold B$ pass right through it. Thus,
\begin{eqnarray}
\bold R &=& \bold B \ E(\bold x \bold x')\ \bold B'  \qquad (54) \\
\bold R &=& \bold B
 \left[ \begin{array}{ll}
 E(x_1 x_1) & E(x_1 x_2) \\
 E(x_2 x_1) & E(x_2 x_2)
 \end{array} \right]
 \bold B'  \qquad (55) \\
\bold R &=& \bold B
 \left[ \begin{array}{ll}
 1 & 0 \\
 0 & 1
 \end{array} \right]
 \bold B'  \qquad (56) \\
\bold R &=& \bold B \bold B'  \qquad (57)
\end{eqnarray}
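This chain of equalities can be checked numerically. The sketch below (Python with numpy; the two-by-two matrix $\bold B$ is an arbitrary illustrative choice) draws many ideal residuals $\bold x$, forms $\bold y = \bold B \bold x$, and compares the averaged $E(\bold y \bold y')$ with $\bold B \bold B'$.

\begin{verbatim}
# Sketch: with ideal residuals x, the covariance of y = B x is R = B B'.
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[2.0, 0.0],
              [1.0, 0.5]])               # illustrative mixing matrix
x = rng.standard_normal((2, 100000))     # ideal residuals, one column per trial
y = B @ x                                # experimental residuals

R_estimated = (y @ y.T) / x.shape[1]     # E(y y') as an average over trials
print(R_estimated)                       # close to ...
print(B @ B.T)                           # ... B B'
\end{verbatim}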

Given a matrix $\bold R$, there is a simple well-known method called the Cholesky factorization method that will factor $\bold R$ into two parts like $\bold B$ and $\bold B'$. The method creates for us either an upper or a lower triangular matrix (our choice) for $\bold B$. You can easily reinvent the Cholesky method if you multiply the symbols for two triangular matrices like $\bold B$ and $\bold B'$ and notice the procedure that works backwards from $\bold R$ to $\bold B$. The experimenter seeks not $\bold B$, however, but its inverse, the matrix that takes us from the experimental residuals to the ideal optimization residuals that are uncorrelated and of equal energies. The Cholesky factorization costs $N^3$ computations, which is about the same as the cost of the matrix inversion of $\bold R$ or $\bold B$. For geophysical maps and other functions on Cartesian spaces, the Prediction Error Filter (PEF) accomplishes the same general goal and has the advantage that we have already learned how to perform the operation using operators instead of matrices.
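Here is a sketch of that whitening step (Python with numpy; the matrix used to manufacture the experimental residuals is again an arbitrary illustrative choice, unknown to the experimenter): estimate $\bold R$ from the data, factor it with a Cholesky factorization, and apply the inverse of the factor to $\bold y$.

\begin{verbatim}
# Sketch: Cholesky-factor R = B B', then whiten y with the inverse of B.
import numpy as np

rng = np.random.default_rng(0)
B_true = np.array([[2.0, 0.0],
                   [1.0, 0.5]])                  # illustrative, "hidden" mixing matrix
y = B_true @ rng.standard_normal((2, 100000))    # experimental residuals

R = (y @ y.T) / y.shape[1]       # multivariate spectrum  E(y y')
B = np.linalg.cholesky(R)        # lower-triangular factor, R = B B'
x = np.linalg.solve(B, y)        # x = B^{-1} y  (a triangular solve would also do)

print((x @ x.T) / x.shape[1])    # close to the identity matrix
\end{verbatim}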

The multivariate spectrum of experimental residuals $\bold y$ is the matrix $\bold R = E ( \bold y \bold y')$. For optimum model finding, the experimental residuals (squared) should be weighted inversely (matrix inverse) by their multivariate spectrum.
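One way to see this, using the factorization $\bold R = \bold B \bold B'$ of equation (57): weighting the squared experimental residuals by $\bold R^{-1}$ is the same as measuring the energy of the whitened residuals $\bold x = \bold B^{-1} \bold y$,
\begin{displaymath}
\bold y' \bold R^{-1} \bold y
 \ =\ \bold y' (\bold B \bold B')^{-1} \bold y
 \ =\ (\bold B^{-1} \bold y)' (\bold B^{-1} \bold y)
 \ =\ \bold x' \bold x
\end{displaymath}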

If I were a little stronger at analysis (or rhetoric) I would tell you that the optimizer's preconditioned variable $\bold p$ is the statistician's IID (Independent Identically Distributed) random variable. For stationary (statistically constant) signals and images, $\bold A_m$ is the model-space PEF. Echo soundings and interval velocity have statistical properties that change with depth. There $\bold A_m$ is a diagonal weighting matrix (perhaps before or after a PEF).