Next: Super Dix Up: Clapp: Multiple realizations Previous: Motivation

Missing Data

The missing data problem is probably the simplest to understand, and its results are the easiest to interpret. We begin by binning our data onto a regular mesh. For $\bf L$ in fitting goals (2) we will use a selector matrix $\bf J$, which is `1' at locations where we have data and `0' at unknown locations. As an example, let's try to interpolate a day's worth of data collected by SeaBeam, a device that measures water depth under and to the side of a ship (Figure 1) (Claerbout, 1999).
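The selector-matrix idea can be sketched in a few lines of NumPy. This is a minimal illustration on a 1-D mesh (the actual SeaBeam data is 2-D); all values and indices below are made up for the example.

```python
import numpy as np

# Bin scattered measurements onto a regular 1-D mesh (illustrative sketch;
# the real SeaBeam data is 2-D, but the selector idea is identical).
mesh = np.full(10, np.nan)          # model mesh, unknown everywhere
mesh[[1, 4, 7]] = [3.0, 2.5, 4.0]   # bins where a depth was recorded

known = ~np.isnan(mesh)             # boolean mask: True where data exists

# J as an explicit selector matrix: rows of the identity at known locations.
J = np.eye(mesh.size)[known]
d = J @ np.nan_to_num(mesh)         # J picks out exactly the known values

print(d)                            # the three recorded depths
```

In practice $\bf J$ is never formed as a dense matrix; applying it is just masked indexing, but the explicit form makes the connection to fitting goals (2) concrete.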

 
sea.init
Figure 1
Depth of the ocean under ship tracks.
sea.init
view burn build edit restore

Figure 2 shows the result of estimating a PEF from the known data locations and then using it to interpolate the entire mesh. Note how the solution has a lower spatial frequency as we move away from the recorded data. In addition, the original tracks of the ship are still clearly visible.

Figure 2 (sea.pef): Result of using a PEF to interpolate Figure 1, taken from GEE (Claerbout, 1999).

If we look at histograms of the known data and our estimated data (Figure 3) we can see the effect of the PEF. The histogram of the known data has a nice Gaussian shape. The estimated data is much less Gaussian and has a much lower variance. We want the estimated data to have the same statistical properties as the known data (for a Gaussian distribution this means matching the mean and variance).
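The statistical comparison amounts to computing the mean and variance of the two populations. A minimal sketch, using synthetic stand-ins for the known and estimated values of Figure 3:

```python
import numpy as np

# Comparing the statistics of known vs. estimated values (sketch; the
# arrays here are synthetic stand-ins, not the actual SeaBeam values).
rng = np.random.default_rng(4)
known = rng.normal(0.0, 1.0, 1000)       # Gaussian-shaped known data
estimated = rng.normal(0.0, 0.4, 1000)   # lower-variance PEF interpolation

for name, a in [("known", known), ("estimated", estimated)]:
    print(name, a.mean(), a.var())
```

The variance mismatch printed here is exactly the defect the rest of the section sets out to fix.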

Figure 3 (pef.histo): Histogram of the known data (solid lines) and the estimated data (`*'). Note the dissimilar shapes.

Geostatisticians are confronted with the same problem. They can produce smooth, low-frequency models through kriging, but must add a little twist to get models with the same statistical properties as the data. To understand how, a brief review of kriging is necessary. Kriging estimates each model point by a linear combination of nearby data points. For simplicity, let's assume that the data have a standard normal distribution. The geostatistician finds all of the points $m_1, \ldots, m_n$ around the point $m_0$ they are trying to estimate. The vector distances ${\bf d}_{ij}$ between all pairs of data points, and ${\bf d}_{i0}$ between each data point and the estimation point, are then computed. Using the predefined covariance function estimate $C$, a covariance value is then extracted between all known point pairs, $C_{ij}$, and between the known points and the estimation point, $C_{i0}$, at the given distances ${\bf d}_{ij}$ and ${\bf d}_{i0}$ (Figure 4). The weights $(w_1, \ldots, w_n)$ are computed by solving the set of equations implied by
\begin{displaymath}
\left[
\begin{array}{cccc}
C_{11} & \cdots & C_{1n} & 1 \\
\vdots & \ddots & \vdots & \vdots \\
C_{n1} & \cdots & C_{nn} & 1 \\
1 & \cdots & 1 & 0
\end{array}
\right]
\left[
\begin{array}{c}
w_1 \\ \vdots \\ w_n \\ \mu
\end{array}
\right]
=
\left[
\begin{array}{c}
C_{10} \\ \vdots \\ C_{n0} \\ 1
\end{array}
\right] .\end{displaymath} (3)
The estimate of $m_0$ is then simply
\begin{displaymath}
m_0= \sum_{i=1}^{n} w_i m_i.\end{displaymath} (4)
To guarantee that the matrix in equation (3) is invertible, geostatisticians approximate the covariance function by a linear combination of a limited set of functions that ensure the matrix is positive-definite, and therefore invertible.
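Equations (3) and (4) can be sketched directly in NumPy. This is an illustrative ordinary-kriging example for a single estimation point; the Gaussian covariance model and all coordinates and values are assumptions chosen for the demonstration, not part of the original text.

```python
import numpy as np

# Ordinary-kriging sketch for one estimation point (illustrative; the
# Gaussian covariance model and all coordinates are made-up assumptions).
def cov(h, sill=1.0, rng=2.0):
    """Gaussian covariance as a function of separation distance h."""
    return sill * np.exp(-(h / rng) ** 2)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # known points m_1..m_n
vals = np.array([1.2, 0.8, 1.0])                      # their values
p0 = np.array([0.4, 0.4])                             # estimation point

n = len(pts)
A = np.ones((n + 1, n + 1))
A[-1, -1] = 0.0
A[:n, :n] = cov(np.linalg.norm(pts[:, None] - pts[None, :], axis=-1))
b = np.ones(n + 1)
b[:n] = cov(np.linalg.norm(pts - p0, axis=-1))

sol = np.linalg.solve(A, b)       # weights w_1..w_n plus Lagrange multiplier
w = sol[:n]
m0 = w @ vals                     # equation (4)
print(w.sum(), m0)                # weights sum to 1 for ordinary kriging
```

The last row and column of the matrix enforce that the weights sum to one, which is what keeps the estimate unbiased.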

Figure 4 (covar-def): Definition of the terms in equation (3). A vector is drawn between two points; the covariance at the angle and distance describing that vector is then selected.

The smooth models provided by kriging often prove to be poor representations of earth properties. A classic example is fluid flow, where kriged models tend to give inaccurate predictions. The geostatistical solution is to estimate the field through Gaussian stochastic simulation rather than kriging (Deutsch and Journel, 1992). There are two major differences between kriging and simulation. The primary difference is that a random component is introduced into the estimation process. Stochastic simulation, or sequential Gaussian simulation, begins by selecting a random point in the model space. Kriging is then performed there, obtaining a kriged value $m_0$ and a kriging variance $\sigma_k$. Instead of using $m_0$ for the model value, we select a random number $\beta$ from a normal distribution and use as our model point estimate $m_i$,
\begin{displaymath}
m_i = m_0 + \sigma_k \beta.\end{displaymath} (5)
We then select a new point in the model space and repeat the procedure. To preserve spatial variability, a second change is made: all previously estimated points are treated as `data' when estimating new points, guaranteeing that the model matches the covariance estimate. By selecting different random numbers (and/or visiting the model points in a different order) we get a different, equiprobable model estimate. The advantage of models estimated through simulation is that they have not only the covariance of the data, but also its variance. As a result, models estimated by simulation give more realistic fluid-flow predictions than a kriged model. In addition, by trying different realizations, fluid-flow variability can be assessed.
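The sequential procedure described above can be sketched on a 1-D mesh. This is a minimal illustration, not GSLIB's implementation: the exponential covariance model, mesh size, and data values are all assumptions made for the example.

```python
import numpy as np

# Sequential-Gaussian-simulation sketch on a 1-D mesh (assumptions: an
# exponential covariance model and roughly standard-normal data).
rng = np.random.default_rng(0)

def cov(h):
    return np.exp(-np.abs(h) / 3.0)

x = np.arange(20, dtype=float)          # mesh coordinates
m = np.full(20, np.nan)
m[[2, 9, 15]] = [0.5, -0.3, 0.8]        # known data

def krige(xs, vs, x0):
    """Ordinary kriging: return estimate m0 and kriging variance."""
    n = len(xs)
    A = np.ones((n + 1, n + 1)); A[-1, -1] = 0.0
    A[:n, :n] = cov(xs[:, None] - xs[None, :])
    b = np.ones(n + 1); b[:n] = cov(xs - x0)
    sol = np.linalg.solve(A, b)
    var = cov(0.0) - sol @ b            # ordinary-kriging variance
    return sol[:n] @ vs, max(var, 0.0)

# Visit unknown points in random order; earlier simulated values become data.
for i in rng.permutation(np.flatnonzero(np.isnan(m))):
    known = ~np.isnan(m)
    m0, var = krige(x[known], m[known], x[i])
    m[i] = m0 + np.sqrt(var) * rng.standard_normal()   # equation (5)

print(m)
```

A different seed (or a different visiting order) produces a different, equiprobable realization, which is exactly the behavior the text describes.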

The difference between kriging and simulation has a corollary in our least-squares estimation problem. To see how, let's write our fitting goals in a slightly different format,
\begin{eqnarray}
{\bf r_d} &\approx& {\bf d} - {\bf J}{\bf m} \nonumber \\
{\bf r_m} &\approx& \epsilon {\bf A}{\bf m}
,\end{eqnarray} (6)
where $\bf r_d$ is our data residual and $\bf r_m$ is our model residual. The model residual is the result of applying our covariance estimate $\bf A$ to our model estimate. The larger the value of a given $\bf r_m$, the less that model point makes sense with its surrounding points, given our idea of covariance. This is similar to kriging variance. It follows that we might be able to obtain something similar to the geostatistician's simulations by rewriting our fitting goals as
\begin{eqnarray}
{\bf d} &\approx& {\bf J}{\bf m} \nonumber \\
\sigma {\bf v} &\approx& \epsilon {\bf A}{\bf m}
,\end{eqnarray} (7)
where $\bf v$ is a vector of random normal numbers and $\sigma$ is a measure of our estimation uncertainty. By adjusting $\sigma$ we can change the distribution of $\bf m$. For example, let's return to the SeaBeam example. Figure 5 shows four different model estimates using a normal distribution and various values for the variance. Note how the texture of the model changes significantly. If we look at histograms of the various realizations (Figure 6), we see that the correct distribution is somewhere between our second and third realizations.
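Fitting goals (7) can be solved as one stacked least-squares system. A 1-D sketch, with a first-difference operator standing in for the covariance operator $\bf A$ (an assumption made for the example; the paper uses a PEF), and made-up data locations and values:

```python
import numpy as np

# Sketch of fitting goals (7) in 1-D, with a first-difference filter standing
# in for the covariance operator A (an assumption; the paper uses a PEF).
rng = np.random.default_rng(1)

n = 50
data_idx = np.array([5, 20, 33, 44])        # bins with recorded data
d = np.array([1.0, 0.2, -0.5, 0.7])         # the recorded values

J = np.eye(n)[data_idx]                     # selector matrix
A = np.diff(np.eye(n), axis=0)              # first-difference regularizer
eps, sigma = 0.1, 0.3

v = rng.standard_normal(A.shape[0])         # random normal vector
# Stack the two fitting goals:  d ~ J m  and  sigma*v ~ eps*A m
G = np.vstack([J, eps * A])
rhs = np.concatenate([d, sigma * v])
m, *_ = np.linalg.lstsq(G, rhs, rcond=None)

print(m[data_idx])                          # close to d where data exists
```

Setting `sigma = 0` recovers the deterministic (kriging-like) solution; increasing it roughens the model away from the data, which is the texture change seen in Figure 5.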

We can get an estimate of $\sigma$, or in the case of the missing data problem $\frac{\sigma}{\epsilon}$, by applying fitting goals (6). Comparing the variance of the model residual, $\sigma({\bf r_m})$, with that of the data, $\sigma({\bf d})$, gives a good estimate of $\sigma$,
\begin{displaymath}
\sigma = \frac{ \sigma({\bf r_m}) }{ \sigma({\bf d}) }
.\end{displaymath} (8)
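Equation (8) is a one-line computation once the residual is available. A sketch with synthetic stand-ins for the data and model residual (here $\sigma(\cdot)$ is taken as the standard deviation):

```python
import numpy as np

# Equation (8) as code: estimate sigma from the model residual produced by
# fitting goals (6) (d and r_m here are synthetic stand-ins).
rng = np.random.default_rng(2)
d = rng.normal(0.0, 2.0, size=500)     # stand-in for the known data
r_m = rng.normal(0.0, 0.5, size=500)   # stand-in for the model residual

sigma = np.std(r_m) / np.std(d)        # equation (8)
print(sigma)                           # roughly 0.5 / 2.0 = 0.25
```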

Figure 5 (distrib): Four different realizations with increasing $\sigma$ in fitting goals (7).

Figure 6 (movie.distir): Histogram of the known data (solid line) and the four different realizations of Figure 5.

Figure 7 shows eight different realizations with a random noise level calculated through equation (8). Note how well we have emulated the distribution of the known data. Each image shows some similar features but also significant differences (note especially the `V'-shaped portion of the known data).

Figure 7 (sea.movie): Eight different realizations of the SeaBeam interpolation problem and their histograms. Note how the realizations vary away from the known data points.

A potentially attractive feature of setting up the problem in this manner is that it is easy to use a space-varying covariance function (a steering filter or non-stationary PEF) along with a non-stationary variance. Figure 8 shows the SeaBeam example again, with the variance increasing from left to right.
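Making the variance non-stationary only requires scaling the random vector element-wise before stacking the system. A 1-D sketch under the same illustrative assumptions as before (first-difference operator for $\bf A$, made-up data):

```python
import numpy as np

# Non-stationary variance sketch: scale the random vector v by a spatially
# varying sigma before stacking fitting goals (7) (illustrative assumptions).
rng = np.random.default_rng(3)
n = 50
data_idx = np.array([5, 20, 33, 44])
d = np.array([1.0, 0.2, -0.5, 0.7])

J = np.eye(n)[data_idx]                  # selector matrix
A = np.diff(np.eye(n), axis=0)           # stand-in covariance operator
eps = 0.1

sigma_x = np.linspace(0.0, 0.6, n - 1)   # variance grows left to right
v = sigma_x * rng.standard_normal(n - 1) # element-wise, non-stationary noise

G = np.vstack([J, eps * A])
m, *_ = np.linalg.lstsq(G, np.concatenate([d, v]), rcond=None)
print(m)                                 # rougher toward the right edge
```

The same element-wise trick applies to a space-varying operator $\bf A$: nothing in the stacked system requires stationarity.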

Figure 8 (non-stat): Realization where the variance added to the image increases from left to right.


Stanford Exploration Project
9/5/2000