%+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
%+                                                                           +
%+ NOTICE: This article is also available formatted with illustrations.     +
%+                                                                           +
%+ SEE:  http://sepwww.stanford.edu/oldreports/sep61/                        +
%+                                                                           +
%+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
\chapter{Resolution and random signals}
\setcounter{chapter}{\RAND}
The accuracy of measurements on observed signals
is limited not only by practical realities,
but also by certain fundamental principles.
A well-known example is the time-bandwidth product
in Fourier transformation theory
called the ``uncertainty principle.''
\par
Observed signals often look random
and often they are modeled by filtered random numbers.
In this chapter we will see many examples of signals built from random numbers
and see how the nomenclature of statistics applies to them.
Fundamentally, this chapter characterizes {\it resolution}:
the resolution of frequency and arrival time,
and the statistical resolution of signal amplitude and power
as functions of time and frequency.
\par
We will see $\sqrt{n}$ popping up everywhere.
For example, signals that are theoretically uncorrelated
generally appear to be weakly correlated at a level of $1/\sqrt{n}$
where $n$ is the number of independent points in the signal.
\par
Measures of resolution
(which are called variances, tolerances, uncertainties,
bandwidths, durations, spreads, spans, etc.)
often interact with one another,
so that an experimental change that reduces one
must necessarily increase another or some combination of the others.
This chapter exhibits basic cases where such conflicting interactions occur.
\par
To avoid confusion I introduce the unusual notation $\Lambda$
where $\Delta$ is commonly used.
Notice the letter $\Lambda$ resembles the letter $\Delta$
and $\Lambda$ connotes length without being confused with wavelength.
Lengths on the time and frequency axes are defined as follows.
\begin{tabbing}
indent \= dt,,df,space \= description.conjugate.conjugate.conjugate \= \kill
\>$dt$, $df$ \>mesh intervals in time and frequency \\
\>$\Delta t$, $\Delta f$ \>mesh intervals in time and frequency \\
\>$\Delta T$, $\Delta F$ \>extent of time and frequency axis\\
\>$\Lambda T$, $\Lambda F$ \>time duration and spectral bandwidth of a signal
\end{tabbing}
\par
I wish I could give you formal definitions of
time duration $\Lambda T$ and spectral bandwidth $\Lambda F$.
A variety of defining equations are easy to write, and many are in general use.
The main idea is that the time span $\Lambda T$ or the frequency span $\Lambda F$
should be able to include most of the energy but need not contain it all.
The time duration of a damped exponential function is infinite
if by duration you mean the span of nonzero function values.
However, for practical purposes the time span is generally chosen
as the time required for the amplitude to decay to $e^{-1}$ of its original value.
For many functions the span is defined as the distance between points
on the time or frequency axis
where the curve (or its envelope) drops to half of the maximum value.
Since there is no mathematically tractable and universally acceptable definition
for time span and spectral bandwidth,
this chapter tends to be more descriptive than analytic.
\section{TIME-FREQUENCY RESOLUTION}
\par
Scaling a function to be narrower in one domain
also scales it to be wider in the other domain.
This is a consequence of Fourier transforms being built from $e^{i\omega t}$.
Scaling $\omega$ implies inverse scaling of $t$
to keep the product $\omega t$ constant.
For example, the FT of a rectangle is a sinc.
Making the rectangle narrower broadens the sinc in proportion
because $\omega t$ is constant.
Here we will quantify this relationship
between the signal width $\Lambda T$ and its spectral bandwidth $\Lambda F$.
\par
A pure sinusoidal wave has a clearly defined frequency,
but it is spread over the infinitely long time axis.
At the other extreme is a delta function,
which is nicely compressed to a point on the time axis
but contains a mixture of all frequencies.
Here we examine how the width of a function in one domain
relates to that in the other.
By the end of this section, we will conclude that for any function,
the time duration $\Lambda T$ and the spectral bandwidth $\Lambda F$ are related by
\begin{equation}
\Lambda F \; \Lambda T \quad \geq \quad 1
\label{4-1-1}
\end{equation}
This relationship is often called the {\it uncertainty principle.}
\par
Since we are unable to find a precise and convenient analysis
for the definitions of $\Lambda F$ and $\Lambda T$,
the inequality~(\ref{4-1-1}) is not a precise mathematical statement.
What is important is that rough equality in (\ref{4-1-1})
is observed for many simple functions,
but for others the inequality can be extremely slack (far from equal).
Strong {\it inequality} arises from all-pass filters.
An all-pass filter leaves the spectrum unchanged, and hence $\Lambda F$ unchanged,
but it can spread out the signal arbitrarily, increasing $\Lambda T$ arbitrarily.
Thus the time-bandwidth product is unbounded for all-pass filters.
Some people say that the Gaussian function has the minimum product in~(\ref{4-1-1}),
but that really depends on a particular method of measuring
$\Lambda F$ and $\Lambda T$.
\subsection{Counterexample refuted}
\par
It is easy to misunderstand the uncertainty principle.
An oversimplification of it is to say that it is
``impossible to know the frequency at any particular time.''
This oversimplification leads you to think about a truncated sinusoid,
such as in Figure~\ref{rand-windcos}.
You know the frequency exactly, so $\Lambda F$ seems zero,
whereas $\Lambda T$ is finite, and this seems to violate (\ref{4-1-1}).
But what the figure shows is that the truncation of the sinusoid
has broadened the frequency band.
More particularly, the impulse function in the frequency domain
has been convolved with the sinc function
that is the Fourier transform of the truncating rectangle function.
\plot{rand-windcos}{1.00in} {\FIGS/rand/windcos}
{(rand-windcos) Windowed sinusoid and its Fourier transform.
\todo{Redo with cosine on right edge.} }
\plot{rand-uncertain}{2.56in} {\FIGS/rand/uncertain}
{(rand-uncertain) Sampled Gaussian functions and their Fourier transforms
for vectors of length $n$ = 16, 64, and 256. }
\subsection{Measuring the time-bandwidth product}
\par
Now examine Figure~\ref{rand-uncertain}
containing sampled Gaussian functions and their Fourier transforms.
The Fourier transform of a Gaussian is well known to be another Gaussian function,
as the plot confirms.
I adjusted the width of each Gaussian
so that it would be about equal in each of the two domains.
The Gaussians were sampled at various values of $n$,
increasing in steps by a factor of 4.
You can measure the width dropping by a factor of 2 at each step.
For those of you who have already learned about the uncertainty principle
it may seem paradoxical that the function's width is dropping
in both the time domain and the frequency domain.
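\par
As a small numerical companion to Figure~\ref{rand-uncertain}
(not one of the original figures),
the sketch below builds a sampled Gaussian for several values of $n$,
Fourier transforms it,
and measures a width in each domain
on the physical meshes $dt$ and $df$ with $dt\,df=1/n$.
It assumes Python with the {\tt numpy} package;
any language with an FFT would serve as well.
The printed product $\Lambda T\,\Lambda F$ hovers around one,
with the exact number depending on which (arbitrary) width definition is used,
which is just the rough-equality reading of (\ref{4-1-1}).
\begin{verbatim}
import numpy as np

def width(a, frac):
    # number of samples on which |a| exceeds frac times its peak
    return np.sum(np.abs(a) >= frac * np.abs(a).max())

for n in (16, 64, 256):
    dt = 1.0 / np.sqrt(n)           # physical mesh chosen so that dt = df
    df = 1.0 / (n * dt)             # mesh relation dt*df = 1/n
    t = (np.arange(n) - n // 2) * dt
    g = np.exp(-np.pi * t ** 2)     # Gaussian, equally wide in both domains
    G = np.fft.fft(g)               # its spectrum is again (nearly) Gaussian
    for frac in (0.5, np.exp(-1)):  # two of many possible width definitions
        LT = width(g, frac) * dt
        LF = width(G, frac) * df
        print("n=%4d  level=%.2f  LT*LF=%.2f" % (n, frac, LT * LF))
\end{verbatim}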
\par The resolution of the paradox is that the physical time axis or the frequency axis length is varying as we change $n$ (even though the plot length is scaled to a constant on the page). We need to associate a physical mesh with the computational mesh. Two methods of associating a physical mesh with a computational mesh were described in chapter \FT\ on page~\pageref{'mesh'}. In real physical space as well as Fourier transform space, the object remains a constant size as the mesh is refined. \par Let us read from Figure~\ref{rand-uncertain} values for the widths $\Lambda F$ and $\Lambda T$. On the top row with $n=16$, I pick a width of about 4 points, and this seems to include about 90\% of the area under the function. So for this signal (with the widths roughly equal in each domain) it seems that $\Lambda T = \sqrt{n} dt$ and $\Lambda F = \sqrt{n}df$. Using the relation between $dt$ and $df$ found in chapter \FT\ equation (\ref{ft.dtdf}) that $dt\,df= 1/n$, the product becomes $\Lambda T \Lambda F$ = 1. \par We could also confirm the inequality (\ref{4-1-1}) by simple functions for which we know the analytic transforms. For example, consider an impulse function in time. Then $\Lambda T = dt$ and the Fourier transform occupies the entire frequency band, i.e.~$\Lambda F = 1/dt$. Thus again the product is $\Lambda T \Lambda F$ = 1. \subsection{Uncertainty principle in physics} \par The inequality (\ref{4-1-1}) gets its name, the ``uncertainty principle,'' from its interpretation in quantum mechanics. Observations of subatomic particles show they behave like waves with spatial frequency proportional to particle momentum. The classical laws of mechanics enable prediction of the future of a mechanical system by extrapolation from the currently known position and momentum. But because of the wave nature of matter with momentum proportional to spatial frequency, such prediction requires simultaneous knowledge of both the location and the spatial frequency of the wave. This is impossible, as we see from (\ref{4-1-1}), hence the word, ``uncertainty.'' \subsection{Gabor's proof of the uncertainty principle} \par Although it is easy to verify the uncertainty principle in many special cases, it is not easy to deduce it. The difficulty begins from finding a definition of the width of a function that leads to a tractable analysis. One possible definition is to use a second moment, that is, to define $\Lambda T$ by \begin{equation} (\Lambda T)^2 \eq { \int \ t^2 \, b(t)^2\, dt \over \int b(t)^2\, dt } \label{rand.2nd_moment} \end{equation} The spectral bandwidth $\Lambda F$ is defined likewise. With these definitions, Dennis Gabor prepared a widely reproduced proof. I'll omit his proof here; it is not an easy proof; it is widely available; and the definition (\ref{rand.2nd_moment}) seems inappropriate for a function we often use, the sinc function, i.e.~the FT of a step function. Since the sinc function drops off as $t^{-1}$, its width $\Lambda T$ defined with (\ref{rand.2nd_moment}) is infinity, which is unlike the more human measure of width, the distance to the first axis crossing. \subsection{My risetime proof of the uncertainty principle} \par In ``\FGDP'' (FGDP) I came up with a proof of the uncertainty principle that is appropriate for causal functions. That proof is included directly below, but I recommend the beginning reader skip over it because it is somewhat lengthy. 
It is included because this book is oriented towards causal functions,
the proof is not well known,
and it is now improved from FGDP.
\par
A similar and possibly more basic concept
than the product of time and frequency spreads
is the relationship between spectral bandwidth
and {\it rise time} of a system response function.
The {\it rise time} $\Lambda T$ of a system response is defined as follows:
when you kick a physical system with an impulse function,
it usually responds by rising rapidly to some maximum level
and then dropping off more slowly towards zero.
The quantitative value of the rise time is somewhat arbitrary;
it is generally taken to be the span between the time of excitation
and the time at which the system response is more than halfway up to its maximum.
\par
{\it Tightness} (nearness to equality) in the inequality (\ref{4-1-1})
is associated with minimum phase.
{\it Slackness} (farness from equality) in the inequality (\ref{4-1-1})
would arise from a filter with an additional all-pass component.
Slackness could also come from a decay time that is more rapid than the rise time.
Slackness could also result from other combinations of rises and falls,
such as random combinations.
Minimum-phase systems generally respond rapidly
compared to the rate at which they later decay.
Focusing our attention on such systems,
we can now seek to derive the inequality (\ref{4-1-1})
applied to rise time and bandwidth.
\par
The first step is to choose a definition for rise time.
I have found a tractable definition of rise time to be
\begin{equation}
{1 \over \Lambda T} \eq
{\int^{\infty}_0 {1 \over t}\, b(t)^2\, dt \over \int^{\infty}_0 b(t)^2\, dt}
\label{4-1-2}
\end{equation}
where $b(t)$ is the response function under consideration.
Equation (\ref{4-1-2}) defines $\Lambda T$ by the first negative moment.
Since this is unfamiliar, consider two examples.
Taking $b(t)$ to be a step function,
the numerator integral diverges, giving the desired $\Lambda T = 0$ rise time.
For another example, take $b(t)^2$ to grow linearly from zero to $t_0$
and then vanish.
Then the rise time $\Lambda T$ is $t_0/2$, again reasonable.
It is curious that $b(t)$ could grow as $\sqrt{t}$,
which rises with infinite slope at $t=0$,
and $\Lambda T$ would not be pushed to 0.
\subsubsection{Proof by way of the dual problem}
\par
Although the $Z$ transform method is a great aid
in studying cases where divergence (as $1/t$) plays a role,
it has the disadvantage that it destroys the formal interchangeability
between the time domain and the frequency domain.
To take advantage of the analytic simplicity of the $Z$ transform
we consider instead the dual to the rise-time problem.
Instead of a signal whose square vanishes identically at negative time,
we have a spectrum $\overline{B} (1/Z) B(Z)$
which vanishes at negative frequencies.
We measure how fast this spectrum can rise after $\omega = 0$.
We will find this to be related
to the time duration $\Lambda T$ of the complex signal $b_t$.
More precisely, we will now define
the lowest significant frequency component $\Lambda F$ in the spectrum
analogously to (\ref{4-1-2}) to be
\begin{equation}
{1 \over \Lambda F} \eq
\int^{\infty}_{-\infty} {1 \over f} \ \overline{B} B\, df \eq
\int^{\infty}_{-\infty} \overline{B} B\, {d \omega \over \omega}
\label{4-1-4}
\end{equation}
where we have assumed the spectrum is normalized,
i.e.~the zero lag of the auto-correlation of $b_t$ is unity.
Now recall the bilinear transform, equation~(\ref{zz.bilin}), which represents $1/(-i\omega)$ as the coefficients of $.5(1+Z)/(1-Z)$, namely, $(\ldots 0, 0, 0, 0.5, 1, 1, 1 \ldots )$. The pole right on the unit circle at $Z = 1$ causes some nonuniqueness. Because $1/i \omega$ is an imaginary odd frequency function, we'll want an odd expression (such as on page~\pageref{'odd_integral'}) to insert into (\ref{4-1-4}) such as \begin{equation} {1 \over -i \omega } \eq {(\cdots - Z^{-2} - Z^{-1} + 0 + Z + Z^2 + \cdots) \over 2} \label{4-1-5} \end{equation} Using limits on the integrals for time-sampled functions and inserting (\ref{4-1-5}) into (\ref{4-1-4}) gives \begin{equation} {1 \over \Lambda F} \eq {-i \over 2\pi} \int^{+\pi}_{-\pi} {1 \over 2} (\cdots -Z^{-2} -Z^{-1} + Z +Z^2 + \cdots )\ \overline{B} \left( {1 \over Z} \right) \, B(Z)\, d \omega \label{4-1-6} \end{equation} Let $s_t$ be the autocorrelation of $b_t$. Since any integral around the unit circle of a $Z$-transform polynomial selects the coefficient of $Z^0$ of its integrand we have \begin{eqnarray} {1 \over \Lambda F} &=& {-i \over 2} \left[ (s_{-1} - s_1) + (s_{-2} - s_2) + (s_{-3} - s_3) + \cdots \right] \eq \sum^{\infty}_{t = 1} - \Im s_t \label{rand.wide} \\ {1 \over \Lambda F} &=& \sum^{\infty}_{t = 1} - \Im s_t \quad \leq \quad \sum^{\infty}_{t = 1} |s_t| \label{4-1-8} \end{eqnarray} The height of the autocorrelation has been normalized to $s_0=1$. The sum in (\ref{4-1-8}) is an integral representing area under the $|s_t|$ function. So the area is a measure of the autocorrelation width $\Lambda T_{\rm auto}$. Thus \begin{equation} {1 \over \Lambda F} \quad \leq \quad \sum^{\infty}_{t = 1} |s_t| \eq \Lambda T_{\rm auto} \label{4-1-9} \end{equation} \par Finally, we must relate the duration of a signal $\Lambda T$ to the duration of its autocorrelation $\Lambda T_{\rm auto}$. Generally speaking, it is easy to find a long signal that has short autocorrelation. Just take an arbitrary short signal and convolve it by a lengthy all-pass filter. Conversely, you can't get a long autocorrelation function from a short signal. A good case would be the autocorrelation of a rectangle function that is a triangle. Nominally, the triangle seems to be twice as long, but considering that the triangle tapers down, it is reasonable to assert the $\Lambda T$'s are the same. So we conclude \begin{equation} \Lambda T_{\rm auto} \label{4-1-10} \quad \leq \quad \Lambda T \end{equation} Inserting into (\ref{4-1-9}) we have the uncertainty relation \begin{equation} \Lambda T \ \Lambda F \quad \geq \quad 1 \label{4-1-11} \end{equation} \par Looking back over the proof I feel the basic time-bandwidth idea is in the {\it equality}~(\ref{rand.wide}) for which I have no concise verbalization. The {\it inequality} arises from $\Lambda T_{\rm auto} < \Lambda T$ which is a simple idea. \exer \item Consider $B(Z) = [1 - (Z/Z_0)^n]/(1 -Z/Z_0)$ in the limit $Z_0$ goes to the unit circle. Sketch the signal and its squared amplitude. Sketch the frequency function and its squared amplitude. Choose $\Lambda F$ and $\Lambda T$. \item A time series made up of two frequencies may be written as $$b_t = A \cos \omega_1 t + B\sin \omega_1 t + C \cos \omega_2 t + D\sin \omega_2 t$$ Given $\omega_1$, $\omega_2$, $b_0$, $b_1$, $b_2$, $b_3$ show how to calculate the amplitude and phase angles of the two sinusoidal components. \exerend \section{FOURIER TRANSFORMS OF RANDOM NUMBERS} \par Many real signals are complicated and barely comprehensible. 
In experimental work, we commonly transform such data.
To better understand what this means,
it will be worthwhile examining signals made from random numbers.
\plot{rand-nrand}{2.75in} {\FIGS/rand/nrand}
{ (rand-nrand) Fourier cosine transforms of vectors containing random numbers.
N is the number of components in the vector.
\label{'rand-nrand'} }
\par
Figure~\ref{rand-nrand} shows discrete Fourier transforms of random numbers.
The basic conclusion from this figure is that
transforms of random numbers look like more random numbers.
\par
A random series containing all frequencies
is called a {\it white} noise series
because the color white is made from roughly equal amounts of all colors.
Any series made up of independently chosen random numbers
is said to be an {\it independent} series.
An independent series must be white,
but a white series need not be independent.
\plot{rand-pad}{1.62in} {\FIGS/rand/pad}
{ (rand-pad) Zero padded random numbers and their Fourier transforms. }
Figure~\ref{rand-pad} shows Fourier transforms of random numbers
surrounded by zeros; in other words, the random numbers have been {\it zero padded.}
Since each vector of random numbers is the same length
(1024 points, including both sides of the even function, with 513 points shown),
the transforms are also the same length.
The top signal has less randomness than the second trace
(16 random numbers versus 64).
Thus the top FT is smoother than the lower ones.
The best way to understand this figure
is to think of the left-hand signal as a frequency function.
When higher frequencies are present, the right-hand signal oscillates faster.
\subsection{Bandlimited noise}
Figure~\ref{rand-shift} shows bursts of 25 random numbers
at various shifts, and their Fourier transforms.
You can think of either side of the figure as the time domain
while the other side is the frequency domain.
(See Chapter \FT\ page~\pageref{'plot_format'}
for a description of the different ways of interpreting plots
of one side of Fourier transform pairs of an even function.)
I like to think of the left side as the Fourier domain
and the right side as the signals.
Then the signals seem to be sinusoids of a constant frequency
(called the {\it center} frequency)
and an amplitude that is modulated at a slower rate
(called the beat frequency).
Observe that the center frequency
is related to the {\it location} of the random bursts
and that the beat frequency
is related to the {\it bandwidth} of the noise burst.
\plot{rand-shift}{1.71in} {\FIGS/rand/shift}
{ (rand-shift) Shifted, zero padded random numbers in bursts of 25 numbers.
\todo{Remake with all bursts identical.} }
\par
You can also think of Figure~\ref{rand-shift}
as having one-sided frequency functions on the left,
and the right side as being the {\it real part} of the signal.
The real parts are cosine-like,
whereas the imaginary parts (not shown) are sine-like,
with the same envelope function.
\par
You might have noticed that the bottom plot in Figure~\ref{rand-shift},
the one whose burst lies at the Nyquist frequency,
seems to have about twice as many beats as the two plots above it.
This can be explained as an edge effect.
The noise burst near the Nyquist frequency is really twice as wide as shown
because it is mirrored about the Nyquist frequency into negative frequencies.
Likewise, the top figure is not modulated at all,
but the signal itself has a frequency that matches the beats on the bottom figure.
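\par
The construction in Figure~\ref{rand-shift} is easy to imitate numerically.
The sketch below (Python with {\tt numpy} assumed;
the burst positions and the random seed are illustrative,
not taken from the figure)
fills a band of 25 Fourier coefficients with random numbers at some shift,
inverse transforms,
and reports the dominant frequency of the real part
together with the number of envelope maxima.
The dominant frequency follows the {\it location} of the burst,
while the count of envelope maxima is set by the 25-point {\it bandwidth}
and stays roughly the same for every shift.
\begin{verbatim}
import numpy as np

n = 1024                          # length, as in the zero-padded examples
rng = np.random.default_rng(1)

def burst_signal(shift, width=25):
    # a band of random numbers at 'shift', taken as the frequency domain
    spec = np.zeros(n)
    spec[shift:shift + width] = rng.standard_normal(width)
    return np.fft.ifft(spec)      # complex signal; real part is cosine-like

for shift in (0, 100, 300):
    x = burst_signal(shift)
    env = np.abs(x)               # the beat envelope
    # dominant frequency of the real part follows the burst location
    k = np.argmax(np.abs(np.fft.fft(x.real))[:n // 2])
    # envelope maxima are set by the bandwidth, not by the shift
    beats = np.sum((env[1:-1] > env[:-2]) & (env[1:-1] > env[2:]))
    print("shift=%4d  dominant frequency=%4d  envelope maxima=%3d"
          % (shift, k, beats))
\end{verbatim}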
\section{TIME-STATISTICAL RESOLUTION} \footnote{Corrections and clarifications were made in this section by Gilles Darche. } If you flipped a fair coin 1000 times, it is unlikely that you would get exactly 500 ``heads'' and 500 ``tails.'' More likely the head count would lie somewhere between 400 and 600. Or would it lie in another range? The theoretical value, called the {\it mean} or the {\it expectation} is 500. The value from your experiment is called the {\it sample mean}. How much difference $\Lambda m$ should we expect between the sample mean $\hat m$ and the true mean $m$? Both the coin flips $x$ and your sample mean $\hat m$ are {\it random variables}. Your coin flip experiment could be repeated many times generally giving a different result each time. This concept will be formalized as {\it the variance of the sample mean} which is the expected squared difference between the true mean and the mean of your sample. \par The problem of estimating the statistical parameters of a time series, such as its mean, also appears in seismic processing. Effectively, we deal with seismic traces of finite duration, extracted from infinite sequences whose parameters can only be estimated from the finite set of values available in these seismic traces. Since the knowledge of these parameters, such as signal/noise ratio, can play an important role during the processing, it can be useful not only to estimate them, but also to have an idea of the error made in this estimation. \subsection{Ensemble} The ``true value'' of the mean could be defined by flipping the coin $n$ times and conceiving of $n$ going to infinity. A more convenient definition of ``true value'' is that the experiment is imagined as having been done separately under identical conditions by an infinite number of people (an ensemble). The {\it ensemble} may seem a strange construction of the imagination, nonetheless much literature in statistics and in the natural sciences uses the idea of an ensemble. Let's say that the ensemble is defined by all the possible values the variable can take at one particular time, associated with a probability of occurrence. Then, the ensemble idea enables us to define a time-variable mean (the sum of these values weighted by the probabilities), for example for coins that change with time. \subsection{Expectation and Variance} A conceptual average over the ensemble, called an expectation, is denoted by the symbol $\E$. The index for summation over the ensemble is never shown explicitly; every random variable is presumed to have one. Thus, the true mean at time $t$ is defined as ${m_x}(t)=\E(x_t)$. The mean can vary with time: \begin{equation} m_x(t) \eq \E[x(t)] \label{4-2-2} \end{equation} Likewise the {\it power} is \begin{equation} p_x(t) \eq \E[x(t)^2] \end{equation} The {\it variance} $\sigma^2$ is defined to be the power after the mean is removed, i.e. \begin{equation} \sigma_x(t)^2 \eq \E\, [x(t) - m_x(t)]^2 \label{4-2-4} \end{equation} Conventionally, $\sigma^2$ is called the variance, and $\sigma$ is called the {\it standard deviation}. \par For notational convenience it is customary to write $m(t)$, $p(t)$, $\sigma(t)$, and $x(t)$ simply as $m$, $p$, $\sigma$, and $x_t$ using the verbal context to specify whether $m$, $p$, and $\sigma$ are time variable or constant. For example, the standard deviation of the seismic amplitudes on a seismic trace before correction of spherical divergence decreases with time, since these amplitudes are expected to be ``globally'' smaller as time goes on. 
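\par
The ensemble is easy to imitate on a computer:
generate many independent realizations of the same process
and average across the realizations at each time $t$.
The sketch below (Python with {\tt numpy} assumed;
the decaying-amplitude model is only an illustration
in the spirit of the spherical-divergence example, not data from this book)
estimates a time-variable $m_x(t)$ and $\sigma_x(t)$ by ensemble averaging.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
nt, nens = 100, 5000                   # time samples, ensemble members
t = np.arange(nt)
true_sigma = np.exp(-t / 50.0)         # illustrative amplitude decay
x = true_sigma * rng.standard_normal((nens, nt))  # ensemble of realizations

m_hat = x.mean(axis=0)                 # m_x(t), averaged over the ensemble
s_hat = np.sqrt(((x - m_hat) ** 2).mean(axis=0))  # sigma_x(t) about that mean

print(np.round(true_sigma[::25], 3))   # the time-variable standard deviation ...
print(np.round(s_hat[::25], 3))        # ... recovered to within ensemble scatter
\end{verbatim}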
\par When manipulating algebraic expressions the symbol $\E$ behaves like a summation sign, namely \begin{equation} \E \quad \equiv \quad (\lim N \rightarrow \infty) \quad {1 \over N} \sum^N_1 \label{4-2-5} \end{equation} Note that the summation index is not given, since the sum is over the ensemble, not time. To get some practice with the expectation symbol $\E$, we can reduce equation~(\ref{4-2-4}) \begin{equation} \sigma_x^2 \eq \E\, [(x_t - m_x)^2] \eq \E(x_t^2) \ - \ 2 m_x \E(x_t) + m_x^2 \eq \E(x_t^2) \ - \ m_x^2 \end{equation} which says the energy is the variance plus the squared mean. \subsection{Sample mean} Now let $x_t$ be a time series made up from identically distributed random numbers: $m_x$ and $\sigma_x$ do not depend on time. Let's also suppose that they are {\it independently} chosen; this means in particular that for any different times $t$ and $s$ ($t\neq s$): \begin{equation} \E(x_tx_s) \eq \E(x_t) \E(x_s) \end{equation} Suppose we have a sample of $n$ points of $x_t$ and are trying to determine the value of $m_x$. We could make an estimate $\hat m_x$ of the mean $m_x$ with the formula \begin{equation} \hat m_x \eq {1 \over n} \sum^n_{t = 1} x_t \label{4-2-6} \end{equation} \par A somewhat more elaborate method of estimating the mean would be to take a weighted average. Let $w_t$ define a set of weights normalized so that \begin{equation} \sum w_t \eq 1 \label{4-2-7} \end{equation} With these weights the more elaborate estimate $\hat{m}$ of the mean is \begin{equation} \hat m_x \eq \sum w_t \, x_t \label{4-2-8} \end{equation} Actually (\ref{4-2-6}) is just a special case of (\ref{4-2-8}) where the weights are $w_t = 1/n$. \par Further, the weights could be {\it convolved} on the random time series, to compute {\it local} averages of this time series, thus smoothing it. The weights are simply a filter response where the filter coefficients happen to be positive and cluster together. An example in Figure~\ref{rand-walk} shows a random walk function with itself smoothed locally. \plot{rand-walk}{1.96in}{\FIGS/rand/walk}{(rand-walk) Random walk and itself smoothed (and shifted downward). } \subsection{Variance of the sample mean} Our objective here is to calculate how far the estimated mean $\hat{m}$ is likely to be from the true mean $m$ for a sample of length $n$. 
This departure is called the {\it variance of the sample mean},
and it is given by $(\Lambda m)^2=\sigma_{\hat m}^2$, where
\begin{eqnarray}
\sigma_{\hat m}^2 &= & \E\, [(\hat{m} - m)^2] \label{4-2-9} \\
	&= & \E\, \left\{ [(\sum w_t x_t) - m]^2 \right\} \label{4-2-10}
\end{eqnarray}
Now use the fact that $m = m\sum w_t = \sum w_t m$:
\begin{eqnarray}
\sigma_{\hat m}^2 &= &\E\, \left\{
	\left[\sum_t w_t (x_t - m) \right]^2 \right\} \label{4-2-11}\\
	&= &\E\, \left\{
	\left[\sum_t w_t (x_t - m) \right] \,
	\left[\sum_s w_s (x_s - m) \right] \right\} \label{4-2-12}\\
	&= &\E\, \left[\sum_t \sum_s w_t w_s (x_t - m)(x_s - m) \right]
	\label{4-2-13}
\end{eqnarray}
The step from (\ref{4-2-12}) to (\ref{4-2-13}) follows because
\begin{equation}
(a_1+a_2+a_3)\, (a_1+a_2+a_3) \eq {\rm sum \; of}
\quad
\left[
\begin{array}{ccc}
a_1 a_1 & a_1 a_2 & a_1 a_3 \\
a_2 a_1 & a_2 a_2 & a_2 a_3 \\
a_3 a_1 & a_3 a_2 & a_3 a_3
\end{array}
\right]
\end{equation}
The expectation symbol $\E$ may be regarded as another summation
that can be done after, as well as before, the sums on $t$ and $s$, so
\begin{equation}
\sigma_{\hat m}^2 \eq
\sum_t \sum_s w_t \, w_s\, \E\, \left[ (x_t - m)(x_s - m) \right]
\label{4-2-14}
\end{equation}
If $t\neq s$, then, because $x_t$ and $x_s$ are independent of each other,
the expectation $\E[(x_t - m)(x_s - m)]$ vanishes.
If $s = t$, then the expectation is the variance defined by (\ref{4-2-4}).
Expressing the result in terms of the Kronecker delta, $\delta_{ts}$
(which equals unity if $t=s$, and vanishes otherwise) gives
\begin{eqnarray}
\sigma_{\hat m}^2 &= & \sum_t \sum_s w_t \, w_s \, \sigma_x^2 \delta_{ts}
	\label{4-2-15}\\
\sigma_{\hat m}^2 &= & \sum_t w^2_t \, \sigma_x^2 \label{4-2-16} \\
\sigma_{\hat m} &= & \sigma_x \ \sqrt{ \sum_t w^2_t } \label{4-2-17}
\end{eqnarray}
For $n$ weights, each of size $1/n$,
the standard deviation of the sample mean is
\begin{equation}
\Lambda m_x \eq \sigma_{\hat m_x} \eq
\sigma_x \ \sqrt{ \sum^n_{t=1} \left( {1 \over n}\right)^2 } \eq
{\sigma_x \over \sqrt{n} }
\label{rand.vsm}
\end{equation}
This is the most important property of random numbers
that is not intuitively obvious.
Informally, the result (\ref{rand.vsm}) says this:
Given a sum $y$ of terms with random polarity, whose theoretical mean is $0$:
\begin{equation}
y \eq \underbrace{\pm 1 \pm 1 \pm 1 \cdots}_{n\ {\rm terms}}
\end{equation}
the sum $y$ is a random variable
whose standard deviation is $\sigma_y=\sqrt{n}=\Lambda y$.
An experimenter who doesn't know the mean is zero
would report the mean of $y$ is $\E (y) = \hat y \pm \Lambda y$,
where $\hat y$ is the experimental value.
\subsection{Dilemma in estimating a time-variable mean}
If you are trying to estimate the mean of a random series
that has a time-variable mean,
then you face a basic dilemma.
Including many numbers in the sum to get $\Lambda m$ to be small
conflicts with the possibility of seeing $m_t$ change during the measurement.
Let the signal be multiplied by a window function;
within this window we will estimate the mean of the signal.
Let $\Lambda T = n\,dt$ be the length of the window.
Then (\ref{rand.vsm}) becomes
\begin{equation}
{(\Lambda m)^2 \over \sigma^2 }\, {\Lambda T \over dt} \eq 1
\label{4-2-19} \label{rand.meanfluct}
\end{equation}
It is desirable to have both $\Lambda m$ and $\Lambda T$ as small as possible.
You see that the fluctuation in your estimation of the mean
drops off only as the square root of the sample size.
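\par
Equation (\ref{rand.vsm}) is easy to verify by simulation,
and doing so also answers the coin question that opened this section.
The sketch below (Python with {\tt numpy} assumed)
repeats the 1000-flip experiment over a large ensemble:
the observed scatter of the sample mean matches $\sigma_x/\sqrt{n}$,
and a head count outside the 400--600 range essentially never occurs.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, nens = 1000, 5000                        # flips per experiment, experiments
flips = rng.integers(0, 2, size=(nens, n))  # 0 = tails, 1 = heads
heads = flips.sum(axis=1)                   # head count of each experiment

sigma_x = 0.5                               # standard deviation of one flip
obs = flips.mean(axis=1).std()              # scatter of the sample mean
print("observed scatter :", round(float(obs), 4))
print("sigma_x/sqrt(n)  :", round(float(sigma_x / np.sqrt(n)), 4))
print("outside 400-600  :", int(np.sum((heads < 400) | (heads > 600))))
\end{verbatim}
\par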
Equation~(\ref{rand.meanfluct}) depends on the assumption
that successive samples of the time series are independent.
If they are not independent,
I imagined that things would always be worse,
and the ``$=$'' in equation~(\ref{rand.meanfluct})
would be replaced by a ``$\ge$''.
This turns out to be overly pessimistic.
In the unusual case that successive samples of the signal are anticorrelated,
the ``$=$'' becomes a ``$\le$'',
as shown in the following proof by Gilles Darche.
Observe what happens when the samples $x_t$ and $x_s$
are {\it not} independent.
Suppose the weights $w_t$ are uniform ($w_t=1/n$).
Following (\ref{4-2-14}) and (\ref{rand.vsm}):
\begin{eqnarray}
(\Lambda m)^2 &=& {\sigma^2\over n} +
	{1\over n^2}\sum\sum_{t\neq s}\E[(x_t-m)(x_s-m)] \\
	&=& {\sigma^2\over n} +
	{2\over n^2}\sum_{t=1}^{n-1}\sum_{s=t+1}^n\E[(x_t-m)(x_s-m)] \\
	&=& {\sigma^2\over n} +
	{2\over n^2}\sum_{t=1}^{n-1}\sum_{p=1}^{n-t}\E[(x_t-m)(x_{t+p}-m)]
\end{eqnarray}
Suppose $x_t$ is {\it stationary}:
this implies that $\E[(x_t-m)(x_{t+p}-m)]$ depends only on $p$, and not on $t$.
This is the autocorrelation of $x_t$ at lag $p$, which we write $s_p$.
So:
\begin{equation}
(\Lambda m)^2 \eq {\sigma^2\over n} +
	{2\over n^2}\sum_{t=1}^{n-1}\sum_{p=1}^{n-t}s_p \eq
	{\sigma^2\over n} + {2\over n^2}\sum_{p=1}^{n-1}\sum_{t=1}^{n-p}s_p
\end{equation}
Finally:
\begin{equation}
(\Lambda m)^2 \eq {\sigma^2\over n} + {2\over n^2}\sum_{p=1}^{n-1}(n-p)s_p
\label{sm-cor}
\end{equation}
A correction term appears, which can be positive or negative.
So $(\Lambda m)^2$ can indeed be larger or smaller than $\sigma^2/n$.
For example, with $n=2$, $(\Lambda m)^2=\sigma^2/2 + s_1/2$
will be smaller than $\sigma^2/2$
if the sequence $x_t$ is anticorrelated ($s_1<0$).
%\subsection {Sample variance}
%We have seen how to estimate the mean.
%To estimate the variance, we compute the {\it sample variance:}
%\begin{equation}
%{\hat \sigma^2} \eq
%{(x_1-{\hat m})^2+\cdots+(x_n-{\hat m})^2 \over n} \eq
%{\sum_{t=1}^n(x_t-{\hat m})^2\over n}
%\end{equation}
%For the mean we use the estimated mean
%${\hat m}$ since we do not know the true mean $m$.
%What is the error made in estimating the variance?
%Gilles--- I'm concerned that if all we are doing is computing
%a little bias, it won't have any experimental interest.
%People would just multiply the result by the trivial
%adjustment $n/(n-1)$.
%
%In fact, we don't need
%in first approximation to work on squared errors
%(as for the variance of the sample mean),
%because a bias will appear between ${\hat \sigma^2}$ and
%the true $\sigma^2$: the expected value of ${\hat \sigma^2}$ is not $\sigma^2$.
%Effectively:
%\begin{equation}
%\E[{\hat \sigma^2}] \eq
%{1\over n}\sum_{t=1}^n\E[(x_t-{\hat m})^2]
%\end{equation}
%Decomposing $(x_t-{\hat m})$ into $(x_t-m)$ and $(m-{\hat m})$, we get:
%\begin{eqnarray}
%\E[(x_t-{\hat m})^2] &=& \E[(x_t-m)^2+2(x_t-m)(m-{\hat m})+(m-{\hat m})^2] \\
%	&=& \sigma^2 + 2\E[(x_t-m)(m-{\hat m})] + {\sigma^2\over n}
%\end{eqnarray}
%Now:
%\begin{equation}
%\E[(x_t-m)(m-{\hat m})] \eq \E[(x_t-m){(m-x_1)+\cdots+(m-x_n)\over n}]
%\end{equation}
%Remembering that we suppose the $x_t$ independent of each other, we get:
%\begin{equation}
%\E[(x_t-m)(m-{\hat m})] \eq -{\E[(x_t-m)^2]\over n} \eq -{\sigma^2\over n}
%\end{equation}
%Finally:
%$$\E[(x_t-{\hat m})^2] \eq \sigma^2 -{\sigma^2\over n}$$
%\begin{equation}
%\E[{\hat \sigma^2}] \eq \sigma^2 - {\sigma^2\over n}
%\end{equation}
%
%So ${\hat \sigma^2}$ is expected to be smaller than $\sigma^2$,
%with an error of ${\sigma^2/ n}$.
%In the case of a 0-mean series, $\sigma^2=\E[x_t^2]=p$, %and we can write this result as: %\begin{equation} %{|\Lambda p|\over p}={1\over n} %\end{equation} %Once again, we have a dilemma between the accuracy of the estimation, %and the number of samples we have to use in the computation. %If the variance varies with time, %we want to use a small $n$, but we know %the accuracy of the computation will be poor. %We can as well rewrite this result as: %\begin{equation} %{|\Lambda p|\over p}{\Lambda T\over dt}=1 %\label{4-2-25} %\end{equation} % \subsection{Variance of the sample variance} \par The {\it variance of the sample variance} arises in many contexts. Suppose you want to measure the storminess of the ocean. You measure water level as a function of time and subtract the mean. The storminess is the variance about the mean. You measure the storminess in one minute and call it a sample storminess. You compare it to other minutes and other locations and you find they are not all the same. To characterize these differences, we need the {\it variance of the sample variance} $\sigma_{\hat \sigma^2}^2$: once we have an estimation of the variance (the sample variance), we compute the error in this estimation. \par Given a sample of a random time series $x_t$, the true variance of $x_t$ is $\sigma^2_x$, with: \begin{equation} \sigma^2_x \eq \E[(x_t-m)^2] \end{equation} We have an {\it estimate} of $\sigma_x$ with: \begin{equation} \hat \sigma_x^2 \eq {(x_1-m)^2+\cdots+(x_n-m)^2\over n} \end{equation} (This sample variance $\hat \sigma_x^2$ is computed using the true mean $m$. You can get a slightly different result using $\hat m$). We would like to know the error we make in this estimation, i.e. the variance of the sample variance. We define another series by $y_t = (x_t-m)^2$. Notice that ${\hat \sigma_x^2}$ is actually the {\it sample mean} of the series $y_t$, whereas $\sigma_x^2$ is its true mean. Thus we apply equation~(\ref{rand.vsm}) to $y$, supposing the $y_t$ (i.e. $x_t$) independent: \begin{equation} \sigma_{\hat \sigma_x^2} \ = \ \Lambda m_y \ = \ {\sigma_y \over \sqrt{n}} \label{rand.vsv} \end{equation} Notice that: \begin{equation} \sigma_y^2 \eq \E[y_t^2] - (\E[y_t])^2 \eq \E[(x_t-m)^4] - \sigma_x^4 \label{rand.fourth} \end{equation} Thus to know the fluctuation in $\sigma^2$ we need to know $\sigma^4$, and to get the fluctuation in that we need to know... I know of only two ways to escape this infinite regress. First we could simply measure (\ref{rand.fourth}) on our data sample. Alternately we could assume something about probability functions. For example, if the data is known to be drawn from a Gaussian probability function then it can be shown that $\E[(x_t-m)^4]=3\sigma_x^4$. For any probability function there is a fixed ratio between the second and fourth moments. Proceeding with the Gaussian assumption we have \begin{equation} \sigma_{\hat \sigma_x^2} \ \geq \ {\sigma_x^2\over \sqrt{n}} \end{equation} Choosing to work with $n$ points from a signal (which is theoretically of infinite duration) we have $\Lambda T =n\, dt$. If the sequence is zero mean, we can rewrite the previous inequality in terms of the power $p_x$, since $\sigma_x^2=p_x$. \begin{equation} \left( {\Lambda p_x \over p_x} \right)^2 {\Lambda T \over dt} \quad \geq 1 \label{4-2-25} \end{equation} Darche again points out that if the samples $x_t$ were {\it not} independent, the inequality~(\ref{4-2-25}) would be even more loose. 
To see this, recall equation~(\ref{sm-cor})
and apply it to the sequence $y_t=(x_t-m)^2$:
\begin{equation}
(\Lambda m_y)^2 \eq {\sigma_y^2\over n} + {2\over n^2}\sum_{p=1}^{n-1}(n-p)s_p
\end{equation}
where the $s_p$ are the autocorrelations of the sequence $y_t$.
But the autocorrelations of $y$ are generally positive
(for Gaussian $x_t$, the autocorrelation of $y$ at lag $p$
is proportional to the square of the autocorrelation of $x$ at lag $p$).
Therefore now:
\begin{equation}
\Lambda m_y \ > \ {\sigma_y \over \sqrt{n}}
\end{equation}
which can be compared to equation~(\ref{rand.vsv}).
Anyway, in both cases (independent or not), we are faced with the same dilemma:
if we want an accurate estimation of the variance,
we need a large number of samples,
which conflicts with the possibility of a time-varying variance.
%\subsubsection{Variance of the sample variance ad infinitum}
%\par
%Another more esoteric occurrence of the variance of the sample variance
%arises when you get fussy about error bounds.
%In a scientific paper you may find
%\begin{eqnarray}
%m &= & \hat{m} \pm \Lambda m \label{4-2-21} \\
%  &= & \hat{m} \pm \sigma/\sqrt{n} \label{4-2-22}
%\end{eqnarray}
%but since the variance $\sigma^2$ often is not known either,
%it is necessary to use the estimated $\hat{\sigma}$,
%that is
%\begin{equation}
%m \eq \hat{m} \pm \hat \sigma/\sqrt{n} \label{4-2-23}
%\end{equation}
%Of course (\ref{4-2-23}) really is not right
%because we really should add something
%to show that we use an estimation of $\sigma$,
%so that we have an additional uncertainty
%from error in $\hat{\sigma}$: we should include
%the variance of the sample variance. As we saw,
%we should have to estimate $\E(x_t^4)$, so again
%including another error...
%At this stage practical people generally turn their
%attention elsewhere leaving the statisticians
%who might compute the result from a presumed probability function,
%and the relation (\ref{4-2-23}) can be considered as sufficient.
%
\exer
\item Suppose the mean of a sample of random numbers
is estimated by a triangle weighting function, i.e.,
$$\hat{m} \eq s \sum^n_{i = 0} (n - i)\, x_i $$
Find the scale factor $s$ so that $\E(\hat{m}) = m$.
Calculate $\Lambda m$.
Define a reasonable $\Lambda T$.
Examine the uncertainty relation.
\item A random series $x_t$ with a possibly time-variable mean
may have the mean estimated by the feedback equation
$$\hat{m}_t \eq (1 - \epsilon) \hat{m}_{t-1} + bx_t$$
\begin{description}
\item[a.] Express $\hat{m}_t$ as a function of $x_t, x_{t-1}, \ldots ,$
and not $\hat{m}_{t - 1}$.
\item[b.] What is $\Lambda T$, the effective averaging time?
\item[c.] Find the scale factor $b$ so that if $m_t = m$,
then $\E(\hat{m}_t) = m.$
\item[d.] Compute the random error
$\Lambda m = \sqrt{\E(\hat{m} - m)^2 } $.
[{\sc hint:} $\Lambda m$ goes to $\sigma \sqrt{\epsilon/2 }$
as $\epsilon \rightarrow 0$].
\item[e.] What is $(\Lambda m)^2 \Lambda T$ in this case?
\end{description}
%\item Show that
%$$(\Lambda P)^2 = {1 \over n} [\E(x^4) - \sigma^4]$$
\exerend
\section{SPECTRAL FLUCTUATIONS}
Recall the basic model of time series analysis,
namely random numbers passing through a filter.
A sample of input, filter, and output amplitude spectra
is shown in Figure~\ref{rand-filter}.
\sideplot{rand-filter}{1.85in} {\FIGS/rand/filter}
{(rand-filter) Random numbers into a filter.
Top is spectrum of random numbers.
Middle is spectrum of filter.
Bottom is spectrum of filter output. }
From the spectrum of the output we can guess the spectrum of the filter,
but the figure shows there are some limitations in our ability to do so.
Let us analyze this formally.
\par
Observations of sea level for a long period of time
can be summarized in terms of a few statistical averages,
such as the mean height $m$ and the variance $\sigma^2$.
Another important kind of statistical average
for use on such geophysical time series is the {\it power spectrum.}
Many mathematical models explain only such statistical averages of data
and not the data themselves.
To recognize certain pitfalls and understand certain fundamental limitations
on work with power spectra,
we first consider the idealized example of random numbers.
\plot{rand-auto}{2.15in} {\FIGS/rand/auto}
{ (rand-auto) Autocorrelation and spectrum of random numbers. }
\par
Figure~\ref{rand-auto} shows a signal that is a burst of noise,
its Fourier transform, the transform squared,
and its inverse transform, the autocorrelation.
Here the FT squared is the same
as the more usual FT times its complex conjugate,
because the noise-burst signal is even, so its FT is real.
\par
Notice that the autocorrelation has a big spike at zero lag.
This spike represents the correlation of the random numbers with themselves.
The other lags are much smaller.
They represent the correlation of the noise burst with itself shifted.
Theoretically, the noise burst is not correlated with itself shifted;
these small fluctuations result from the finite extent of the noise sample.
\par
Imagine many other copies of Figure~\ref{rand-auto}.
Ensemble averaging would mean adding these other autocorrelations,
or equivalently adding these other spectra.
The fluctuations beside the central lobe of the autocorrelation
would be destroyed by ensemble averaging.
The fluctuations in the spectrum would be smoothed out by ensemble averaging.
The {\it expectation} of the autocorrelation is that it is an impulse at zero lag.
The {\it expectation} of the spectrum is that it is a constant, namely
\begin{equation}
\E[\hat S(Z)] \eq S(Z) \eq {\rm const}
\end{equation}
\subsection{Paradox: $n \rightarrow \infty$ versus the ensemble average}
\par
Now for the paradox.
Imagine $n \rightarrow \infty$ in Figure~\ref{rand-auto}.
Will we see the same limit as the ensemble average?
Here are two contradictory points of view:
\begin{itemize}
\item
For increasing $n$,
the fluctuations on the nonzero autocorrelation lags get smaller,
so the autocorrelation should tend to an impulse function.
Its Fourier transform, the spectrum, should tend to a constant.
\item
But, for increasing $n$,
as in Figure~\ref{rand-nrand} on page~\pageref{'rand-nrand'},
the spectrum does not get any smoother,
because the FTs should still look like random noise.
\end{itemize}
We will see that the first idea contains a false assumption.
The autocorrelation does tend to an impulse,
but the fuzz around the sides can't be ignored---although the fuzz
tends to zero amplitude, it also tends to infinite extent,
and the product of zero with infinity here
tends to have the same energy as the central impulse.
\par
To examine this further,
let's see how these autocorrelations decrease to 0
with $n$ (the number of samples).
Figure~\ref{rand-fluct} shows the autocorrelation samples
as a function of $n$, with $n$ increasing in steps by factors of four.
Thus $\sqrt{n}$ increases by factors of two.
\plot{rand-fluct}{2.48in} {\FIGS/rand/fluct}
{ (rand-fluct) Autocorrelation as a function of number of data points.
The random noise series (even) lengths are 60, 240, 960. }
Each autocorrelation in the figure was normalized at zero lag.
We see the fluctuations at the nonzero lags of the autocorrelation
getting smaller as $n$ grows.
We also observe that the ratios between the values for the first nonzero lags
and the value at lag 0 roughly fit $1/\sqrt{n}$.
Notice also that the fluctuations drop off with lag.
The drop-off goes to zero at a lag equal to the sample length.
This is because the number of terms in the autocorrelation
diminishes to zero at that lag.
A first impression is that the autocorrelation fits a triangular envelope.
More careful inspection shows that the triangle bulges upward at wide offsets,
or large values of $k$ (slightly clearer on Figure~\ref{rand-auto}).
Let's explain all these observations.
Each lag of the autocorrelation is defined as:
\begin{equation}
s_k \eq \sum_{t=1}^{n-k}x_tx_{t+k}
\end{equation}
where $(x_t)$ is a sequence of 0-mean {\it independent} random variables.
Thus, the expectations of the autocorrelations can be easily computed:
\begin{eqnarray}
\E(s_0) &\eq& \sum_1^n\E(x_t^2) \eq n\sigma^2_x \\
\E(s_k) &\eq& \sum_1^{n-k}\E(x_t)\E(x_{t+k}) \eq 0.
	\mbox{\hspace{0.5cm}(for $k\geq 1$)}
\end{eqnarray}
In Figure~\ref{rand-fluct},
the value at lag 0 is more or less $n\sigma^2_x$ (before normalization),
the deviation being more or less the {\it standard deviation}
(square-root of the variance) of $s_0$.
On the other hand, for $k\geq 1$, as $\E(s_k)=0$,
the value of the autocorrelation is directly the deviation of $s_k$,
i.e.~something close to its standard deviation.
We now have to compute the variances of the $s_k$.
Let's write:
\begin{equation}
s_k \eq \sum_{t=1}^{n-k}y_k(t)
	\mbox{\hspace{1.0cm}(where $y_k(t) = x_tx_{t+k}$)}
\end{equation}
So: $s_k=(n-k){\hat m_{y_k}}$,
where ${\hat m_{y_k}}$ is the sample mean of $y_k$ with $n-k$ terms.
If $k\neq 0$, $\E(y_k)=0$, and we apply (\ref{rand.vsm}) to ${\hat m_{y_k}}$:
\begin{equation}
\E ({\hat m_{y_k}}^2) \eq {\sigma^2_{y_k} \over n-k}
\end{equation}
The computation of $\sigma^2_{y_k}$ is straightforward:
\begin{equation}
\sigma^2_{y_k} \eq \E(x^2_tx^2_{t+k}) \eq \E(x_t^2)\E(x_{t+k}^2) \eq \sigma^4_x \;,
\end{equation}
So, for the autocorrelation $s_k$:
\begin{equation}
\E(s_k^2) \eq (n-k)\sigma^2_{y_k} \eq (n-k)\sigma^4_x \eq
	{n-k\over n^2}(\E(s_0))^2
\end{equation}
Finally, as $\E(s_k)=0$, we get:
\begin{equation}
\sigma_{s_k} \eq \sqrt{\E(s_k^2)} \eq \E(s_0) \ {\sqrt{n-k} \over n}
\label{rand.vsvcor}
\end{equation}
This final result explains the properties observed in Figure~\ref{rand-fluct}.
As $n\rightarrow\infty$, all the nonzero lags tend to 0 compared to the 0-lag,
since $\sqrt{n-k}/n$ tends to 0.
Then, the first lags ($k \ll n$) have nearly equal standard deviations,
about $\E(s_0)/\sqrt{n}$,
while the envelope $\sqrt{n-k}/n$ accounts for the roughly triangular appearance,
together with its slight upward bulge, at the larger lags.
The same fluctuations are present in the spectrum,
which is the Fourier transform of the autocorrelation.
A raw spectral estimate made from a sample of duration $\Lambda T$
fluctuates by roughly $\Lambda p/p \approx 1$ at each frequency;
smoothing that estimate over a bandwidth $\Lambda F$
averages about $\Lambda F\,\Lambda T$ roughly independent values,
so the relative fluctuation of the smoothed spectral estimate obeys
\begin{equation}
\Lambda F \ \Lambda T \, \left( {\Lambda p \over p} \right)^2
\quad \geq \quad 1
\label{4-3-9}
\end{equation}
We find this idea on Figure~\ref{rand-taper}.
I said earlier that $\Lambda F(\Lambda p/p)^2$ should be a constant.
Actually:
\begin{equation}
\Lambda F \sim 1/\Lambda T_{\rm auto}
\mbox{\hspace{0.5cm}and\hspace{0.5cm}}
\left({\Lambda p\over p}\right)^2 \sim {\Lambda T_{\rm auto}\over\Lambda T}
\end{equation}
So we find in this case that:
\begin{equation}
\Lambda F \, \Lambda T\, \left( {\Lambda p \over p}\right)^2 \ \sim 1
\end{equation}
\section{CROSSCORRELATION AND COHERENCY}
With two time series we can see how they are related.
\subsection{Correlation}
Correlation is a concept similar to cosine.
A cosine measures the angle between two vectors.
It is given by the dot product of the two vectors divided by their magnitudes \begin{equation} c \eq { ({\bf x} \cdot {\bf y}) \over \sqrt{ ({\bf x} \cdot {\bf x}) ({\bf y} \cdot {\bf y}) } } \end{equation} which is the sample normalized correlation we first encountered on page~\pageref{'normalized_correlation'} as a quality measure of fitting one image to another. \par Formally, the normalized correlation is defined with $x$ and $y$ being zero-mean scalar random variables instead of sample vectors. So the summation is an expectation instead of a dot product. \begin{equation} c \eq {\E(xy) \over \sqrt{ \E(x^2)\, \E(y^2) } } \end{equation} \par There is a practical difficulty when the ensemble averaging is simulated over a sample. The problem arises with small samples and is most dramatically illustrated for a sample with only one element. Then the sample correlation is \begin{equation} \hat{c} \eq {xy \over |x| \, |y|} \eq \pm 1 \end{equation} regardless of what value the random number $x$ or the random number $y$ should take. For any $n$, the sample correlation $\hat{c}$ scatters away from zero. Such scatter is called {\it bias.} The topic of bias and variance of coherency estimates is a complicated one, but a rule of thumb seems to be to expect bias and variance of $\hat{c}$ about $1/\sqrt{n}$ for samples of size $n$. Bias, no doubt, accounts for many false ``discoveries'' since cause-and-effect is often inferred from correlation. \subsection{Coherency} The concept of {\it coherency} in time-series analysis is analogous to correlation. Taking $x_t$ and $y_t$ to be time series, they may have a mutual relationship that depends on time-delay, scaling, or even filtering. For example, perhaps $Y(Z) = F(Z) X(Z) + N(Z)$ where $F(Z)$ is a filter and $n_t$ is unrelated noise. The generalization of the correlation concept is to define {\it coherency} by \begin{equation} C \eq {\E\left[ X \left( {1 \over Z}\right) \, Y(Z) \right] \over \sqrt{\E(\overline {X} X)\, \E(\overline {Y} Y) } } \end{equation} \par {\it Correlation} is a real scalar. {\it Coherency} is a complex function of frequency and it expresses the frequency dependence of correlation. In forming an estimate of coherency it is always essential to simulate ensemble averaging. Note that if the ensemble averaging were to be omitted, the coherency (squared) calculation would give \begin{equation} |C|^2 \eq \overline{C} C \eq {(\overline{\overline{X} Y}) (\overline{X} Y) \over (\overline{X} X) (\overline{Y} Y)} \eq {\rm const}(\omega) \end{equation} which states that the coherency squared is unity independent of the data. Because correlation scatters away from zero we find that coherency squared is biased away from zero. \subsection{Bispectrum} The {\it bispectrum} is another statistic that is used to search for nonlinear interactions. For a Fourier transform $F(\omega)$ it is defined by \begin{equation} B(\omega_1,\omega_2) \eq \E [ F(\omega_1) F(\omega_2) \overline{F(\omega_1+\omega_2)}] \end{equation} Another statistic defined analogously is the {\it bispectral coherency}. In seismology, signals rarely have adequate duration for time averages to make sensible bispectral estimates. \section{SMOOTHING IN TWO DIMENSIONS} In previous sections we assumed one-dimensional models, and that you could easily do the smoothing. In two dimensions things are nominally much more costly, but a few things have tricks, so they can be easy too. Here I tell you my favorite trick for smoothing in two dimensions. 
You can convolve with a two-dimensional (almost) Gaussian weighting function
{\it of any area}
for a cost of only sixteen additions per output point.
(You might expect a cost proportional to the area.)
\subsection{Tent smoothing}
\par
First recall triangular smoothing in one dimension
documented by subroutine {\tt triangle()} on page~\pageref{triangle}.
This routine is easily adapted to two dimensions.
First you smooth in the direction of the 1-axis for all values of the 2-axis.
Then you do the reverse, convolve on the 2-axis for all values of the 1-axis.
Now recall that smoothing with a rectangle is especially fast,
because you do not need to add all the points within the rectangle.
You merely update the sum for a shifted rectangle
by adding a point at one end
and subtracting a point at the other end of the rectangle.
In other words, the cost of smoothing
is independent of the width of the rectangle.
And no multiplies are required.
To get a triangle, you smooth twice with rectangles.
\par
Figure~\ref{rand-pyram} shows the application of triangle smoothers
on two pulses in a plane.
The plane was first convolved with a triangle on the 1-axis
and then with another triangle on the 2-axis.
This takes each impulse and smooths it
into an interesting pyramid that I call a tent.
\plot{rand-pyram}{1.69in} {\FIGS/rand/pyram}
{ (rand-pyram) Two impulses in two dimensions
filtered with a triangle function along each spatial axis.
Left: bird's eye view.
Right: contours of constant altitude $z$. }
The expected side-boundary effect is visible on the foreground tent.
In the contour plot (of the same 120 by 40 mesh)
you see that the cross-section of the tent is rectangular near the base
and diamond shaped near the top.
The altitude of the $j$th tent face is $z = a(x-x_j)(y-y_j)$
where $(x_j, y_j)$ is the location of a corner and $a$ is a scale.
The tent surface is parabolic (like $z=x^2$) along $x=y$
but linear along lines parallel to the axes.
A contour of constant $z$ is the (hyperbolic) curve $y = a+b/(x+c)$
(where $a$, $b$, and $c$ are different constants on each of the four faces).
\label{'2Dgauss'}
\subsection{Gaussian mounds}
In Figure~\ref{rand-mound} you see the result of applying tent smoothing twice.
\plot{rand-mound}{1.47in} {\FIGS/rand/mound}
{ (rand-mound) Two impulses in two dimensions
filtered twice on each axis with a triangle function.
Left: bird's eye view.
Right: contours of constant altitude $z$. }
Notice that the contours, instead of being diamonds and rectangles,
have become much more circular.
The reason for this is briefly stated as follows:
Convolution of a rectangle with itself many times
approaches the limit of a Gaussian function.
(This is a well-known result called the ``central-limit theorem'';
it is explained in the next section.)
It happens that the convolution of a triangle with itself
is already a good approximation to the Gaussian function $z(x)= e^{-x^2}$.
The convolution in $y$ gives $z(x,y)=e^{-x^2-y^2}= e^{-r^2}$
where $r$ is the radius of the circle.
When the triangle on the 1-axis differs in width from the triangle on the 2-axis,
then the circles become ellipses.
\subsection{Speed of 2D Gaussian smoothing}
\par
This approximate Gaussian smoothing in two dimensions is very fast.
Only eight add-subtract pairs are required per output point,
and no multiplies at all except for final scaling.
The compute time is independent of the widths of the Gaussian(!)
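\par
Subroutine {\tt triangle()} itself is not reproduced here,
but the arithmetic behind the speed claim is easy to sketch.
The fragment below (Python with {\tt numpy} assumed,
so it is a model of the idea rather than the routine used for the figures)
implements the running rectangle sum
with one add and one subtract per output point,
and cascades four rectangle passes along each axis,
equivalent to two triangle passes per axis,
turning an impulse on a 120 by 40 mesh
into a quasi-Gaussian mound like the ones in Figure~\ref{rand-mound}.
The divisions by the rectangle width could be deferred to a single final scale,
so no multiplies are needed inside the loops.
\begin{verbatim}
import numpy as np

def box1d(a, w):
    # running sum of width w: one add and one subtract per point
    c = np.concatenate(([0.0], np.cumsum(a)))
    return (c[w:] - c[:-w]) / w

def mound2d(z, w, passes=4):
    # 'passes' rectangle smoothings per axis; four of them
    # amount to two triangle smoothings per axis
    for _ in range(passes):
        z = np.apply_along_axis(box1d, 0, z, w)
        z = np.apply_along_axis(box1d, 1, z, w)
    return z

z = np.zeros((120, 40))
z[60, 20] = 1.0                 # an impulse in the plane
m = mound2d(z, 5)               # nearly circular contours, like exp(-r*r)
print(m.shape, float(m.max()))  # array shrinks; no padding in this sketch
\end{verbatim}
\par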
You should understand this if you understood
that one-dimensional convolution with a rectangle
requires just one add-subtract pair per output point.
Thus this technique should be useful
in the process known as 2-dimensional slant stack.
\exer
\item Deduce that a 2-D filter
based on the subroutine {\tt triangle()} on page~\pageref{triangle}
to produce the 2-D quasi-Gaussian mound in Figure~\ref{rand-mound}
has a gain of unity at zero (two-dimensional) frequency
(also known as $(k_x,k_y)=0$).
\item Let the 2-D quasi-Gaussian filter be known as $F$.
Sketch the spectral response of $F$.
\item Sketch the spectral response of $1-F$ and suggest a use for it.
\item The tent filter can be implemented
by smoothing first on the 1-axis and then on the 2-axis.
The conjugate operator smooths first on the 2-axis and then on the 1-axis.
The operator should be self-adjoint (equal to its conjugate),
unless some complication arises at the sides or corners.
How can a dot product test be used
to see if a tent filter program is self-adjoint?
\exerend
\section{PROBABILITY AND THE CENTRAL-LIMIT THEOREM}
One way to obtain random integers from a known probability function
is to write integers on slips of paper and place them in a hat.
Draw one slip at a time.
After each drawing, replace the slip in the hat.
The probability of drawing the integer $i$ is given by the ratio $a_i$
of the number of slips containing the integer $i$
to the total number of slips.
Obviously the sum over $i$ of $a_i$ must be unity.
Another way to get random integers is to throw one of a pair of dice.
Then all $a_i$ equal zero
except $a_1 = a_2 = a_3 = a_4 = a_5 = a_6 = {1 \over 6}$.
The probability that the integer $i$ will occur on the first drawing
and the integer $j$ will occur on the second drawing is $a_i a_j$.
If you draw two slips or throw a pair of dice,
then the probability that the sum of $i$ and $j$ equals $k$
is the sum of all the possible ways this can happen.
\begin{equation}
c_k \eq \sum_i a_i a_{k-i}
\label{4-5-1}
\end{equation}
Since this equation is a convolution,
we may look into the meaning of the $Z$ transform
\begin{equation}
A(Z) \eq \cdots + a_{-1} Z^{-1} + a_0 + a_1 Z + a_2 Z^2 + \cdots
\label{4-5-2}
\end{equation}
In terms of $Z$ transforms,
the probability that $i$ plus $j$ equals $k$
is simply the coefficient of $Z^k$ in
\begin{equation}
C(Z) \eq A(Z)\, A(Z)
\label{4-5-3}
\end{equation}
\exer
\item A random number generator provides random integers 2, 3, and 6
with probabilities $p(2)=1/2$, $p(3)=1/3$, and $p(6)=1/6$.
What is the probability that any given integer $n$
is the sum of three of these random numbers?
(HINT: Leave the result in the form of coefficients of a complicated polynomial.)
\exerend
\subsection{The central-limit theorem}
The central-limit theorem is perhaps the most important theorem
in the fields of probability and statistics.
A derivation of the central-limit theorem explains why
the Gaussian probability function is so frequently encountered in nature,
not just in physics but also in the biological and social sciences.
No experimental scientist should be unaware of the basic ideas behind this theorem.
Although the result is deep and is even today the topic of active research,
we can quickly go to the basic idea.
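\par
Before taking the limit of many terms,
equation (\ref{4-5-3}) itself is easy to check numerically:
multiplying the probability polynomials
is the same as convolving the tables of probabilities.
The sketch below (Python with {\tt numpy} assumed)
convolves the single-die probabilities with themselves
and recovers the familiar $1/36, 2/36, \ldots, 6/36, \ldots, 1/36$ table
for the sum of a pair of dice.
\begin{verbatim}
import numpy as np

a = np.zeros(7)          # a_i indexed by face value; a_0 stays zero
a[1:7] = 1.0 / 6.0       # a_1 = ... = a_6 = 1/6 for one die

c = np.convolve(a, a)    # c_k = sum of a_i a_(k-i), i.e. C(Z) = A(Z) A(Z)
for k in range(2, 13):
    print("P(sum = %2d) = %d/36" % (k, round(36 * c[k])))
\end{verbatim}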
\par
From equation (\ref{4-5-3}),
if we add $n$ random numbers,
the probability that the sum of them equals $k$
is given by the coefficient of $Z^k$ in
\begin{equation}
G(Z) \eq [A(Z)]^n
\label{4-5-4}
\end{equation}
The central-limit theorem of probability says that
as $n$ goes to infinity,
the polynomial $G(Z)$ goes to a special form,
almost regardless of the specific polynomial $A(Z)$.
The specific form is such that a graph of the coefficients of $G(Z)$
comes closer and closer to fitting under the envelope
of the bell-shaped Gaussian function.
This happens because if you raise any function to a high enough power,
eventually all you can see is the highest value on the function
and its immediate environment,
i.e.~the second derivative there.
We already saw an example of this in Figure~\ref{conv-gauss}.
Exceptions to the central-limit theorem arise
(1) when there are multiple maxima of the same height, and
(2) where the second derivative vanishes at the maximum.
\plot{rand-clim}{1.11in}{\FIGS/rand/clim}{
(rand-clim) Left: wiggle plot style.
Middle: perspective.
Right: contour. }
\par
Although the central-limit theorem tells you
that a Gaussian function is the limit as $n \rightarrow \infty$,
it doesn't say anything about how fast the limit is attained.
To test it, I plotted the coefficients of $(.25/Z+.5+.25Z)^n$
for large values of $n$.
This signal consists of scaled binomial coefficients.
To keep signals in a suitable amplitude scale, I multiplied them by $\sqrt{n}$.
Figure~\ref{rand-clim} shows views of the coefficients
of $\sqrt{n}(.25/Z+.5+.25Z)^n$ (horizontal axis)
versus $\sqrt{n}$ (vertical axis).
We see that scaling by $\sqrt{n}$ has kept signal peak amplitudes constant.
We see that the width of the signal increases linearly with $\sqrt{n}$.
The contours of constant amplitude show that the various orders
are nearly perfectly self-similar, with the width stretching.
\todo{Robust estimation in robust.tex}
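\par
The experiment behind Figure~\ref{rand-clim} is easy to repeat.
The sketch below (Python with {\tt numpy} assumed)
raises $(.25/Z+.5+.25Z)$ to the $n$-th power by repeated convolution,
checks that the peak scaled by $\sqrt{n}$ stays essentially constant
(near $1/\sqrt{\pi}$),
and measures the pointwise misfit between the coefficients
and a Gaussian of matching mean and variance;
the misfit shrinks as $n$ grows.
\begin{verbatim}
import numpy as np

a = np.array([0.25, 0.5, 0.25])    # coefficients of .25/Z + .5 + .25 Z
var1 = 0.25 + 0.25                 # variance of one term (values -1, 0, +1)

for n in (16, 64, 256):
    g = np.array([1.0])
    for _ in range(n):
        g = np.convolve(g, a)      # coefficients of [A(Z)]^n
    k = np.arange(len(g)) - n      # lag axis for the powers Z^(-n) ... Z^n
    var = n * var1
    gauss = np.exp(-k ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    print("n=%4d  sqrt(n)*peak=%.4f  max misfit=%.1e"
          % (n, np.sqrt(n) * g.max(), np.max(np.abs(g - gauss))))
\end{verbatim}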