%+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
%+                                                                           +
%+ NOTICE: This article is also available formatted with illustrations.     +
%+                                                                           +
%+ SEE:  http://sepwww.stanford.edu/oldreports/sep61/                        +
%+                                                                           +
%+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
\chapter{Resolution and random signals}
\setcounter{chapter}{\RAND}
The accuracy of measurements on observed signals
is limited not only by practical realities,
but also by certain fundamental principles.
A well-known example is the time-bandwidth product
in Fourier transformation theory
called the ``uncertainty principle.''
\par
Observed signals often look random
and often they are modeled by filtered random numbers.
In this chapter we will see many examples of signals built from random numbers
and see how the nomenclature of statistics applies to them.
Fundamentally, this chapter characterizes {\it resolution}:
the resolution of frequency and arrival time,
and the statistical resolution of signal amplitude and power
as functions of time and frequency.
\par
We will see $\sqrt{n}$ popping up everywhere.
For example, signals that are theoretically uncorrelated
generally appear to be weakly correlated at a level of $1/\sqrt{n}$
where $n$ is the number of independent points in the signal.
\par
Measures of resolution
(which are called variances, tolerances, uncertainties,
bandwidths, durations, spreads, spans, etc.)
often interact with one another,
so that an experimental change that reduces one
must necessarily increase another or some combination of the others.
This chapter exhibits basic cases where such conflicting interactions occur.
\par
To avoid confusion I introduce the unusual notation $\Lambda$
where $\Delta$ is commonly used.
Notice the letter $\Lambda$ resembles the letter $\Delta$
and $\Lambda$ connotes length without being confused with wavelength.
Lengths on the time and frequency axes are defined as follows.
\begin{tabbing}
indent \= dt,,df,space \= description.conjugate.conjugate.conjugate \= \kill
\>$dt$, $df$ \>mesh intervals in time and frequency \\
\>$\Delta t$, $\Delta f$ \>mesh intervals in time and frequency \\
\>$\Delta T$, $\Delta F$ \>extent of time and frequency axis\\
\>$\Lambda T$, $\Lambda F$ \>time duration and spectral bandwidth of a signal
\end{tabbing}
\par
I wish I could give you formal definitions of
time duration $\Lambda T$ and spectral bandwidth $\Lambda F$.
A variety of defining equations are easy to write, and many are in general use.
The main idea is that the time span $\Lambda T$ or the frequency span $\Lambda F$
should be able to include most of the energy but need not contain it all.
The time duration of a damped exponential function is infinite
if by duration you mean the span of nonzero function values.
However, for practical purposes the time span is generally chosen
as the time required for the amplitude to decay to $e^{-1}$ of its original value.
For many functions the span is defined as the distance between points
on the time or frequency axis
where the curve (or its envelope) drops to half of the maximum value.
Since there is no mathematically tractable and universally acceptable definition
for time span and spectral bandwidth,
this chapter tends to be more descriptive than analytic.
\section{TIME-FREQUENCY RESOLUTION}
\par
Scaling a function to be narrower in one domain
also scales it to be wider in the other domain.
This is a consequence of Fourier transforms being built from $e^{i\omega t}$.
Scaling $\omega$ implies inverse scaling of $t$
to keep the product $\omega t$ constant.
For example, the FT of a rectangle is a sinc.
Making the rectangle narrower broadens the sinc in proportion
because $\omega t$ is constant.
Here we will quantify this relationship
between the signal width $\Lambda T$ and its spectral bandwidth $\Lambda F$.
\par
A pure sinusoidal wave has a clearly defined frequency,
but it is spread over the infinitely long time axis.
At the other extreme is a delta function,
which is nicely compressed to a point on the time axis
but contains a mixture of all frequencies.
Here we examine how the width of a function in one domain
relates to that in the other.
By the end of this section, we will conclude that for any function,
the time duration $\Lambda T$ and the spectral bandwidth $\Lambda F$ are related by
\begin{equation}
\Lambda F \; \Lambda T \quad \geq \quad 1
\label{4-1-1}
\end{equation}
This relationship is often called the {\it uncertainty principle.}
\par
Since we are unable to find a precise and convenient analysis
for the definitions of $\Lambda F$ and $\Lambda T$,
the inequality~(\ref{4-1-1}) is not a precise mathematical statement.
What is important is that rough equality in (\ref{4-1-1})
is observed for many simple functions,
but for others the inequality can be extremely slack (far from equal).
Strong {\it inequality} arises from all-pass filters.
An all-pass filter leaves the spectrum unchanged, and hence $\Lambda F$ unchanged,
but it can spread out the signal arbitrarily, increasing $\Lambda T$ arbitrarily.
Thus the time-bandwidth product is unbounded for all-pass filters.
Some people say that the Gaussian function has the minimum product in~(\ref{4-1-1}),
but that really depends on a particular method of measuring
$\Lambda F$ and $\Lambda T$.
\subsection{Counterexample refuted}
\par
It is easy to misunderstand the uncertainty principle.
An oversimplification of it is to say that it is
``impossible to know the frequency at any particular time.''
This oversimplification leads you to think about a truncated sinusoid,
such as in Figure~\ref{rand-windcos}.
You know the frequency exactly, so $\Lambda F$ seems zero,
whereas $\Lambda T$ is finite, and this seems to violate (\ref{4-1-1}).
But what the figure shows is that the truncation of the sinusoid
has broadened the frequency band.
More particularly, the impulse function in the frequency domain
has been convolved with the sinc function
that is the Fourier transform of the truncating rectangle function.
\plot{rand-windcos}{1.00in} {\FIGS/rand/windcos}
{(rand-windcos) Windowed sinusoid and its Fourier transform.
\todo{Redo with cosine on right edge.} }
\plot{rand-uncertain}{2.56in} {\FIGS/rand/uncertain}
{(rand-uncertain) Sampled Gaussian functions and their Fourier transforms
for vectors of length $n$ = 16, 64, and 256. }
\subsection{Measuring the time-bandwidth product}
\par
Now examine Figure~\ref{rand-uncertain}
containing sampled Gaussian functions and their Fourier transforms.
The Fourier transform of a Gaussian is well known to be another Gaussian function,
as the plot confirms.
I adjusted the width of each Gaussian
so that it would be about equal in each of the two domains.
The Gaussians were sampled at various values of $n$,
increasing in steps by a factor of 4.
You can measure the width dropping by a factor of 2 at each step.
For those of you who have already learned about the uncertainty principle
it may seem paradoxical that the function's width is dropping
in both the time domain and the frequency domain.
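\par
As a small numerical companion to Figure~\ref{rand-uncertain}
(not one of the original figures),
the sketch below builds a sampled Gaussian for several values of $n$,
Fourier transforms it,
and measures a width in each domain
on the physical meshes $dt$ and $df$ with $dt\,df=1/n$.
It assumes Python with the {\tt numpy} package;
any language with an FFT would serve as well.
The printed product $\Lambda T\,\Lambda F$ hovers around one,
with the exact number depending on which (arbitrary) width definition is used,
which is just the rough-equality reading of (\ref{4-1-1}).
\begin{verbatim}
import numpy as np

def width(a, frac):
    # number of samples on which |a| exceeds frac times its peak
    return np.sum(np.abs(a) >= frac * np.abs(a).max())

for n in (16, 64, 256):
    dt = 1.0 / np.sqrt(n)           # physical mesh chosen so that dt = df
    df = 1.0 / (n * dt)             # mesh relation dt*df = 1/n
    t = (np.arange(n) - n // 2) * dt
    g = np.exp(-np.pi * t ** 2)     # Gaussian, equally wide in both domains
    G = np.fft.fft(g)               # its spectrum is again (nearly) Gaussian
    for frac in (0.5, np.exp(-1)):  # two of many possible width definitions
        LT = width(g, frac) * dt
        LF = width(G, frac) * df
        print("n=%4d  level=%.2f  LT*LF=%.2f" % (n, frac, LT * LF))
\end{verbatim}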
\par The resolution of the paradox is that the physical time axis or the frequency axis length is varying as we change $n$ (even though the plot length is scaled to a constant on the page). We need to associate a physical mesh with the computational mesh. Two methods of associating a physical mesh with a computational mesh were described in chapter \FT\ on page~\pageref{'mesh'}. In real physical space as well as Fourier transform space, the object remains a constant size as the mesh is refined. \par Let us read from Figure~\ref{rand-uncertain} values for the widths $\Lambda F$ and $\Lambda T$. On the top row with $n=16$, I pick a width of about 4 points, and this seems to include about 90\% of the area under the function. So for this signal (with the widths roughly equal in each domain) it seems that $\Lambda T = \sqrt{n} dt$ and $\Lambda F = \sqrt{n}df$. Using the relation between $dt$ and $df$ found in chapter \FT\ equation (\ref{ft.dtdf}) that $dt\,df= 1/n$, the product becomes $\Lambda T \Lambda F$ = 1. \par We could also confirm the inequality (\ref{4-1-1}) by simple functions for which we know the analytic transforms. For example, consider an impulse function in time. Then $\Lambda T = dt$ and the Fourier transform occupies the entire frequency band, i.e.~$\Lambda F = 1/dt$. Thus again the product is $\Lambda T \Lambda F$ = 1. \subsection{Uncertainty principle in physics} \par The inequality (\ref{4-1-1}) gets its name, the ``uncertainty principle,'' from its interpretation in quantum mechanics. Observations of subatomic particles show they behave like waves with spatial frequency proportional to particle momentum. The classical laws of mechanics enable prediction of the future of a mechanical system by extrapolation from the currently known position and momentum. But because of the wave nature of matter with momentum proportional to spatial frequency, such prediction requires simultaneous knowledge of both the location and the spatial frequency of the wave. This is impossible, as we see from (\ref{4-1-1}), hence the word, ``uncertainty.'' \subsection{Gabor's proof of the uncertainty principle} \par Although it is easy to verify the uncertainty principle in many special cases, it is not easy to deduce it. The difficulty begins from finding a definition of the width of a function that leads to a tractable analysis. One possible definition is to use a second moment, that is, to define $\Lambda T$ by \begin{equation} (\Lambda T)^2 \eq { \int \ t^2 \, b(t)^2\, dt \over \int b(t)^2\, dt } \label{rand.2nd_moment} \end{equation} The spectral bandwidth $\Lambda F$ is defined likewise. With these definitions, Dennis Gabor prepared a widely reproduced proof. I'll omit his proof here; it is not an easy proof; it is widely available; and the definition (\ref{rand.2nd_moment}) seems inappropriate for a function we often use, the sinc function, i.e.~the FT of a step function. Since the sinc function drops off as $t^{-1}$, its width $\Lambda T$ defined with (\ref{rand.2nd_moment}) is infinity, which is unlike the more human measure of width, the distance to the first axis crossing. \subsection{My risetime proof of the uncertainty principle} \par In ``\FGDP'' (FGDP) I came up with a proof of the uncertainty principle that is appropriate for causal functions. That proof is included directly below, but I recommend the beginning reader skip over it because it is somewhat lengthy. 
It is included because this book is oriented towards causal functions,
the proof is not well known,
and it is now improved from FGDP.
\par
A similar and possibly more basic concept
than the product of time and frequency spreads
is the relationship between spectral bandwidth
and {\it rise time} of a system response function.
The {\it rise time} $\Lambda T$ of a system response is defined as follows:
when you kick a physical system with an impulse function,
it usually responds by rising rapidly to some maximum level
and then dropping off more slowly towards zero.
The quantitative value of the rise time is somewhat arbitrary;
it is generally taken to be the span between the time of excitation
and the time at which the system response is more than halfway up to its maximum.
\par
{\it Tightness} (nearness to equality) in the inequality (\ref{4-1-1})
is associated with minimum phase.
{\it Slackness} (farness from equality) in the inequality (\ref{4-1-1})
would arise from a filter with an additional all-pass component.
Slackness could also come from a decay time that is more rapid than the rise time.
Slackness could also result from other combinations of rises and falls,
such as random combinations.
Minimum-phase systems generally respond rapidly
compared to the rate at which they later decay.
Focusing our attention on such systems,
we can now seek to derive the inequality (\ref{4-1-1})
applied to rise time and bandwidth.
\par
The first step is to choose a definition for rise time.
I have found a tractable definition of rise time to be
\begin{equation}
{1 \over \Lambda T} \eq
{\int^{\infty}_0 {1 \over t}\, b(t)^2\, dt \over \int^{\infty}_0 b(t)^2\, dt}
\label{4-1-2}
\end{equation}
where $b(t)$ is the response function under consideration.
Equation (\ref{4-1-2}) defines $\Lambda T$ by the first negative moment.
Since this is unfamiliar, consider two examples.
Taking $b(t)$ to be a step function,
the numerator integral diverges, giving the desired $\Lambda T = 0$ rise time.
For another example, take $b(t)^2$ to grow linearly from zero to $t_0$
and then vanish.
Then the rise time $\Lambda T$ is $t_0/2$, again reasonable.
It is curious that $b(t)$ could grow as $\sqrt{t}$,
which rises with infinite slope at $t=0$,
and $\Lambda T$ would not be pushed to 0.
\subsubsection{Proof by way of the dual problem}
\par
Although the $Z$ transform method is a great aid
in studying cases where divergence (as $1/t$) plays a role,
it has the disadvantage that it destroys the formal interchangeability
between the time domain and the frequency domain.
To take advantage of the analytic simplicity of the $Z$ transform
we consider instead the dual to the rise-time problem.
Instead of a signal whose square vanishes identically at negative time,
we have a spectrum $\overline{B} (1/Z) B(Z)$
which vanishes at negative frequencies.
We measure how fast this spectrum can rise after $\omega = 0$.
We will find this to be related
to the time duration $\Lambda T$ of the complex signal $b_t$.
More precisely, we will now define
the lowest significant frequency component $\Lambda F$ in the spectrum
analogously to (\ref{4-1-2}) to be
\begin{equation}
{1 \over \Lambda F} \eq
\int^{\infty}_{-\infty} {1 \over f} \ \overline{B} B\, df \eq
\int^{\infty}_{-\infty} \overline{B} B\, {d \omega \over \omega}
\label{4-1-4}
\end{equation}
where we have assumed the spectrum is normalized,
i.e.~the zero lag of the auto-correlation of $b_t$ is unity.
Now recall the bilinear transform, equation~(\ref{zz.bilin}), which represents $1/(-i\omega)$ as the coefficients of $.5(1+Z)/(1-Z)$, namely, $(\ldots 0, 0, 0, 0.5, 1, 1, 1 \ldots )$. The pole right on the unit circle at $Z = 1$ causes some nonuniqueness. Because $1/i \omega$ is an imaginary odd frequency function, we'll want an odd expression (such as on page~\pageref{'odd_integral'}) to insert into (\ref{4-1-4}) such as \begin{equation} {1 \over -i \omega } \eq {(\cdots - Z^{-2} - Z^{-1} + 0 + Z + Z^2 + \cdots) \over 2} \label{4-1-5} \end{equation} Using limits on the integrals for time-sampled functions and inserting (\ref{4-1-5}) into (\ref{4-1-4}) gives \begin{equation} {1 \over \Lambda F} \eq {-i \over 2\pi} \int^{+\pi}_{-\pi} {1 \over 2} (\cdots -Z^{-2} -Z^{-1} + Z +Z^2 + \cdots )\ \overline{B} \left( {1 \over Z} \right) \, B(Z)\, d \omega \label{4-1-6} \end{equation} Let $s_t$ be the autocorrelation of $b_t$. Since any integral around the unit circle of a $Z$-transform polynomial selects the coefficient of $Z^0$ of its integrand we have \begin{eqnarray} {1 \over \Lambda F} &=& {-i \over 2} \left[ (s_{-1} - s_1) + (s_{-2} - s_2) + (s_{-3} - s_3) + \cdots \right] \eq \sum^{\infty}_{t = 1} - \Im s_t \label{rand.wide} \\ {1 \over \Lambda F} &=& \sum^{\infty}_{t = 1} - \Im s_t \quad \leq \quad \sum^{\infty}_{t = 1} |s_t| \label{4-1-8} \end{eqnarray} The height of the autocorrelation has been normalized to $s_0=1$. The sum in (\ref{4-1-8}) is an integral representing area under the $|s_t|$ function. So the area is a measure of the autocorrelation width $\Lambda T_{\rm auto}$. Thus \begin{equation} {1 \over \Lambda F} \quad \leq \quad \sum^{\infty}_{t = 1} |s_t| \eq \Lambda T_{\rm auto} \label{4-1-9} \end{equation} \par Finally, we must relate the duration of a signal $\Lambda T$ to the duration of its autocorrelation $\Lambda T_{\rm auto}$. Generally speaking, it is easy to find a long signal that has short autocorrelation. Just take an arbitrary short signal and convolve it by a lengthy all-pass filter. Conversely, you can't get a long autocorrelation function from a short signal. A good case would be the autocorrelation of a rectangle function that is a triangle. Nominally, the triangle seems to be twice as long, but considering that the triangle tapers down, it is reasonable to assert the $\Lambda T$'s are the same. So we conclude \begin{equation} \Lambda T_{\rm auto} \label{4-1-10} \quad \leq \quad \Lambda T \end{equation} Inserting into (\ref{4-1-9}) we have the uncertainty relation \begin{equation} \Lambda T \ \Lambda F \quad \geq \quad 1 \label{4-1-11} \end{equation} \par Looking back over the proof I feel the basic time-bandwidth idea is in the {\it equality}~(\ref{rand.wide}) for which I have no concise verbalization. The {\it inequality} arises from $\Lambda T_{\rm auto} < \Lambda T$ which is a simple idea. \exer \item Consider $B(Z) = [1 - (Z/Z_0)^n]/(1 -Z/Z_0)$ in the limit $Z_0$ goes to the unit circle. Sketch the signal and its squared amplitude. Sketch the frequency function and its squared amplitude. Choose $\Lambda F$ and $\Lambda T$. \item A time series made up of two frequencies may be written as $$b_t = A \cos \omega_1 t + B\sin \omega_1 t + C \cos \omega_2 t + D\sin \omega_2 t$$ Given $\omega_1$, $\omega_2$, $b_0$, $b_1$, $b_2$, $b_3$ show how to calculate the amplitude and phase angles of the two sinusoidal components. \exerend \section{FOURIER TRANSFORMS OF RANDOM NUMBERS} \par Many real signals are complicated and barely comprehensible. 
In experimental work, we commonly transform such data.
To better understand what this means,
it will be worthwhile examining signals made from random numbers.
\plot{rand-nrand}{2.75in} {\FIGS/rand/nrand}
{ (rand-nrand) Fourier cosine transforms of vectors containing random numbers.
N is the number of components in the vector.
\label{'rand-nrand'} }
\par
Figure~\ref{rand-nrand} shows discrete Fourier transforms of random numbers.
The basic conclusion from this figure is that
transforms of random numbers look like more random numbers.
\par
A random series containing all frequencies
is called a {\it white} noise series
because the color white is made from roughly equal amounts of all colors.
Any series made up of independently chosen random numbers
is said to be an {\it independent} series.
An independent series must be white,
but a white series need not be independent.
\plot{rand-pad}{1.62in} {\FIGS/rand/pad}
{ (rand-pad) Zero padded random numbers and their Fourier transforms. }
Figure~\ref{rand-pad} shows Fourier transforms of random numbers
surrounded by zeros; in other words, the random numbers have been {\it zero padded.}
Since each vector of random numbers is the same length
(1024 points, including both sides of the even function, with 513 points shown),
the transforms are also the same length.
The top signal has less randomness than the second trace
(16 random numbers versus 64).
Thus the top FT is smoother than the lower ones.
The best way to understand this figure
is to think of the left-hand signal as a frequency function.
When higher frequencies are present, the right-hand signal oscillates faster.
\subsection{Bandlimited noise}
Figure~\ref{rand-shift} shows bursts of 25 random numbers
at various shifts, and their Fourier transforms.
You can think of either side of the figure as the time domain
while the other side is the frequency domain.
(See Chapter \FT\ page~\pageref{'plot_format'}
for a description of the different ways of interpreting plots
of one side of Fourier transform pairs of an even function.)
I like to think of the left side as the Fourier domain
and the right side as the signals.
Then the signals seem to be sinusoids of a constant frequency
(called the {\it center} frequency)
and an amplitude that is modulated at a slower rate
(called the beat frequency).
Observe that the center frequency
is related to the {\it location} of the random bursts
and that the beat frequency
is related to the {\it bandwidth} of the noise burst.
\plot{rand-shift}{1.71in} {\FIGS/rand/shift}
{ (rand-shift) Shifted, zero padded random numbers in bursts of 25 numbers.
\todo{Remake with all bursts identical.} }
\par
You can also think of Figure~\ref{rand-shift}
as having one-sided frequency functions on the left,
and the right side as being the {\it real part} of the signal.
The real parts are cosine-like,
whereas the imaginary parts (not shown) are sine-like,
with the same envelope function.
\par
You might have noticed that the bottom plot in Figure~\ref{rand-shift},
the one whose burst lies at the Nyquist frequency,
seems to have about twice as many beats as the two plots above it.
This can be explained as an edge effect.
The noise burst near the Nyquist frequency is really twice as wide as shown
because it is mirrored about the Nyquist frequency into negative frequencies.
Likewise, the top figure is not modulated at all,
but the signal itself has a frequency that matches the beats on the bottom figure.
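\par
The construction in Figure~\ref{rand-shift} is easy to imitate numerically.
The sketch below (Python with {\tt numpy} assumed;
the burst positions and the random seed are illustrative,
not taken from the figure)
fills a band of 25 Fourier coefficients with random numbers at some shift,
inverse transforms,
and reports the dominant frequency of the real part
together with the number of envelope maxima.
The dominant frequency follows the {\it location} of the burst,
while the count of envelope maxima is set by the 25-point {\it bandwidth}
and stays roughly the same for every shift.
\begin{verbatim}
import numpy as np

n = 1024                          # length, as in the zero-padded examples
rng = np.random.default_rng(1)

def burst_signal(shift, width=25):
    # a band of random numbers at 'shift', taken as the frequency domain
    spec = np.zeros(n)
    spec[shift:shift + width] = rng.standard_normal(width)
    return np.fft.ifft(spec)      # complex signal; real part is cosine-like

for shift in (0, 100, 300):
    x = burst_signal(shift)
    env = np.abs(x)               # the beat envelope
    # dominant frequency of the real part follows the burst location
    k = np.argmax(np.abs(np.fft.fft(x.real))[:n // 2])
    # envelope maxima are set by the bandwidth, not by the shift
    beats = np.sum((env[1:-1] > env[:-2]) & (env[1:-1] > env[2:]))
    print("shift=%4d  dominant frequency=%4d  envelope maxima=%3d"
          % (shift, k, beats))
\end{verbatim}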
\section{TIME-STATISTICAL RESOLUTION} \footnote{Corrections and clarifications were made in this section by Gilles Darche. } If you flipped a fair coin 1000 times, it is unlikely that you would get exactly 500 ``heads'' and 500 ``tails.'' More likely the head count would lie somewhere between 400 and 600. Or would it lie in another range? The theoretical value, called the {\it mean} or the {\it expectation} is 500. The value from your experiment is called the {\it sample mean}. How much difference $\Lambda m$ should we expect between the sample mean $\hat m$ and the true mean $m$? Both the coin flips $x$ and your sample mean $\hat m$ are {\it random variables}. Your coin flip experiment could be repeated many times generally giving a different result each time. This concept will be formalized as {\it the variance of the sample mean} which is the expected squared difference between the true mean and the mean of your sample. \par The problem of estimating the statistical parameters of a time series, such as its mean, also appears in seismic processing. Effectively, we deal with seismic traces of finite duration, extracted from infinite sequences whose parameters can only be estimated from the finite set of values available in these seismic traces. Since the knowledge of these parameters, such as signal/noise ratio, can play an important role during the processing, it can be useful not only to estimate them, but also to have an idea of the error made in this estimation. \subsection{Ensemble} The ``true value'' of the mean could be defined by flipping the coin $n$ times and conceiving of $n$ going to infinity. A more convenient definition of ``true value'' is that the experiment is imagined as having been done separately under identical conditions by an infinite number of people (an ensemble). The {\it ensemble} may seem a strange construction of the imagination, nonetheless much literature in statistics and in the natural sciences uses the idea of an ensemble. Let's say that the ensemble is defined by all the possible values the variable can take at one particular time, associated with a probability of occurrence. Then, the ensemble idea enables us to define a time-variable mean (the sum of these values weighted by the probabilities), for example for coins that change with time. \subsection{Expectation and Variance} A conceptual average over the ensemble, called an expectation, is denoted by the symbol $\E$. The index for summation over the ensemble is never shown explicitly; every random variable is presumed to have one. Thus, the true mean at time $t$ is defined as ${m_x}(t)=\E(x_t)$. The mean can vary with time: \begin{equation} m_x(t) \eq \E[x(t)] \label{4-2-2} \end{equation} Likewise the {\it power} is \begin{equation} p_x(t) \eq \E[x(t)^2] \end{equation} The {\it variance} $\sigma^2$ is defined to be the power after the mean is removed, i.e. \begin{equation} \sigma_x(t)^2 \eq \E\, [x(t) - m_x(t)]^2 \label{4-2-4} \end{equation} Conventionally, $\sigma^2$ is called the variance, and $\sigma$ is called the {\it standard deviation}. \par For notational convenience it is customary to write $m(t)$, $p(t)$, $\sigma(t)$, and $x(t)$ simply as $m$, $p$, $\sigma$, and $x_t$ using the verbal context to specify whether $m$, $p$, and $\sigma$ are time variable or constant. For example, the standard deviation of the seismic amplitudes on a seismic trace before correction of spherical divergence decreases with time, since these amplitudes are expected to be ``globally'' smaller as time goes on. 
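\par
The ensemble is easy to imitate on a computer:
generate many independent realizations of the same process
and average across the realizations at each time $t$.
The sketch below (Python with {\tt numpy} assumed;
the decaying-amplitude model is only an illustration
in the spirit of the spherical-divergence example, not data from this book)
estimates a time-variable $m_x(t)$ and $\sigma_x(t)$ by ensemble averaging.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
nt, nens = 100, 5000                   # time samples, ensemble members
t = np.arange(nt)
true_sigma = np.exp(-t / 50.0)         # illustrative amplitude decay
x = true_sigma * rng.standard_normal((nens, nt))  # ensemble of realizations

m_hat = x.mean(axis=0)                 # m_x(t), averaged over the ensemble
s_hat = np.sqrt(((x - m_hat) ** 2).mean(axis=0))  # sigma_x(t) about that mean

print(np.round(true_sigma[::25], 3))   # the time-variable standard deviation ...
print(np.round(s_hat[::25], 3))        # ... recovered to within ensemble scatter
\end{verbatim}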
\par When manipulating algebraic expressions the symbol $\E$ behaves like a summation sign, namely \begin{equation} \E \quad \equiv \quad (\lim N \rightarrow \infty) \quad {1 \over N} \sum^N_1 \label{4-2-5} \end{equation} Note that the summation index is not given, since the sum is over the ensemble, not time. To get some practice with the expectation symbol $\E$, we can reduce equation~(\ref{4-2-4}) \begin{equation} \sigma_x^2 \eq \E\, [(x_t - m_x)^2] \eq \E(x_t^2) \ - \ 2 m_x \E(x_t) + m_x^2 \eq \E(x_t^2) \ - \ m_x^2 \end{equation} which says the energy is the variance plus the squared mean. \subsection{Sample mean} Now let $x_t$ be a time series made up from identically distributed random numbers: $m_x$ and $\sigma_x$ do not depend on time. Let's also suppose that they are {\it independently} chosen; this means in particular that for any different times $t$ and $s$ ($t\neq s$): \begin{equation} \E(x_tx_s) \eq \E(x_t) \E(x_s) \end{equation} Suppose we have a sample of $n$ points of $x_t$ and are trying to determine the value of $m_x$. We could make an estimate $\hat m_x$ of the mean $m_x$ with the formula \begin{equation} \hat m_x \eq {1 \over n} \sum^n_{t = 1} x_t \label{4-2-6} \end{equation} \par A somewhat more elaborate method of estimating the mean would be to take a weighted average. Let $w_t$ define a set of weights normalized so that \begin{equation} \sum w_t \eq 1 \label{4-2-7} \end{equation} With these weights the more elaborate estimate $\hat{m}$ of the mean is \begin{equation} \hat m_x \eq \sum w_t \, x_t \label{4-2-8} \end{equation} Actually (\ref{4-2-6}) is just a special case of (\ref{4-2-8}) where the weights are $w_t = 1/n$. \par Further, the weights could be {\it convolved} on the random time series, to compute {\it local} averages of this time series, thus smoothing it. The weights are simply a filter response where the filter coefficients happen to be positive and cluster together. An example in Figure~\ref{rand-walk} shows a random walk function with itself smoothed locally. \plot{rand-walk}{1.96in}{\FIGS/rand/walk}{(rand-walk) Random walk and itself smoothed (and shifted downward). } \subsection{Variance of the sample mean} Our objective here is to calculate how far the estimated mean $\hat{m}$ is likely to be from the true mean $m$ for a sample of length $n$. 
This departure is called the {\it variance of the sample mean},
and it is given by $(\Lambda m)^2=\sigma_{\hat m}^2$, where
\begin{eqnarray}
\sigma_{\hat m}^2 &= & \E\, [(\hat{m} - m)^2] \label{4-2-9} \\
	&= & \E\, \left\{ [(\sum w_t x_t) - m]^2 \right\} \label{4-2-10}
\end{eqnarray}
Now use the fact that $m = m\sum w_t = \sum w_t m$:
\begin{eqnarray}
\sigma_{\hat m}^2 &= &\E\, \left\{
	\left[\sum_t w_t (x_t - m) \right]^2 \right\} \label{4-2-11}\\
	&= &\E\, \left\{
	\left[\sum_t w_t (x_t - m) \right] \,
	\left[\sum_s w_s (x_s - m) \right] \right\} \label{4-2-12}\\
	&= &\E\, \left[\sum_t \sum_s w_t w_s (x_t - m)(x_s - m) \right]
	\label{4-2-13}
\end{eqnarray}
The step from (\ref{4-2-12}) to (\ref{4-2-13}) follows because
\begin{equation}
(a_1+a_2+a_3)\, (a_1+a_2+a_3) \eq {\rm sum \; of}
\quad
\left[
\begin{array}{ccc}
a_1 a_1 & a_1 a_2 & a_1 a_3 \\
a_2 a_1 & a_2 a_2 & a_2 a_3 \\
a_3 a_1 & a_3 a_2 & a_3 a_3
\end{array}
\right]
\end{equation}
The expectation symbol $\E$ may be regarded as another summation
that can be done after, as well as before, the sums on $t$ and $s$, so
\begin{equation}
\sigma_{\hat m}^2 \eq
\sum_t \sum_s w_t \, w_s\, \E\, \left[ (x_t - m)(x_s - m) \right]
\label{4-2-14}
\end{equation}
If $t\neq s$, then, because $x_t$ and $x_s$ are independent of each other,
the expectation $\E[(x_t - m)(x_s - m)]$ vanishes.
If $s = t$, then the expectation is the variance defined by (\ref{4-2-4}).
Expressing the result in terms of the Kronecker delta, $\delta_{ts}$
(which equals unity if $t=s$, and vanishes otherwise) gives
\begin{eqnarray}
\sigma_{\hat m}^2 &= & \sum_t \sum_s w_t \, w_s \, \sigma_x^2 \delta_{ts}
	\label{4-2-15}\\
\sigma_{\hat m}^2 &= & \sum_t w^2_t \, \sigma_x^2 \label{4-2-16} \\
\sigma_{\hat m} &= & \sigma_x \ \sqrt{ \sum_t w^2_t } \label{4-2-17}
\end{eqnarray}
For $n$ weights, each of size $1/n$,
the standard deviation of the sample mean is
\begin{equation}
\Lambda m_x \eq \sigma_{\hat m_x} \eq
\sigma_x \ \sqrt{ \sum^n_{t=1} \left( {1 \over n}\right)^2 } \eq
{\sigma_x \over \sqrt{n} }
\label{rand.vsm}
\end{equation}
This is the most important property of random numbers
that is not intuitively obvious.
Informally, the result (\ref{rand.vsm}) says this:
Given a sum $y$ of terms with random polarity, whose theoretical mean is $0$:
\begin{equation}
y \eq \underbrace{\pm 1 \pm 1 \pm 1 \cdots}_{n\ {\rm terms}}
\end{equation}
the sum $y$ is a random variable
whose standard deviation is $\sigma_y=\sqrt{n}=\Lambda y$.
An experimenter who doesn't know the mean is zero
would report the mean of $y$ is $\E (y) = \hat y \pm \Lambda y$,
where $\hat y$ is the experimental value.
\subsection{Dilemma in estimating a time-variable mean}
If you are trying to estimate the mean of a random series
that has a time-variable mean,
then you face a basic dilemma.
Including many numbers in the sum to get $\Lambda m$ to be small
conflicts with the possibility of seeing $m_t$ change during the measurement.
Let the signal be multiplied by a window function;
within this window we will estimate the mean of the signal.
Let $\Lambda T = n\,dt$ be the length of the window.
Then (\ref{rand.vsm}) becomes
\begin{equation}
{(\Lambda m)^2 \over \sigma^2 }\, {\Lambda T \over dt} \eq 1
\label{4-2-19} \label{rand.meanfluct}
\end{equation}
It is desirable to have both $\Lambda m$ and $\Lambda T$ as small as possible.
You see that the fluctuation in your estimation of the mean
drops off only as the square root of the sample size.
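\par
Equation (\ref{rand.vsm}) is easy to verify by simulation,
and doing so also answers the coin question that opened this section.
The sketch below (Python with {\tt numpy} assumed)
repeats the 1000-flip experiment over a large ensemble:
the observed scatter of the sample mean matches $\sigma_x/\sqrt{n}$,
and a head count outside the 400--600 range essentially never occurs.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, nens = 1000, 5000                        # flips per experiment, experiments
flips = rng.integers(0, 2, size=(nens, n))  # 0 = tails, 1 = heads
heads = flips.sum(axis=1)                   # head count of each experiment

sigma_x = 0.5                               # standard deviation of one flip
obs = flips.mean(axis=1).std()              # scatter of the sample mean
print("observed scatter :", round(float(obs), 4))
print("sigma_x/sqrt(n)  :", round(float(sigma_x / np.sqrt(n)), 4))
print("outside 400-600  :", int(np.sum((heads < 400) | (heads > 600))))
\end{verbatim}
\par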
Equation~(\ref{rand.meanfluct}) depends on the assumption
that successive samples of the time series are independent.
If they are not independent,
I imagined that things would always be worse,
and the ``$=$'' in equation~(\ref{rand.meanfluct})
would be replaced by a ``$\ge$''.
This turns out to be overly pessimistic.
In the unusual case that successive samples of the signal are anticorrelated,
the ``$=$'' becomes a ``$\le$'',
as shown in the following proof by Gilles Darche.
Observe what happens when the samples $x_t$ and $x_s$
are {\it not} independent.
Suppose the weights $w_t$ are uniform ($w_t=1/n$).
Following (\ref{4-2-14}) and (\ref{rand.vsm}):
\begin{eqnarray}
(\Lambda m)^2 &=& {\sigma^2\over n} +
	{1\over n^2}\sum\sum_{t\neq s}\E[(x_t-m)(x_s-m)] \\
	&=& {\sigma^2\over n} +
	{2\over n^2}\sum_{t=1}^{n-1}\sum_{s=t+1}^n\E[(x_t-m)(x_s-m)] \\
	&=& {\sigma^2\over n} +
	{2\over n^2}\sum_{t=1}^{n-1}\sum_{p=1}^{n-t}\E[(x_t-m)(x_{t+p}-m)]
\end{eqnarray}
Suppose $x_t$ is {\it stationary}:
this implies that $\E[(x_t-m)(x_{t+p}-m)]$ depends only on $p$, and not on $t$.
This is the autocorrelation of $x_t$ at lag $p$, which we write $s_p$.
So:
\begin{equation}
(\Lambda m)^2 \eq {\sigma^2\over n} +
	{2\over n^2}\sum_{t=1}^{n-1}\sum_{p=1}^{n-t}s_p \eq
	{\sigma^2\over n} + {2\over n^2}\sum_{p=1}^{n-1}\sum_{t=1}^{n-p}s_p
\end{equation}
Finally:
\begin{equation}
(\Lambda m)^2 \eq {\sigma^2\over n} + {2\over n^2}\sum_{p=1}^{n-1}(n-p)s_p
\label{sm-cor}
\end{equation}
A correction term appears, which can be positive or negative.
So $(\Lambda m)^2$ can indeed be larger or smaller than $\sigma^2/n$.
For example, with $n=2$, $(\Lambda m)^2=\sigma^2/2 + s_1/2$
will be smaller than $\sigma^2/2$
if the sequence $x_t$ is anticorrelated ($s_1<0$).
%\subsection {Sample variance}
%We have seen how to estimate the mean.
%To estimate the variance, we compute the {\it sample variance:}
%\begin{equation}
%{\hat \sigma^2} \eq
%{(x_1-{\hat m})^2+\cdots+(x_n-{\hat m})^2 \over n} \eq
%{\sum_{t=1}^n(x_t-{\hat m})^2\over n}
%\end{equation}
%For the mean we use the estimated mean
%${\hat m}$ since we do not know the true mean $m$.
%What is the error made in estimating the variance?
%Gilles--- I'm concerned that if all we are doing is computing
%a little bias, it won't have any experimental interest.
%People would just multiply the result by the trivial
%adjustment $n/(n-1)$.
%
%In fact, we don't need
%in first approximation to work on squared errors
%(as for the variance of the sample mean),
%because a bias will appear between ${\hat \sigma^2}$ and
%the true $\sigma^2$: the expected value of ${\hat \sigma^2}$ is not $\sigma^2$.
%Effectively:
%\begin{equation}
%\E[{\hat \sigma^2}] \eq
%{1\over n}\sum_{t=1}^n\E[(x_t-{\hat m})^2]
%\end{equation}
%Decomposing $(x_t-{\hat m})$ into $(x_t-m)$ and $(m-{\hat m})$, we get:
%\begin{eqnarray}
%\E[(x_t-{\hat m})^2] &=& \E[(x_t-m)^2+2(x_t-m)(m-{\hat m})+(m-{\hat m})^2] \\
%	&=& \sigma^2 + 2\E[(x_t-m)(m-{\hat m})] + {\sigma^2\over n}
%\end{eqnarray}
%Now:
%\begin{equation}
%\E[(x_t-m)(m-{\hat m})] \eq \E[(x_t-m){(m-x_1)+\cdots+(m-x_n)\over n}]
%\end{equation}
%Remembering that we suppose the $x_t$ independent of each other, we get:
%\begin{equation}
%\E[(x_t-m)(m-{\hat m})] \eq -{\E[(x_t-m)^2]\over n} \eq -{\sigma^2\over n}
%\end{equation}
%Finally:
%$$\E[(x_t-{\hat m})^2] \eq \sigma^2 -{\sigma^2\over n}$$
%\begin{equation}
%\E[{\hat \sigma^2}] \eq \sigma^2 - {\sigma^2\over n}
%\end{equation}
%
%So ${\hat \sigma^2}$ is expected to be smaller than $\sigma^2$,
%with an error of ${\sigma^2/ n}$.
%In the case of a 0-mean series, $\sigma^2=\E[x_t^2]=p$, %and we can write this result as: %\begin{equation} %{|\Lambda p|\over p}={1\over n} %\end{equation} %Once again, we have a dilemma between the accuracy of the estimation, %and the number of samples we have to use in the computation. %If the variance varies with time, %we want to use a small $n$, but we know %the accuracy of the computation will be poor. %We can as well rewrite this result as: %\begin{equation} %{|\Lambda p|\over p}{\Lambda T\over dt}=1 %\label{4-2-25} %\end{equation} % \subsection{Variance of the sample variance} \par The {\it variance of the sample variance} arises in many contexts. Suppose you want to measure the storminess of the ocean. You measure water level as a function of time and subtract the mean. The storminess is the variance about the mean. You measure the storminess in one minute and call it a sample storminess. You compare it to other minutes and other locations and you find they are not all the same. To characterize these differences, we need the {\it variance of the sample variance} $\sigma_{\hat \sigma^2}^2$: once we have an estimation of the variance (the sample variance), we compute the error in this estimation. \par Given a sample of a random time series $x_t$, the true variance of $x_t$ is $\sigma^2_x$, with: \begin{equation} \sigma^2_x \eq \E[(x_t-m)^2] \end{equation} We have an {\it estimate} of $\sigma_x$ with: \begin{equation} \hat \sigma_x^2 \eq {(x_1-m)^2+\cdots+(x_n-m)^2\over n} \end{equation} (This sample variance $\hat \sigma_x^2$ is computed using the true mean $m$. You can get a slightly different result using $\hat m$). We would like to know the error we make in this estimation, i.e. the variance of the sample variance. We define another series by $y_t = (x_t-m)^2$. Notice that ${\hat \sigma_x^2}$ is actually the {\it sample mean} of the series $y_t$, whereas $\sigma_x^2$ is its true mean. Thus we apply equation~(\ref{rand.vsm}) to $y$, supposing the $y_t$ (i.e. $x_t$) independent: \begin{equation} \sigma_{\hat \sigma_x^2} \ = \ \Lambda m_y \ = \ {\sigma_y \over \sqrt{n}} \label{rand.vsv} \end{equation} Notice that: \begin{equation} \sigma_y^2 \eq \E[y_t^2] - (\E[y_t])^2 \eq \E[(x_t-m)^4] - \sigma_x^4 \label{rand.fourth} \end{equation} Thus to know the fluctuation in $\sigma^2$ we need to know $\sigma^4$, and to get the fluctuation in that we need to know... I know of only two ways to escape this infinite regress. First we could simply measure (\ref{rand.fourth}) on our data sample. Alternately we could assume something about probability functions. For example, if the data is known to be drawn from a Gaussian probability function then it can be shown that $\E[(x_t-m)^4]=3\sigma_x^4$. For any probability function there is a fixed ratio between the second and fourth moments. Proceeding with the Gaussian assumption we have \begin{equation} \sigma_{\hat \sigma_x^2} \ \geq \ {\sigma_x^2\over \sqrt{n}} \end{equation} Choosing to work with $n$ points from a signal (which is theoretically of infinite duration) we have $\Lambda T =n\, dt$. If the sequence is zero mean, we can rewrite the previous inequality in terms of the power $p_x$, since $\sigma_x^2=p_x$. \begin{equation} \left( {\Lambda p_x \over p_x} \right)^2 {\Lambda T \over dt} \quad \geq 1 \label{4-2-25} \end{equation} Darche again points out that if the samples $x_t$ were {\it not} independent, the inequality~(\ref{4-2-25}) would be even more loose. 
To see this, recall equation~(\ref{sm-cor})
and apply it to the sequence $y_t=(x_t-m)^2$:
\begin{equation}
(\Lambda m_y)^2 \eq {\sigma_y^2\over n} + {2\over n^2}\sum_{p=1}^{n-1}(n-p)s_p
\end{equation}
where the $s_p$ are the autocorrelations of the sequence $y_t$.
But the autocorrelations of $y$ are generally positive
(for Gaussian $x_t$, the autocorrelation of $y$ at lag $p$
is proportional to the square of the autocorrelation of $x$ at lag $p$).
Therefore now:
\begin{equation}
\Lambda m_y \ > \ {\sigma_y \over \sqrt{n}}
\end{equation}
which can be compared to equation~(\ref{rand.vsv}).
Anyway, in both cases (independent or not), we are faced with the same dilemma:
if we want an accurate estimation of the variance,
we need a large number of samples,
which conflicts with the possibility of a time-varying variance.
%\subsubsection{Variance of the sample variance ad infinitum}
%\par
%Another more esoteric occurrence of the variance of the sample variance
%arises when you get fussy about error bounds.
%In a scientific paper you may find
%\begin{eqnarray}
%m &= & \hat{m} \pm \Lambda m \label{4-2-21} \\
%  &= & \hat{m} \pm \sigma/\sqrt{n} \label{4-2-22}
%\end{eqnarray}
%but since the variance $\sigma^2$ often is not known either,
%it is necessary to use the estimated $\hat{\sigma}$,
%that is
%\begin{equation}
%m \eq \hat{m} \pm \hat \sigma/\sqrt{n} \label{4-2-23}
%\end{equation}
%Of course (\ref{4-2-23}) really is not right
%because we really should add something
%to show that we use an estimation of $\sigma$,
%so that we have an additional uncertainty
%from error in $\hat{\sigma}$: we should include
%the variance of the sample variance. As we saw,
%we should have to estimate $\E(x_t^4)$, so again
%including another error...
%At this stage practical people generally turn their
%attention elsewhere leaving the statisticians
%who might compute the result from a presumed probability function,
%and the relation (\ref{4-2-23}) can be considered as sufficient.
%
\exer
\item Suppose the mean of a sample of random numbers
is estimated by a triangle weighting function, i.e.,
$$\hat{m} \eq s \sum^n_{i = 0} (n - i)\, x_i $$
Find the scale factor $s$ so that $\E(\hat{m}) = m$.
Calculate $\Lambda m$.
Define a reasonable $\Lambda T$.
Examine the uncertainty relation.
\item A random series $x_t$ with a possibly time-variable mean
may have the mean estimated by the feedback equation
$$\hat{m}_t \eq (1 - \epsilon) \hat{m}_{t-1} + bx_t$$
\begin{description}
\item[a.] Express $\hat{m}_t$ as a function of $x_t, x_{t-1}, \ldots ,$
and not $\hat{m}_{t - 1}$.
\item[b.] What is $\Lambda T$, the effective averaging time?
\item[c.] Find the scale factor $b$ so that if $m_t = m$,
then $\E(\hat{m}_t) = m.$
\item[d.] Compute the random error
$\Lambda m = \sqrt{\E(\hat{m} - m)^2 } $.
[{\sc hint:} $\Lambda m$ goes to $\sigma \sqrt{\epsilon/2 }$
as $\epsilon \rightarrow 0$].
\item[e.] What is $(\Lambda m)^2 \Lambda T$ in this case?
\end{description}
%\item Show that
%$$(\Lambda P)^2 = {1 \over n} [\E(x^4) - \sigma^4]$$
\exerend
\section{SPECTRAL FLUCTUATIONS}
Recall the basic model of time series analysis,
namely random numbers passing through a filter.
A sample of input, filter, and output amplitude spectra
is shown in Figure~\ref{rand-filter}.
\sideplot{rand-filter}{1.85in} {\FIGS/rand/filter}
{(rand-filter) Random numbers into a filter.
Top is spectrum of random numbers.
Middle is spectrum of filter.
Bottom is spectrum of filter output. }
From the spectrum of the output we can guess the spectrum of the filter,
but the figure shows there are some limitations in our ability to do so.
Let us analyze this formally.
\par
Observations of sea level for a long period of time
can be summarized in terms of a few statistical averages,
such as the mean height $m$ and the variance $\sigma^2$.
Another important kind of statistical average
for use on such geophysical time series is the {\it power spectrum.}
Many mathematical models explain only such statistical averages of data
and not the data themselves.
To recognize certain pitfalls and understand certain fundamental limitations
on work with power spectra,
we first consider the idealized example of random numbers.
\plot{rand-auto}{2.15in} {\FIGS/rand/auto}
{ (rand-auto) Autocorrelation and spectrum of random numbers. }
\par
Figure~\ref{rand-auto} shows a signal that is a burst of noise,
its Fourier transform, the transform squared,
and its inverse transform, the autocorrelation.
Here the FT squared is the same
as the more usual FT times its complex conjugate,
because the noise-burst signal is even, so its FT is real.
\par
Notice that the autocorrelation has a big spike at zero lag.
This spike represents the correlation of the random numbers with themselves.
The other lags are much smaller.
They represent the correlation of the noise burst with itself shifted.
Theoretically, the noise burst is not correlated with itself shifted;
these small fluctuations result from the finite extent of the noise sample.
\par
Imagine many other copies of Figure~\ref{rand-auto}.
Ensemble averaging would mean adding these other autocorrelations,
or equivalently adding these other spectra.
The fluctuations beside the central lobe of the autocorrelation
would be destroyed by ensemble averaging.
The fluctuations in the spectrum would be smoothed out by ensemble averaging.
The {\it expectation} of the autocorrelation is that it is an impulse at zero lag.
The {\it expectation} of the spectrum is that it is a constant, namely
\begin{equation}
\E[\hat S(Z)] \eq S(Z) \eq {\rm const}
\end{equation}
\subsection{Paradox: $n \rightarrow \infty$ versus the ensemble average}
\par
Now for the paradox.
Imagine $n \rightarrow \infty$ in Figure~\ref{rand-auto}.
Will we see the same limit as the ensemble average?
Here are two contradictory points of view:
\begin{itemize}
\item
For increasing $n$,
the fluctuations on the nonzero autocorrelation lags get smaller,
so the autocorrelation should tend to an impulse function.
Its Fourier transform, the spectrum, should tend to a constant.
\item
But, for increasing $n$,
as in Figure~\ref{rand-nrand} on page~\pageref{'rand-nrand'},
the spectrum does not get any smoother,
because the FTs should still look like random noise.
\end{itemize}
We will see that the first idea contains a false assumption.
The autocorrelation does tend to an impulse,
but the fuzz around the sides can't be ignored---although the fuzz
tends to zero amplitude, it also tends to infinite extent,
and the product of zero with infinity here
tends to have the same energy as the central impulse.
\par
To examine this further,
let's see how these autocorrelations decrease to 0
with $n$ (the number of samples).
Figure~\ref{rand-fluct} shows the autocorrelation samples
as a function of $n$, with $n$ increasing in steps by factors of four.
Thus $\sqrt{n}$ increases by factors of two.
\plot{rand-fluct}{2.48in} {\FIGS/rand/fluct}
{ (rand-fluct) Autocorrelation as a function of number of data points.
The random noise series (even) lengths are 60, 240, 960. }
Each autocorrelation in the figure was normalized at zero lag.
We see the fluctuations at the nonzero lags of the autocorrelation
getting smaller as $n$ grows.
We also observe that the ratios between the values for the first nonzero lags
and the value at lag 0 roughly fit $1/\sqrt{n}$.
Notice also that the fluctuations drop off with lag.
The drop-off goes to zero at a lag equal to the sample length.
This is because the number of terms in the autocorrelation
diminishes to zero at that lag.
A first impression is that the autocorrelation fits a triangular envelope.
More careful inspection shows that the triangle bulges upward at wide offsets,
or large values of $k$ (slightly clearer on Figure~\ref{rand-auto}).
Let's explain all these observations.
Each lag of the autocorrelation is defined as:
\begin{equation}
s_k \eq \sum_{t=1}^{n-k}x_tx_{t+k}
\end{equation}
where $(x_t)$ is a sequence of 0-mean {\it independent} random variables.
Thus, the expectations of the autocorrelations can be easily computed:
\begin{eqnarray}
\E(s_0) &\eq& \sum_1^n\E(x_t^2) \eq n\sigma^2_x \\
\E(s_k) &\eq& \sum_1^{n-k}\E(x_t)\E(x_{t+k}) \eq 0.
	\mbox{\hspace{0.5cm}(for $k\geq 1$)}
\end{eqnarray}
In Figure~\ref{rand-fluct},
the value at lag 0 is more or less $n\sigma^2_x$ (before normalization),
the deviation being more or less the {\it standard deviation}
(square-root of the variance) of $s_0$.
On the other hand, for $k\geq 1$, as $\E(s_k)=0$,
the value of the autocorrelation is directly the deviation of $s_k$,
i.e.~something close to its standard deviation.
We now have to compute the variances of the $s_k$.
Let's write:
\begin{equation}
s_k \eq \sum_{t=1}^{n-k}y_k(t)
	\mbox{\hspace{1.0cm}(where $y_k(t) = x_tx_{t+k}$)}
\end{equation}
So: $s_k=(n-k){\hat m_{y_k}}$,
where ${\hat m_{y_k}}$ is the sample mean of $y_k$ with $n-k$ terms.
If $k\neq 0$, $\E(y_k)=0$, and we apply (\ref{rand.vsm}) to ${\hat m_{y_k}}$:
\begin{equation}
\E ({\hat m_{y_k}}^2) \eq {\sigma^2_{y_k} \over n-k}
\end{equation}
The computation of $\sigma^2_{y_k}$ is straightforward:
\begin{equation}
\sigma^2_{y_k} \eq \E(x^2_tx^2_{t+k}) \eq \E(x_t^2)\E(x_{t+k}^2) \eq \sigma^4_x \;,
\end{equation}
So, for the autocorrelation $s_k$:
\begin{equation}
\E(s_k^2) \eq (n-k)\sigma^2_{y_k} \eq (n-k)\sigma^4_x \eq
	{n-k\over n^2}(\E(s_0))^2
\end{equation}
Finally, as $\E(s_k)=0$, we get:
\begin{equation}
\sigma_{s_k} \eq \sqrt{\E(s_k^2)} \eq \E(s_0) \ {\sqrt{n-k} \over n}
\label{rand.vsvcor}
\end{equation}
This final result explains the properties observed in Figure~\ref{rand-fluct}.
As $n\rightarrow\infty$, all the nonzero lags tend to 0 compared to the 0-lag,
since $\sqrt{n-k}/n$ tends to 0.
Then, the first lags ($k \ll n$) have nearly equal standard deviations,
about $\E(s_0)/\sqrt{n}$,
while the envelope $\sqrt{n-k}/n$ accounts for the roughly triangular appearance,
together with its slight upward bulge, at the larger lags.
The same fluctuations are present in the spectrum,
which is the Fourier transform of the autocorrelation.
A raw spectral estimate made from a sample of duration $\Lambda T$
fluctuates by roughly $\Lambda p/p \approx 1$ at each frequency;
smoothing that estimate over a bandwidth $\Lambda F$
averages about $\Lambda F\,\Lambda T$ roughly independent values,
so the relative fluctuation of the smoothed spectral estimate obeys
\begin{equation}
\Lambda F \ \Lambda T \, \left( {\Lambda p \over p} \right)^2
\quad \geq \quad 1
\label{4-3-9}
\end{equation}
We find this idea on Figure~\ref{rand-taper}.
I said earlier that $\Lambda F(\Lambda p/p)^2$ should be a constant.
Actually:
\begin{equation}
\Lambda F \sim 1/\Lambda T_{\rm auto}
\mbox{\hspace{0.5cm}and\hspace{0.5cm}}
\left({\Lambda p\over p}\right)^2 \sim {\Lambda T_{\rm auto}\over\Lambda T}
\end{equation}
So we find in this case that:
\begin{equation}
\Lambda F \, \Lambda T\, \left( {\Lambda p \over p}\right)^2 \ \sim 1
\end{equation}
\section{CROSSCORRELATION AND COHERENCY}
With two time series we can see how they are related.
\subsection{Correlation}
Correlation is a concept similar to cosine.
A cosine measures the angle between two vectors.
It is given by the dot product of the two vectors divided by their magnitudes \begin{equation} c \eq { ({\bf x} \cdot {\bf y}) \over \sqrt{ ({\bf x} \cdot {\bf x}) ({\bf y} \cdot {\bf y}) } } \end{equation} which is the sample normalized correlation we first encountered on page~\pageref{'normalized_correlation'} as a quality measure of fitting one image to another. \par Formally, the normalized correlation is defined with $x$ and $y$ being zero-mean scalar random variables instead of sample vectors. So the summation is an expectation instead of a dot product. \begin{equation} c \eq {\E(xy) \over \sqrt{ \E(x^2)\, \E(y^2) } } \end{equation} \par There is a practical difficulty when the ensemble averaging is simulated over a sample. The problem arises with small samples and is most dramatically illustrated for a sample with only one element. Then the sample correlation is \begin{equation} \hat{c} \eq {xy \over |x| \, |y|} \eq \pm 1 \end{equation} regardless of what value the random number $x$ or the random number $y$ should take. For any $n$, the sample correlation $\hat{c}$ scatters away from zero. Such scatter is called {\it bias.} The topic of bias and variance of coherency estimates is a complicated one, but a rule of thumb seems to be to expect bias and variance of $\hat{c}$ about $1/\sqrt{n}$ for samples of size $n$. Bias, no doubt, accounts for many false ``discoveries'' since cause-and-effect is often inferred from correlation. \subsection{Coherency} The concept of {\it coherency} in time-series analysis is analogous to correlation. Taking $x_t$ and $y_t$ to be time series, they may have a mutual relationship that depends on time-delay, scaling, or even filtering. For example, perhaps $Y(Z) = F(Z) X(Z) + N(Z)$ where $F(Z)$ is a filter and $n_t$ is unrelated noise. The generalization of the correlation concept is to define {\it coherency} by \begin{equation} C \eq {\E\left[ X \left( {1 \over Z}\right) \, Y(Z) \right] \over \sqrt{\E(\overline {X} X)\, \E(\overline {Y} Y) } } \end{equation} \par {\it Correlation} is a real scalar. {\it Coherency} is a complex function of frequency and it expresses the frequency dependence of correlation. In forming an estimate of coherency it is always essential to simulate ensemble averaging. Note that if the ensemble averaging were to be omitted, the coherency (squared) calculation would give \begin{equation} |C|^2 \eq \overline{C} C \eq {(\overline{\overline{X} Y}) (\overline{X} Y) \over (\overline{X} X) (\overline{Y} Y)} \eq {\rm const}(\omega) \end{equation} which states that the coherency squared is unity independent of the data. Because correlation scatters away from zero we find that coherency squared is biased away from zero. \subsection{Bispectrum} The {\it bispectrum} is another statistic that is used to search for nonlinear interactions. For a Fourier transform $F(\omega)$ it is defined by \begin{equation} B(\omega_1,\omega_2) \eq \E [ F(\omega_1) F(\omega_2) \overline{F(\omega_1+\omega_2)}] \end{equation} Another statistic defined analogously is the {\it bispectral coherency}. In seismology, signals rarely have adequate duration for time averages to make sensible bispectral estimates. \section{SMOOTHING IN TWO DIMENSIONS} In previous sections we assumed one-dimensional models, and that you could easily do the smoothing. In two dimensions things are nominally much more costly, but a few things have tricks, so they can be easy too. Here I tell you my favorite trick for smoothing in two dimensions. 
You can convolve with a two-dimensional (almost) Gaussian weighting function
{\it of any area}
for a cost of only sixteen additions per output point.
(You might expect a cost proportional to the area.)
\subsection{Tent smoothing}
\par
First recall triangular smoothing in one dimension
documented by subroutine {\tt triangle()} on page~\pageref{triangle}.
This routine is easily adapted to two dimensions.
First you smooth in the direction of the 1-axis for all values of the 2-axis.
Then you do the reverse, convolve on the 2-axis for all values of the 1-axis.
Now recall that smoothing with a rectangle is especially fast,
because you do not need to add all the points within the rectangle.
You merely update the sum for a shifted rectangle
by adding a point at one end
and subtracting a point at the other end of the rectangle.
In other words, the cost of smoothing
is independent of the width of the rectangle.
And no multiplies are required.
To get a triangle, you smooth twice with rectangles.
\par
Figure~\ref{rand-pyram} shows the application of triangle smoothers
on two pulses in a plane.
The plane was first convolved with a triangle on the 1-axis
and then with another triangle on the 2-axis.
This takes each impulse and smooths it
into an interesting pyramid that I call a tent.
\plot{rand-pyram}{1.69in} {\FIGS/rand/pyram}
{ (rand-pyram) Two impulses in two dimensions
filtered with a triangle function along each spatial axis.
Left: bird's eye view.
Right: contours of constant altitude $z$. }
The expected side-boundary effect is visible on the foreground tent.
In the contour plot (of the same 120 by 40 mesh)
you see that the cross-section of the tent is rectangular near the base
and diamond shaped near the top.
The altitude of the $j$th tent face is $z = a(x-x_j)(y-y_j)$
where $(x_j, y_j)$ is the location of a corner and $a$ is a scale.
The tent surface is parabolic (like $z=x^2$) along $x=y$
but linear along lines parallel to the axes.
A contour of constant $z$ is the (hyperbolic) curve $y = a+b/(x+c)$
(where $a$, $b$, and $c$ are different constants on each of the four faces).
\label{'2Dgauss'}
\subsection{Gaussian mounds}
In Figure~\ref{rand-mound} you see the result of applying tent smoothing twice.
\plot{rand-mound}{1.47in} {\FIGS/rand/mound}
{ (rand-mound) Two impulses in two dimensions
filtered twice on each axis with a triangle function.
Left: bird's eye view.
Right: contours of constant altitude $z$. }
Notice that the contours, instead of being diamonds and rectangles,
have become much more circular.
The reason for this is briefly stated as follows:
Convolution of a rectangle with itself many times
approaches the limit of a Gaussian function.
(This is a well-known result called the ``central-limit theorem'';
it is explained in the next section.)
It happens that the convolution of a triangle with itself
is already a good approximation to the Gaussian function $z(x)= e^{-x^2}$.
The convolution in $y$ gives $z(x,y)=e^{-x^2-y^2}= e^{-r^2}$
where $r$ is the radius of the circle.
When the triangle on the 1-axis differs in width from the triangle on the 2-axis,
then the circles become ellipses.
\subsection{Speed of 2D Gaussian smoothing}
\par
This approximate Gaussian smoothing in two dimensions is very fast.
Only eight add-subtract pairs are required per output point,
and no multiplies at all except for final scaling.
The compute time is independent of the widths of the Gaussian(!)
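\par
Subroutine {\tt triangle()} itself is not reproduced here,
but the arithmetic behind the speed claim is easy to sketch.
The fragment below (Python with {\tt numpy} assumed,
so it is a model of the idea rather than the routine used for the figures)
implements the running rectangle sum
with one add and one subtract per output point,
and cascades four rectangle passes along each axis,
equivalent to two triangle passes per axis,
turning an impulse on a 120 by 40 mesh
into a quasi-Gaussian mound like the ones in Figure~\ref{rand-mound}.
The divisions by the rectangle width could be deferred to a single final scale,
so no multiplies are needed inside the loops.
\begin{verbatim}
import numpy as np

def box1d(a, w):
    # running sum of width w: one add and one subtract per point
    c = np.concatenate(([0.0], np.cumsum(a)))
    return (c[w:] - c[:-w]) / w

def mound2d(z, w, passes=4):
    # 'passes' rectangle smoothings per axis; four of them
    # amount to two triangle smoothings per axis
    for _ in range(passes):
        z = np.apply_along_axis(box1d, 0, z, w)
        z = np.apply_along_axis(box1d, 1, z, w)
    return z

z = np.zeros((120, 40))
z[60, 20] = 1.0                 # an impulse in the plane
m = mound2d(z, 5)               # nearly circular contours, like exp(-r*r)
print(m.shape, float(m.max()))  # array shrinks; no padding in this sketch
\end{verbatim}
\par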
You should understand this if you understood
that one-dimensional convolution with a rectangle
requires just one add-subtract pair per output point.
Thus this technique should be useful
in the process known as 2-dimensional slant stack.
\exer
\item Deduce that a 2-D filter
based on the subroutine {\tt triangle()} on page~\pageref{triangle}
to produce the 2-D quasi-Gaussian mound in Figure~\ref{rand-mound}
has a gain of unity at zero (two-dimensional) frequency
(also known as $(k_x,k_y)=0$).
\item Let the 2-D quasi-Gaussian filter be known as $F$.
Sketch the spectral response of $F$.
\item Sketch the spectral response of $1-F$ and suggest a use for it.
\item The tent filter can be implemented
by smoothing first on the 1-axis and then on the 2-axis.
The conjugate operator smooths first on the 2-axis and then on the 1-axis.
The operator should be self-adjoint (equal to its conjugate),
unless some complication arises at the sides or corners.
How can a dot product test be used
to see if a tent filter program is self-adjoint?
\exerend
\section{PROBABILITY AND THE CENTRAL-LIMIT THEOREM}
One way to obtain random integers from a known probability function
is to write integers on slips of paper and place them in a hat.
Draw one slip at a time.
After each drawing, replace the slip in the hat.
The probability of drawing the integer $i$ is given by the ratio $a_i$
of the number of slips containing the integer $i$
to the total number of slips.
Obviously the sum over $i$ of $a_i$ must be unity.
Another way to get random integers is to throw one of a pair of dice.
Then all $a_i$ equal zero
except $a_1 = a_2 = a_3 = a_4 = a_5 = a_6 = {1 \over 6}$.
The probability that the integer $i$ will occur on the first drawing
and the integer $j$ will occur on the second drawing is $a_i a_j$.
If you draw two slips or throw a pair of dice,
then the probability that the sum of $i$ and $j$ equals $k$
is the sum of all the possible ways this can happen.
\begin{equation}
c_k \eq \sum_i a_i a_{k-i}
\label{4-5-1}
\end{equation}
Since this equation is a convolution,
we may look into the meaning of the $Z$ transform
\begin{equation}
A(Z) \eq \cdots + a_{-1} Z^{-1} + a_0 + a_1 Z + a_2 Z^2 + \cdots
\label{4-5-2}
\end{equation}
In terms of $Z$ transforms,
the probability that $i$ plus $j$ equals $k$
is simply the coefficient of $Z^k$ in
\begin{equation}
C(Z) \eq A(Z)\, A(Z)
\label{4-5-3}
\end{equation}
\exer
\item A random number generator provides random integers 2, 3, and 6
with probabilities $p(2)=1/2$, $p(3)=1/3$, and $p(6)=1/6$.
What is the probability that any given integer $n$
is the sum of three of these random numbers?
(HINT: Leave the result in the form of coefficients of a complicated polynomial.)
\exerend
\subsection{The central-limit theorem}
The central-limit theorem is perhaps the most important theorem
in the fields of probability and statistics.
A derivation of the central-limit theorem explains why
the Gaussian probability function is so frequently encountered in nature,
not just in physics but also in the biological and social sciences.
No experimental scientist should be unaware of the basic ideas behind this theorem.
Although the result is deep and is even today the topic of active research,
we can quickly go to the basic idea.
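\par
Before taking the limit of many terms,
equation (\ref{4-5-3}) itself is easy to check numerically:
multiplying the probability polynomials
is the same as convolving the tables of probabilities.
The sketch below (Python with {\tt numpy} assumed)
convolves the single-die probabilities with themselves
and recovers the familiar $1/36, 2/36, \ldots, 6/36, \ldots, 1/36$ table
for the sum of a pair of dice.
\begin{verbatim}
import numpy as np

a = np.zeros(7)          # a_i indexed by face value; a_0 stays zero
a[1:7] = 1.0 / 6.0       # a_1 = ... = a_6 = 1/6 for one die

c = np.convolve(a, a)    # c_k = sum of a_i a_(k-i), i.e. C(Z) = A(Z) A(Z)
for k in range(2, 13):
    print("P(sum = %2d) = %d/36" % (k, round(36 * c[k])))
\end{verbatim}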
\par
From equation (\ref{4-5-3}),
if we add $n$ random numbers,
the probability that the sum of them equals $k$
is given by the coefficient of $Z^k$ in
\begin{equation}
G(Z) \eq [A(Z)]^n
\label{4-5-4}
\end{equation}
The central-limit theorem of probability says that
as $n$ goes to infinity,
the polynomial $G(Z)$ goes to a special form,
almost regardless of the specific polynomial $A(Z)$.
The specific form is such that a graph of the coefficients of $G(Z)$
comes closer and closer to fitting under the envelope
of the bell-shaped Gaussian function.
This happens because if you raise any function to a high enough power,
eventually all you can see is the highest value on the function
and its immediate environment,
i.e.~the second derivative there.
We already saw an example of this in Figure~\ref{conv-gauss}.
Exceptions to the central-limit theorem arise
(1) when there are multiple maxima of the same height, and
(2) where the second derivative vanishes at the maximum.
\plot{rand-clim}{1.11in}{\FIGS/rand/clim}{
(rand-clim) Left: wiggle plot style.
Middle: perspective.
Right: contour. }
\par
Although the central-limit theorem tells you
that a Gaussian function is the limit as $n \rightarrow \infty$,
it doesn't say anything about how fast the limit is attained.
To test it, I plotted the coefficients of $(.25/Z+.5+.25Z)^n$
for large values of $n$.
This signal consists of scaled binomial coefficients.
To keep signals in a suitable amplitude scale, I multiplied them by $\sqrt{n}$.
Figure~\ref{rand-clim} shows views of the coefficients
of $\sqrt{n}(.25/Z+.5+.25Z)^n$ (horizontal axis)
versus $\sqrt{n}$ (vertical axis).
We see that scaling by $\sqrt{n}$ has kept signal peak amplitudes constant.
We see that the width of the signal increases linearly with $\sqrt{n}$.
The contours of constant amplitude show that the various orders
are nearly perfectly self-similar, with the width stretching.
\todo{Robust estimation in robust.tex}
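\par
The experiment behind Figure~\ref{rand-clim} is easy to repeat.
The sketch below (Python with {\tt numpy} assumed)
raises $(.25/Z+.5+.25Z)$ to the $n$-th power by repeated convolution,
checks that the peak scaled by $\sqrt{n}$ stays essentially constant
(near $1/\sqrt{\pi}$),
and measures the pointwise misfit between the coefficients
and a Gaussian of matching mean and variance;
the misfit shrinks as $n$ grows.
\begin{verbatim}
import numpy as np

a = np.array([0.25, 0.5, 0.25])    # coefficients of .25/Z + .5 + .25 Z
var1 = 0.25 + 0.25                 # variance of one term (values -1, 0, +1)

for n in (16, 64, 256):
    g = np.array([1.0])
    for _ in range(n):
        g = np.convolve(g, a)      # coefficients of [A(Z)]^n
    k = np.arange(len(g)) - n      # lag axis for the powers Z^(-n) ... Z^n
    var = n * var1
    gauss = np.exp(-k ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    print("n=%4d  sqrt(n)*peak=%.4f  max misfit=%.1e"
          % (n, np.sqrt(n) * g.max(), np.max(np.abs(g - gauss))))
\end{verbatim}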