%+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
%+                                                                           +
%+ NOTICE: This article is also available formatted with illustrations.     +
%+                                                                           +
%+ SEE: http://sepwww.stanford.edu/oldreports/sep61/                         +
%+                                                                           +
%+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
\lefthead{Gilles Darche}
\righthead{$L^1$ Deconvolution}
\footer{SEP-61}
\title{Iterative {\Large $L^1$} deconvolution}
\author{Gilles Darche}
\def\im{/r0/gilles/text}
\def\beq{\begin{equation}}
\def\eeq{\end{equation}}
\def\xvec{(x_1,\ldots,x_{n_x})}
\def\yvec{(y_1,\ldots,y_{n_y})}
\def\bit{\begin{itemize}}
\def\eit{\end{itemize}}
\ABS{ An alternative to the classical Wiener deconvolution is to minimize the $L^1$ norm of the residuals rather than their $L^2$ norm. This minimization is achieved with an iteratively reweighted least-squares algorithm, which finds $L^1$ minima starting from $L^2$ minima. The solution found by this algorithm is still close to the result of the $L^2$ deconvolution, and does not resolve the indeterminacy of the phase of the seismic wavelet. However, this algorithm demonstrates the efficiency of $L^1$ deconvolution in the presence of noise bursts in the seismic data, and could be used for more general noise problems. }
\INT
For about ten years, iterative algorithms known as IRLS (Iterative Reweighted Least-Squares) algorithms have been developed to solve $L^p$-norm minimization problems for $1\leq\!p\!\leq 2$, which lie between the least-absolute-values problem and the classical least-squares problem. Their main advantage is to provide an easy way to compute an $L^1$ solution; previously, $L^1$ solutions were found by linear programming techniques (Taylor et al., 1979), which require a large amount of computer memory. $L^1$ solutions are known to be more ``robust'' than $L^2$ solutions, being less sensitive to spiky high-amplitude noise (Claerbout and Muir, 1973). The applications of IRLS algorithms to geophysical problems have so far been limited to travel-time tomography (Scales et al., 1988) and 1-D acoustic inversion (Gersztenkorn et al., 1986). A simple example of deconvolution on a synthetic trace was given by Yarlagadda et al. (1985).

Each iteration of the IRLS algorithm consists of a weighted least-squares minimization, which is usually solved by a conjugate-gradient method. In many cases, such as tomography, this method is very efficient; if the linear systems involved are sparse, it is also computationally fast. However, deconvolution processes do not offer the same guarantees of computational efficiency, because they are inherently ill-conditioned and require many computations. Consequently, this algorithm demands some care before it can be implemented for deconvolution problems.

I will show the efficiency of the $L^1$-norm algorithm in the presence of spiky noise, using a simple synthetic example. This algorithm will in fact sharpen the $L^2$ residuals, as I will illustrate for predictive deconvolution; however, in that case, it does not improve the estimation of the seismic wavelet. This illustration will be complemented by an example on a marine CMP gather. Finally, I suggest extending this algorithm to a mixed $L^1-L^2$ norm, for more general noise problems.

\mhead{$L^p$ MINIMIZATION WITH THE IRLS ALGORITHM}
\shead{Normal equations}
The iterative reweighted least-squares (IRLS) algorithm provides a means by which linear systems can be solved by minimizing the $L^p$ norm of the residuals ($1\leq\! p\!\leq 2$).
Let us suppose that we try to solve the set of equations: $$ A.x = y \mbox{\hspace{2.0cm}} x\in {\cal R}^m,\; y\in {\cal R}^n,\; A\in {\cal R}^{n\times m}\:.$$ To find the $L^p$ solution of this system we minimize the $L^p$ norm of the residuals: $$ {\cal N}_p(x) = (\sum_{i=1}^n |y_i - \sum_{j=1}^m A_{ij}x_j|^p)^{{1\over p}} = {\cal L}_p(y-A.x)\:.$$ ${\cal L}_p$ is not a norm for $p<1$, so, for the moment, I will only consider $p\geq 1$. I also omit the exponent $1/p$ in what follows. To establish the normal equations, we set: $$ {\partial{\cal N}_p \over\partial x_k} = 0 \mbox{\hspace{2.0cm} for $k=1$ to $m$}\:.$$ The partial derivatives can be expressed as follows: $$ {\partial{\cal N}_p \over\partial x_k} = -p\sum_{i=1}^n A_{ik}{\rm sign}(r(i))\:|r(i)|^{p-1}\:,$$ where $r(i)=y_i-\sum_{j=1}^mA_{ij}x_j$ are the residuals. Then: \beq {\partial{\cal N}_p \over\partial x_k} = -p\sum_{i=1}^n A_{ik}r(i) |r(i)|^{p-2}\:. \eeq To form the normal equations we can now replace $r(i)$ by its expression: $$ \sum_{i=1}^n A_{ik}|r(i)|^{p-2}(\sum_{j=1}^mA_{ij}x_j) = \sum_{i=1}^n A_{ik}|r(i)|^{p-2}y_i$$ or, after some reorganization: \beq \sum_{j=1}^m(\sum_{i=1}^nA_{ik}|r(i)|^{p-2}A_{ij})x_j = \sum_{i=1}^n A_{ik}|r(i)|^{p-2}y_i\:. \eeq There are $m$ such equations ($k=1$ to $m$). They can be expressed compactly in matrix form by defining: \beq W(i) = |r(i)|^{p-2} \mbox{\hspace{1.0cm} for $i=1$ to $n$,} \label{weight} \eeq $$ W = {\rm diag}(W(1),\cdots,W(n))\;,$$ and finally: \beq A^TWA.x = A^TW.y\;. \label{normal} \eeq This formulation is implicit and non-linear, since the residuals $r(i)$, and thus the matrix $W$, depend on the unknown vector $x$. However, for the classical least-squares inversion ($p$=2), $W$ is the identity matrix, and equation~(\ref{normal}) is the usual least-squares equation: \beq A^TA.x = A^T.y\;. \label{L2normal} \eeq

\shead{IRLS algorithm}
At iteration $k+1$, the algorithm solves: \beq A^TW^kA.x^{k+1} = A^TW^k.y \label{iter} \eeq by taking: \bit \item $W^0$ = $I_n$ (identity matrix) at the first iteration, \item $W^k$ formed with the residuals of iteration $k$ ($r^k=y-Ax^k$) at iteration $k+1$. \eit Byrd and Payne (1979) showed that this algorithm is convergent under two conditions: \bit \item $W(i)$ must be non-increasing in $|r(i)|$, \item $W(i)$ must be bounded for all $i$. \eit The first condition is satisfied for $p\leq 2$; to ensure the second condition, Huber (1981) suggested replacing equation~(\ref{weight}) with: \beq W(i) = \left\{ \begin{array}{ll} |r(i)|^{p-2} & \mbox{if $|r(i)|>\varepsilon$} \\ \varepsilon^{p-2} & \mbox{if $|r(i)|\leq \varepsilon$} \end{array} \right. \label{eps} \eeq for a small positive constant $\varepsilon$. \\ Equations~(\ref{iter}) are the normal equations of the least-squares minimization of: $${\cal N}_2^k(x) = (y-Ax)^TW^k(y-Ax) \:.$$ $W^k$ plays the role of a weighting matrix, the inverse of a covariance matrix. By giving the large residuals small weights (or large variances), it reduces their influence; conversely, well-predicted values are given large weights (small variances). Aberrant values, receiving low weights, will have a small influence: this makes the algorithm robust, less sensitive to noise bursts in the data. This is especially true for the extreme case $p=1$, as predicted by Claerbout and Muir (1973). However, the $L^1$ norm is not strictly convex, so the $L^1$ minimization problem may have non-unique solutions. The IRLS algorithm will lead us to one of these solutions.
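As a concrete sketch of equations~(\ref{iter}) and~(\ref{eps}), the IRLS loop for $p=1$ can be written as follows (illustrative {\tt numpy}-style code, assuming a generic dense matrix $A$; the weighted systems are solved directly here, whereas the deconvolution applications discussed below use a conjugate-gradient solver):
\begin{verbatim}
import numpy as np

def irls(A, y, p=1.0, eps=1e-3, niter=10):
    """Sketch of the IRLS algorithm: solve A.x = y in the L^p sense
    (1 <= p <= 2) by repeated weighted least squares."""
    n, m = A.shape
    x = np.zeros(m)
    W = np.ones(n)                    # W^0 = identity: first pass is L^2
    for _ in range(niter):
        # weighted normal equations A^T W A x = A^T W y  (equation (iter))
        AtW = A.T * W                 # A^T times diag(W), by broadcasting
        x = np.linalg.solve(AtW @ A, AtW @ y)
        r = y - A @ x                 # residuals of this iteration
        # bounded weights of equation (eps): |r|^(p-2), clipped at eps
        W = np.maximum(np.abs(r), eps) ** (p - 2.0)
    return x
\end{verbatim}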
The fact that the first iteration ($W^0=I_n$) is a least-squares inversion suggests that this algorithm will lead us to a solution close to the $L^2$ solution. The choice of $W^0(i)=|y(i)|^{p-2}$ is not reasonable: during the algorithm, $x$ stays close to 0, and the algorithm converges (if it does) to a vector $x_{\infty}$ also close to zero, with the residuals staying close to the original vector $y$.

\shead{Resolution of the weighted normal equations}
We still have to solve the normal equations~(\ref{iter}) at each step of the general algorithm. I omit the indices $k$ and $k+1$ in what follows. The matrix $A^TWA$ is symmetric and positive. However, it does not have a particular structure (such as the Toeplitz structure of $A^TA$ in deconvolution). The computation of $A^TWA$ and of its inverse is very expensive, in both time and memory. That is why several authors (Yarlagadda et al., 1985; Gersztenkorn et al., 1986; Scales et al., 1988) prefer iterative methods, especially conjugate-gradient (``CG'') methods, to direct inversion methods such as singular-value decomposition or orthogonalization by Givens rotations. The advantages of conjugate-gradient methods are numerous. The matrix $A^TWA$ does not need to be computed and stored. The algorithm ``only'' requires the computation of the gradient $g = A^TW.e$ ($e$: residuals) and of $WA.h$ ($h$: conjugate direction). The vectors $e$ and $h$ are updated at each iteration of the CG algorithm. Since $W$ is diagonal, its product with a vector is trivial. Moreover, if the matrix $A$ is sparse, its product with a vector can also be computed cheaply. If the problem is ill-conditioned, limiting the number of iterations is supposed to have an effect similar to that produced by adding a damping factor. Finally, the convergence of the CG algorithm is controlled by three factors: \bit \item the condition number of the problem; \item the choice of an initial estimate $x^0$ of $x$: the problem is then restated as solving $(y-Ax^0) - A.\delta x = 0$; \item the structure of the eigenspectrum of $A^TWA$ itself: the more clustered the eigenvalues are, the faster the algorithm converges. \eit In theory, if the matrix $A^TWA$ has $N$ distinct non-zero eigenvalues, the CG algorithm should converge, in the absence of round-off errors, in $N$ iterations. I will use the conjugate-gradient method in the IRLS algorithm applied to deconvolution. However, as I will show, its efficiency will not be optimal unless some precautions are taken. In what follows, I mainly consider the case of $L^1$ deconvolution, with $p=1$.

\mhead{IRLS ALGORITHM APPLIED TO $L^1$ DECONVOLUTION}
\shead{Formulation of the problem}
We want to solve the deconvolution problem: \beq y = a * x , \label{conv} \eeq $y$ being the data, $x$ the unknowns, $a$ the convolution filter. It can be written in the matrix form $A.x=y$, with: $$ x=\xvec^T\;,\mbox{\hspace{1.0cm}} y=\yvec^T\;,\mbox{\hspace{1.0cm}}n_y = n_a + n_x -1\;,$$ \beq A=\pmatrix{a_1 & & & & \cr a_2 &a_1& &0 & \cr \vdots&a_2&\ddots & & \cr \vdots& &\ddots&\ddots& \cr a_{n_a}& & &\ddots&a_1\cr &a_{n_a}& & &a_2\cr & &\ddots& &\vdots\cr &0 & &\ddots&\vdots\cr & & & &a_{n_a}\cr} \mbox{\hspace{1.5cm} ($n_y\times n_x$ matrix)} \label{matrix} \eeq \\ I will consider two kinds of deconvolution: \bit \item predictive deconvolution: $(y_i) = (y_{i-1}) * f$, where $f$ is the unknown predictive filter; \item deconvolution by a known wavelet: $y = w * r$, $w$ being this wavelet, $r$ the unknown ``reflection coefficients''.
\eit

\shead{Ill-conditioning of $L^2$ deconvolution processes}
To ensure the convergence of the IRLS algorithm, the numerical result of each iteration must be close to the numerical limit of the corresponding CG algorithm, because it will be used through $W$ as input for the next iteration. Moreover, the convolution matrices $A$ and $W^{{1\over 2}}A$ may not be sparse (equation~(\ref{matrix})), especially when $n_x\ll n_y$, as in predictive deconvolution. Many operations are then required to compute the convolutions and correlations in the conjugate-gradient algorithms, so the influence of round-off errors increases. A study of the convergence of the conjugate-gradient algorithm in this particular case is therefore necessary.

The first iteration of the IRLS algorithm is a least-squares inversion, so we can use some results about $L^2$ deconvolution problems. According to Szeg\"{o}'s theorem (Ekstrom, 1973), since $A^TA$ is a Toeplitz symmetric positive matrix, its eigenvalues ($\lambda_1^2,\cdots,\lambda_{n_x}^2$) can be expressed in terms of the Fourier transform $\hat{a}$ of the filter $a$. In particular, if ($a_i$) does not suffer from aliasing, the eigenvalues approximately sample the power spectrum of the filter: $$\lambda_i^2 \approx |\hat{a}(\omega_i)|^2\:,\mbox{\hspace{1.0cm}}\omega_i = {2(i-1)\pi \over (n_x-1)\Delta t}\mbox{\hspace{1.0cm} ($i=1$ to $n_x$)}\:,$$ where $\Delta t$ is the time sampling of the vectors $y$, $a$ and $x$ (Figure~\ref{power}). This implies that, if the power spectrum of the filter $a$ has some amplitudes near 0, for example if $a$ is band-limited, the problem should be ill-conditioned. This is the case in predictive deconvolution, where the filter $a$ is the seismic trace itself. Moreover, by oversampling the problem, we push the Nyquist frequency well above the highest frequency of the filter and create smaller eigenvalues; we thus increase the condition number of the problem even further (Figure~\ref{power}). Intuitively, we come closer to an infinite-dimensional problem, where $A^TA$ has an infinite set of positive eigenvalues decreasing to 0 (Hilbert-Schmidt theorem for compact self-adjoint operators); this limit value of zero causes the ill-conditioning of the infinite-dimensional problem. \plot{power}{3.5in}{\im/Wminpower} { Relationship between the power spectrum of the filter $a$ (continuous line) and the eigenspectrum of the matrix $A^TA$ (crosses). (a) $n_a=512$, $\Delta t=4\; msec$, $n_x=17$ (b) $n_a=1024$, $\Delta t=2\; msec$, $n_x=17$. } In fact, we must be careful with Szeg\"{o}'s theorem. Milinazzo et al. (1987) have shown that, even if the power spectrum of the filter $a$ has some zero values, the minimum eigenvalue $\lambda_{min}$ of $A^TA$ might not vanish. They use an asymptotic expansion of $\lambda_{min}$ as a function of $n_x$, which indeed goes to 0 as $n_x \rightarrow \infty$. Small problems might thus be well-conditioned even if there are zeros in the power spectrum of the convolution filter $a$. Finally, two further remarks. First, adding some white noise increases the minimum eigenvalue and decreases the condition number, since the maximum eigenvalue is hardly modified. Secondly, if we want to apply CG algorithms to least-squares deconvolution, as in the first step of the IRLS algorithm, the convergence will be accelerated if the eigenvalues are clustered in groups; this will not be the case with smooth or very irregular spectra.
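As a concrete illustration of this relationship (a hypothetical {\tt numpy} sketch, not part of the original processing), one can build the convolution matrix of equation~(\ref{matrix}) for a band-limited filter and compare the eigenvalues of $A^TA$ with the power spectrum of the filter; the smallest eigenvalues fall where the spectrum is small, which produces the ill-conditioning discussed above:
\begin{verbatim}
import numpy as np

def convolution_matrix(a, nx):
    """Banded (n_y x n_x) convolution matrix of equation (matrix)."""
    na = len(a)
    A = np.zeros((na + nx - 1, nx))
    for j in range(nx):
        A[j:j + na, j] = a
    return A

# a band-limited "trace" used as convolution filter: a damped 40-Hz cosine
# (illustrative stand-in only, not the synthetic trace of the examples)
dt = 0.004
t = np.arange(512) * dt
a = np.exp(-40.0 * t) * np.cos(2 * np.pi * 40.0 * t)

A = convolution_matrix(a, nx=17)
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # the n_x eigenvalues
power = np.abs(np.fft.rfft(a)) ** 2                # power spectrum of a

print("condition number of A^T A:", eig[0] / eig[-1])
\end{verbatim}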
\shead{Stability of the IRLS algorithm in deconvolution}
The results concerning the conditioning and the eigenvalues of equations~(\ref{iter}) are unfortunately less precise. The matrix $A^TWA$ is no longer Toeplitz. Computing its eigenvalues would require the singular-value decomposition of $W^{{1\over 2}}A$, which increases the cost of the computations. I could only observe that $A^TA$ and $A^TWA$ have more or less the same condition number during the IRLS algorithm; consequently, the weighted least-squares problems are also ill-conditioned if the power spectrum of the original filter $a$ is band-limited. However, I have no general results concerning the distribution of the eigenvalues of $A^TWA$.

The choice of $\varepsilon$ is not completely free. The synthetic examples I used showed that the more ill-conditioned the initial matrix $A^TA$ is, the larger $\varepsilon$ must be taken; otherwise the IRLS algorithm does not converge. For example, with a condition number equal to 1000, I took $\varepsilon = \mbox{Max}|y|/100$, and convergence (rate 1/10000) was reached in 10 iterations; with a condition number equal to 10000, I had to take $\varepsilon = \mbox{Max}|y|/10$, even in double precision. Increasing $\varepsilon$ makes $W$ closer to the identity matrix, in which case the problem corresponds more to an $L^2$ minimization: we would be solving a mixed $L^1-L^2$ problem, as I will explain later.

In conclusion, I expect the convergence of the CG algorithms to be slow if the power spectrum of the filter $a$ contains some small values. Despite the suggestion to limit the number of iterations in order to cope with the ill-conditioning, I prefer to increase their number, to get closer to the numerical limits. However, the speed of convergence is greatly increased if it is possible to give a first estimate of the unknown $x$: that is the case in predictive deconvolution, where the first estimate would be the $L^2$-Wiener filter.

\mhead{SYNTHETIC EXAMPLE}
I will illustrate the properties of the $L^1$ deconvolution on a synthetic trace, first without noise (``pure trace''), then with noise (several random spikes). Two kinds of deconvolution will be considered: deconvolution with a known wavelet, and predictive deconvolution. Though spiky noise is not usual in real data, it might be found with some kinds of acquisition (with a bad marine streamer, for example) and is representative of the robustness of $L^1$-norm processes.

Figure~\ref{synth} is a plot of all the synthetic inputs I used. The time sampling is $\Delta t$=4 msec; the traces contain 512 samples. The first trace represents the synthetic wavelet; I chose it to be minimum-phase, in order to optimize the result of the standard $L^2$-Wiener predictive deconvolution, which I will compare to the $L^1$ deconvolution. Its frequency band is 10-70 Hz. The second trace represents a spiky sequence of reflection coefficients. Next comes the convolution between the synthetic wavelet and this spiky sequence. Finally, the fourth trace represents the noisy trace, formed by adding five high-amplitude spikes to the pure trace. \vspace{1.0cm} \plot{synth}{3.30in}{\im/Hsynput} { Synthetic inputs: (a) Minimum-phase wavelet (b) Reflection sequence \hspace{2.0cm} (c) Convolved trace (d) Noisy trace. The four traces are represented with the same maximum amplitude. } \plot{kw12}{3.30in}{\im/Hkw12} { Deconvolution with a known wavelet ({\it no noise}): (a) original sequence (b) $L^2$ deconvolution (c) $L^1$ deconvolution.
}

\shead{Deconvolution with a known wavelet}
Figure~\ref{kw12} shows the results of the $L^2$ and $L^1$ deconvolutions for the pure trace. Because the original wavelet lacked very low and very high frequencies, I used a damping factor for both deconvolutions: 1/1000 of the autocorrelation $\sum_i|w(i)|^2$ in the $L^2$ algorithm, and 1/1000 of the weighted autocorrelation $\sum_iW(i)|w(i)|^2$ at each iteration of the $L^1$ algorithm. The results are similar, as no noise was introduced. The ringing around the main peaks comes from the lack of high frequencies in the data, which effectively convolves the output with a ``sinc'' function (the impulse response of a high-cut filter).

Then I applied the same process to the noisy trace. The results are presented in Figure~\ref{nkw12}. Even with a damping factor, the $L^2$ deconvolution cannot avoid the influence of the noise, because a damping factor is suited to gaussian noise. By giving the same weight to all the residuals ($W$ = identity matrix), it overestimates the importance of the noise bursts, and damages the output around these bursts. In contrast, as expected, the $L^1$ deconvolution is insensitive to this noise. \plot{nkw12}{3.30in}{\im/Hnkw12} { Deconvolution with a known wavelet ({\it noisy trace}): (a) original sequence (b) $L^2$ deconvolution (c) $L^1$ deconvolution. }

\shead{Predictive deconvolution}
The problem is now to estimate a predictive filter $f$ from the input traces. As the input traces have 512 samples, I took the length of the filter equal to 50 samples. The residuals of $L^2$ and $L^1$ deconvolution on the pure trace are similar (Figure~\ref{Pred12}). As the input wavelet is minimum-phase, the $L^2$ deconvolution is effective. However, the $L^1$ deconvolution has still improved the result. Assuming that the a priori statistical distribution of the residuals is {\it exponential} (and not gaussian as in $L^2$ deconvolution), it has forced the smallest values of the $L^2$ residuals to come closer to 0, and has increased the highest values. \plot{Pred12}{3.30in}{\im/Hpred} { Predictive deconvolution ({\it pure trace}): (a) original sequence (b) $L^2$ deconvolution (c) $L^1$ deconvolution. } In this case, the IRLS algorithm converged quickly, because the initial condition number was low ($\approx 1000$); with real data, we can expect this number to be larger than $10^4$. I took the initial estimate of the filter equal to the $L^2$ filter. In Figure~\ref{Mat}, I plotted the matrices $A^TA$, $A^TW^1A$, and $A^TW^{30}A$; $W^1$ and $W^{30}$ were respectively computed with the $L^2$ residuals (after the first iteration of the IRLS algorithm) and the $L^1$ residuals (after the $30^{th}$ iteration of the IRLS algorithm). The non-Toeplitz structure of $A^TW^1A$ is evident. However, $A^TW^{30}A$ is close to a Toeplitz structure, because $W$ becomes closer to a multiple of the identity matrix (the factor being $1/\varepsilon$): as the algorithm proceeds, more and more residuals become smaller than $\varepsilon$, and their weights all become equal to $1/\varepsilon$. \plot{Mat}{2.5in}{\im/Hmat} { Predictive deconvolution ({\it pure trace}): (a) Matrix $A^TA$ (b) Matrix $A^TWA$ after 1 IRLS iteration (c) Matrix $A^TWA$ after 30 IRLS iterations.
}
%\sideplot{Mat1}{3.3in}{\im/Hmatp1bis}
%{Near-Toeplitz structure of the matrix $A^TW_{L^1}A$; $W_{L^1}$ is computed
%with the residuals of the last iteration of the algorithm.}
\plot{Pred12n}{8.00in}{\im/Hnpred} { Predictive deconvolution ({\it noisy trace}): (a) original sequence (b) $L^2$ deconvolution (c) $L^1$ deconvolution (d) perturbation of the pure $L^2$ residuals (e) perturbation of the pure $L^1$ residuals. }

Finally, I applied $L^2$ and $L^1$ predictive deconvolution to the noisy trace. The residuals are represented in Figure~\ref{Pred12n}; I also plotted the difference between these residuals and the residuals of the pure trace. In both cases, the residuals still contain the original noise bursts. This is not surprising, because the predictive filters could not predict them. It is more interesting to study the influence of the noise bursts on the rest of the trace. The two plots of the perturbation of the residuals after introduction of the noise show that the $L^2$ residuals are much more perturbed than the $L^1$ residuals. This means that the $L^2$ predictive filter has tried to remove these noise bursts, and its modification has of course transformed the rest of the trace. In contrast, the $L^1$ filter, being insensitive to the spikes, has hardly been changed: its residuals are not very different from the initial ones, except for the noise bursts. Thus, the $L^1$ deconvolution is more reliable than the $L^2$ deconvolution.

\shead{Other families of noise}
I tried the same processing with exponential and gaussian noise. Unfortunately, the $L^1$ minimization is no longer effective. The noise samples now have various amplitudes, with about the same distribution as the original trace, and cannot be isolated from the seismic samples by the criterion of small weights: they receive all kinds of weights. The $L^1$ norm itself does not improve the result of the $L^2$ deconvolution; the most usual way to suppress this noise is to use large damping factors, which is not specific to the $L^1$ norm. Moreover, even if the noise $n$ is gaussian, white, and uncorrelated with the seismic data, the $L^1$ normal equations cannot use these properties, because the correlation terms ($n^Tn$, $y^Tn$) will be weighted by the matrix $W$ and replaced by $n^TWn$ and $y^TWn$ (which may not vanish). I will suggest in the last section a method to cope with two kinds of noise at once, for example gaussian and spiky noise, using the $L^1$ and $L^2$ norms simultaneously.

\shead{$L^1$ norm and non-minimum-phase wavelets}
A predictive $L^2$ deconvolution produces a minimum-phase inverse filter, independently of the phase of the wavelet $w$. Thus it is interesting to study the effect of the $L^1$ deconvolution on non-minimum-phase inputs. I used the same input reflection sequence as before, but changed the input wavelet into a constant-phase ($60^\circ$) wavelet. I applied $L^2$ and $L^1$ deconvolution to the new synthetic trace. I initialized the $L^1$ algorithm first with the $L^2$ filter, then without any initial filter. The results are shown in Figure~\ref{60-phase}. \plot{60-phase}{8.0in}{\im/H60dec} { Predictive deconvolution - $60^\circ$-phase wavelet: (a) Input wavelet (b) Synthetic trace (c) $L^2$ residuals (d) $L^1$ residuals, $L^2$ filter as first estimate (e) $L^1$ residuals, no first estimate. } It appears that the $L^1$ deconvolution does not improve the phase properties of the output.
In one case (starting from the $L^2$ filter), the filter is from the start close to the $L^2$ filter; in the other case, the first iteration of the IRLS algorithm forces the filter to be close to the $L^2$ filter. In the first case, the $L^1$ deconvolution results in a sharpening of the $L^2$ result, again because the residuals have an a priori exponential distribution. The $L^1$ deconvolution can be related to variable norm deconvolution (Gray, 1979), in the particular case where the processing tries to drive the statistical distribution of the time series from a gaussian one (seismic trace) to an exponential one (residuals). However, $L^1$ and Gray's deconvolutions are slightly different, because the IRLS algorithm does not assume any a priori distribution for the inputs; the normal equations of variable norm deconvolution are in fact a linear combination of the $L^2$ normal equations and of the $L^1$ normal equations.

In conclusion, the IRLS algorithm in predictive deconvolution essentially sharpens the residuals. If an initial estimate is used (Wiener filter, Burg filter, or any minimum-entropy deconvolution filter), the residuals of {\it this initial filter} will be sharpened; otherwise, the result will be a sharpened version of the $L^2$ residuals. It would in fact be interesting to modify the first iteration of the algorithm, but would convergence still be guaranteed?

\mhead{EXAMPLE ON REAL DATA}
I applied $L^1$ deconvolution to a marine CMP gather from BP Alaska. This gather, recorded in Southern California, is composed of 36 traces; the near offset is 65.7 m, the intertrace distance 33.3 m. The time window I used is 0.7-3.0 sec; the time sampling is 4 msec. The CMP gather is presented in Figure~\ref{Hreal}, along with the residuals of the $L^2$ deconvolution. For graphical convenience, I only present a time window between 0.7 and 2 sec. The residuals were computed with the same 50-sample filter for all the traces; I used a small prewhitening factor (5 \%), in accordance with the low level of noise before the first arrival. It appears from the first arrival that the seismic wavelet has a precursor, making it clearly non-minimum-phase; the source was composed of two water-guns.

I applied $L^1$ deconvolution to this gather, with the $L^2$ predictive filter as initial estimate; I computed one 50-sample filter for all traces. The residuals of this $L^1$ deconvolution are presented in Figure~\ref{Hreal1}, along with the difference between the $L^2$ residuals and the $L^1$ residuals. As I expected, the $L^1$ residuals are a sharpened version of the $L^2$ residuals: the largest amplitudes have been increased, the smallest have been decreased. This is not really obvious on the sections, but it appeared clearly when I computed the kurtosis of the traces in the central part of the gather (between 1.5 and 2.5 sec, to avoid the strong water-bottom reflections): the kurtosis values of the $L^1$ residuals were larger than those of the $L^2$ residuals. It also appears on the difference section when we look at the water-bottom reflection. However, the estimation of the phase of the seismic wavelet has not been improved; in particular, the precursor has not been removed, but accentuated. This sharpening of the residuals has the advantage of increasing the contrast between reflectors and stressing the main reflectors. Moreover, as this gather presents some lateral variations of amplitude, these variations will be enhanced by the sharpening of the residuals.
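The kurtosis used above as a sharpening diagnostic can be computed with a simple measure such as the following (a hypothetical sketch; {\tt residuals\_l2} and {\tt residuals\_l1} stand for the windowed residual traces and are not names from the original processing):
\begin{verbatim}
import numpy as np

def kurtosis(trace):
    """Normalized fourth moment of a trace: values well above 3 indicate
    a spiky, 'sharpened' series; a gaussian series gives about 3."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

# comparison over the 1.5-2.5 s window of each trace of the gather:
# k_l2 = [kurtosis(tr) for tr in residuals_l2]
# k_l1 = [kurtosis(tr) for tr in residuals_l1]
\end{verbatim}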
I also tried an $L^p$ deconvolution with $p=0.1$, though the objective function is then no longer a norm. However, with the $L^2$ filter as initial estimate, the algorithm converged. As expected, the result was once again a sharpening of the $L^2$ residuals, but it was very similar to the result of the $L^1$ deconvolution. \plot{Hreal}{8.0in}{\im/Hreal} { Marine CMP gather; 36 traces, 0.7 to 2 seconds. Bottom: original CMP gather. Top: residuals of the $L^2$ deconvolution (filter length: 200 msec). Courtesy of BP Alaska. } \plot{Hreal1}{8.0in}{\im/Hreal1} { The same marine CMP gather as in Figure~\ref{Hreal}. Bottom: residuals of the $L^1$ deconvolution. Top: difference between the $L^2$ residuals and the $L^1$ residuals ($L^2$ minus $L^1$), amplitudes multiplied by 1.5. }

\mhead{MIXED $L^1-L^2$ NORM MINIMIZATIONS}
Though the theoretical problem is an $L^p$-norm minimization, the introduction of the parameter $\varepsilon$ changes the problem that is actually solved. In fact, the limit $x_{lim}$ of the algorithm will solve the normal equations: \beq A^TW_{\varepsilon}A.x_{lim} = A^TW_{\varepsilon}.y \label{real} \eeq where: $$ r(i) = (y - A.x)(i) $$ \beq W_{\varepsilon}(i) = \left\{ \begin{array}{ll} |r(i)|^{p-2} & \mbox{if $|r(i)|>\varepsilon$} \\ \varepsilon^{p-2} & \mbox{if $|r(i)|\leq \varepsilon$} \end{array} \right. \label{Weps} \eeq These equations are the normal equations of the following minimization problem: \beq \left\{ \begin{array}{ll} {\rm Min} \mbox{\hspace{0.5cm}} (\varepsilon^{p-2}/2)\sum_{i/|r(i)|<\varepsilon}|r(i)|^2 + {1\over p}\sum_{i/|r(i)|>\varepsilon}|r(i)|^p \\ r(i) = (y - A.x)(i) \;. \end{array} \right. \label{mixte} \eeq Indeed, if $|r(i)|<\varepsilon$, a small perturbation of $x$ will maintain this inequality, because of the continuity of the matrix product. This property also holds for the other inequality, $|r(i)|>\varepsilon$. It is no longer true for non-strict inequalities ($\leq$ or $\geq$). So, we can calculate the gradient of the objective function in equation~(\ref{mixte}) and keep the same inequalities; if I call this function ${\cal N}_{mix}$, then: \beq {\partial{\cal N}_{mix} \over\partial x_k} = -\varepsilon^{p-2}\:\sum_{i/|r(i)|<\varepsilon} A_{ik}r(i) -\sum_{i/|r(i)|>\varepsilon} A_{ik}r(i)|r(i)|^{p-2} \;. \eeq Finally, replacing $r(i)$ by $(y - A.x)(i)$, we find the normal equations~(\ref{real}). This result holds if the objective function ${\cal N}_{mix}$ is minimized for a vector $x$ such that none of the residuals $|r(i)|$ is equal to $\varepsilon$.

So, the IRLS algorithm minimizes what could be called a {\it mixed} function. But this objective function ${\cal N}_{mix}$ is not convex: combining any two vectors $x$ and $x'$ can completely modify the partition of the residuals into those larger and those smaller than $\varepsilon$ if the two vectors are not close to each other. In fact, this function is strictly convex inside the regions bounded by the $2n_y$ hyperplanes: $$ r(i)=\varepsilon \mbox{\hspace{1.0cm}and \hspace{1.0cm}} r(i)=-\varepsilon \;,$$ and it is not continuous on these hyperplanes. In Figure~\ref{mix}, I plotted for example the function: $${\cal N}_{mix}(x_1,x_2) = {1\over 4}\sum_{i/|x_i|<2}|x_i|^2 + \sum_{i/|x_i|>2}|x_i|\;,$$ for $|x_i|<4$, obtained with $p$=1 and $\varepsilon$=2. This kind of function has one global minimum, and no other local minimum. However, if we cross the hyperplanes to reach the region of the global minimum, there might be problems computing the gradient near these hyperplanes, and hence convergence problems.
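To make the role of $\varepsilon$ concrete, the bounded weights and the mixed objective of equation~(\ref{mixte}) can be sketched as follows for $p=1$ (hypothetical {\tt numpy} code, not from the original report); residuals smaller than $\varepsilon$ are treated quadratically (the $L^2$ region), the others in absolute value (the $L^1$ region):
\begin{verbatim}
import numpy as np

def mixed_objective(r, eps):
    """N_mix of equation (mixte) for p = 1: quadratic inside the band
    |r| < eps, absolute value outside (discontinuous at |r| = eps)."""
    small = np.abs(r) < eps
    return (0.5 / eps) * np.sum(r[small] ** 2) + np.sum(np.abs(r[~small]))

def irls_weights(r, eps):
    """Bounded IRLS weights W_eps(i) of the same problem (p = 1)."""
    return 1.0 / np.maximum(np.abs(r), eps)

# example: the two-variable function plotted in Figure 'mix' (eps = 2)
r = np.array([1.0, 3.0])
print(mixed_objective(r, eps=2.0))   # 0.25*1 + 3 = 3.25
print(irls_weights(r, eps=2.0))      # [0.5, 1/3]
\end{verbatim}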
These gradient problems do not appear with the IRLS algorithm and the $L^1$ norm, because $\varepsilon$ is small: the central region is small, and the discontinuity (whose jump is proportional to $\varepsilon$) becomes negligible. In fact, the fewer hyperplanes are crossed during the algorithm, the more efficient the convergence will be. Thus, it is better to have from the beginning as many residuals as possible smaller than $\varepsilon$; that is why the use of a first estimate of the filter enhances the convergence, and too small an $\varepsilon$ hinders it.

There is in fact another reason which explains the slow convergence of the IRLS algorithm for a small $\varepsilon$. A {\it theoretical} $n$-component $L^1$ solution must produce at least $n$ residuals equal to zero. But during the algorithm, if any of the residuals becomes smaller than $\varepsilon$, it creates a weight $\varepsilon^{p-2}$ for the next iteration. If $\varepsilon$ is small, this weight will be relatively large, and will force the same residual, at the next iteration, to be smaller than $\varepsilon$. Thus, the algorithm tries to keep this residual small, independently of whether or not it belongs to the set of theoretical zero residuals, and we might need many iterations to force the algorithm to ``forget'' this residual. \plot{mix}{4.3in}{\im/Hmix} { Mixed function ${\cal N}_{mix}$ for $y=(y_1,y_2)=0$, $A$=Identity matrix, $x=(x_1,x_2)$, $p=1$, with $\varepsilon$=2 ($x_1$ and $x_2$ range from -4 to 4). }

The objective function ${\cal N}_{mix}$ could be interesting for a mixed $L^1-L^2$ problem, for example, if $\varepsilon$ were taken large enough, in the minimization problem: \beq \left\{ \begin{array}{ll} {\rm Min}\mbox{\hspace{0.5cm}} (1/2\varepsilon)\sum_{i/|r(i)|<\varepsilon}|r(i)|^2 + \sum_{i/|r(i)|>\varepsilon}|r(i)| + \alpha\sum_{i}|x(i)|^2\\ r(i) = (y - A.x)(i) \;. \end{array} \right. \label{final} \eeq The convergence would be ensured by the proximity to the $L^2$ minimization problem. Problems computing the gradient on the hyperplanes could be solved by smoothing the objective function near these hyperplanes. Such a smoothing could be achieved by taking: $$W(i) = (|r(i)|^{2-p}+\varepsilon)^{-1}\;,$$ but this choice would not correspond exactly to the same minimization problem. An intermediate choice would be to ensure only the continuity of the objective function ${\cal N}_{mix}$ by taking: $${\cal N}_{mix}(x) = {1\over 2}\sum_{i/|r(i)|\leq\varepsilon}(|r(i)|^2/\varepsilon + \varepsilon) + \sum_{i/|r(i)|>\varepsilon}|r(i)| \;.$$ The {\it mixed} function of problem~(\ref{final}) could handle two kinds of noise at the same time: spiky high-amplitude noise with the $L^1$ norm, and gaussian noise with the $L^2$ norm and the damping factor $\alpha$.

\mhead{CONCLUSIONS}
$L^1$ deconvolution with the IRLS algorithm is especially efficient and robust in the presence of high-amplitude noise bursts. In the general case, its residuals are a sharpened version of the $L^2$ residuals, as in predictive deconvolution: the largest amplitudes are increased, the smallest are decreased, and thus the contrast between small and large events is heightened. However, we cannot expect much improvement in the estimation of the phase of the seismic wavelet. Finally, this algorithm could be used for minimizations with mixed $L^1-L^2$ norms, and would thus cope with the simultaneous presence of different kinds of noise.
\mhead{ACKNOWLEDGEMENTS}
I would like to thank Clement Kostov and Bill Harlan for many fruitful discussions and suggestions about deconvolution processes. I am also grateful to SNEA(P) for supporting my study at Stanford.

\REF
\reference{Byrd, R.A., Payne, D.A., 1979, Convergence of the IRLS algorithm for robust regression: Technical Report 313, Johns Hopkins University (unpublished).}
\reference{Claerbout, J.F., Muir, F., 1973, Robust modeling with erratic data: Geophysics, {\bf 38}, pp. 826-844.}
\reference{Ekstrom, M.A., 1973, A spectral characterization of the ill-conditioning in numerical deconvolution: IEEE Trans. Audio Electroacoust., vol. AU-{\bf 21}, pp. 344-348.}
\reference{Gersztenkorn, A., Bednar, J.B., Lines, L.R., 1986, Robust iterative inversion for the one-dimensional acoustic wave equation: Geophysics, {\bf 51}, pp. 357-368.}
\reference{Gray, W.C., 1979, Variable norm deconvolution: SEP {\bf 19}, Ph.D. Thesis, Stanford University.}
\reference{Huber, P.J., 1981, Robust statistics: John Wiley \& Sons, New York.}
\reference{Milinazzo, F., Zala, C., Barrodale, I., 1987, On the rate of growth of condition numbers for convolution matrices: IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-{\bf 35}, pp. 471-475.}
\reference{Scales, J.A., Gersztenkorn, A., Treitel, S., 1988, Fast Lp solution of large, sparse, linear systems - Application to seismic travel time tomography: Journal of Computational Physics, {\bf 75}, pp. 314-333.}
\reference{Taylor, H.L., Banks, S.C., McCoy, J.F., 1979, Deconvolution with the L1 norm: Geophysics, {\bf 44}, pp. 39-52.}
\reference{Yarlagadda, R., Bednar, J.B., Watt, T.L., 1985, Fast algorithms for Lp deconvolution: IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-{\bf 33}, pp. 174-182.}