
MIXED L1-L2 NORM MINIMIZATIONS

Though the theoretical problem is an Lp norm minimization, the introduction of the parameter $\varepsilon$ changes the problem that is actually solved. In fact, the limit $x_{lim}$ of the algorithm satisfies the normal equations:
 \begin{displaymath}
A^TW_{\varepsilon}A.x_{lim} = A^TW_{\varepsilon}.y\end{displaymath} (10)
where:

r(i) = (y - A.x)(i)

 
 \begin{displaymath}
W_{\varepsilon}(i) = \left\{ \begin{array}{ll}
 \vert r(i)\vert^{p-2} & \mbox{if $\vert r(i)\vert > \varepsilon$} \\
 \varepsilon^{p-2} & \mbox{if $\vert r(i)\vert\leq \varepsilon$}
 \end{array} \right.\end{displaymath} (11)
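
As an illustration of how the thresholded weights of equation (11) enter the normal equations (10), here is a minimal numerical sketch of one IRLS iteration. The use of Python with numpy, the function name, and the inputs A, y, x, p and eps are my own choices for the illustration, not the author's code.

\begin{verbatim}
import numpy as np

def irls_step(A, y, x, p, eps):
    """One IRLS pass: build W_eps from the current residuals, then solve (10)."""
    r = y - A @ x                                # r(i) = (y - A.x)(i)
    w = np.maximum(np.abs(r), eps) ** (p - 2)    # |r|^(p-2) if |r| > eps, else eps^(p-2)
    AtWA = A.T @ (w[:, None] * A)                # A^T W_eps A
    AtWy = A.T @ (w * y)                         # A^T W_eps y
    return np.linalg.solve(AtWA, AtWy)           # next iterate
\end{verbatim}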
These equations are the normal equations of the following minimization problem:  
 \begin{displaymath}
\left\{ \begin{array}{ll}
 {\rm Min} \mbox{\hspace{0.5cm}} {\displaystyle {\varepsilon^{p-2}\over 2} \sum_{i/\vert r(i)\vert\leq\varepsilon}\vert r(i)\vert^2 \;+\; {1\over p} \sum_{i/\vert r(i)\vert>\varepsilon}\vert r(i)\vert^p } \\
 r(i) = (y - A.x)(i) \;.
 \end{array} \right.\end{displaymath} (12)

Effectively, if $\vert r(i)\vert<\varepsilon$, a small perturbation of x will maintain this inequality, because of the continuity of the matrix product. The same property holds for the other inequality, $\vert r(i)\vert>\varepsilon$. It is no longer true for the non-strict inequalities ($\leq$ or $\geq$). So, we can calculate the gradient of the objective function in equation (12) while keeping the same partition of the residuals; if I call this function ${\cal N}_{mix}$, then:
\begin{displaymath}
{\partial{\cal N}_{mix} \over \partial x_k} = -\varepsilon^{p-2} \sum_{i/\vert r(i)\vert\leq\varepsilon} A_{ik} r(i) \; - \sum_{i/\vert r(i)\vert>\varepsilon} A_{ik} r(i) \vert r(i)\vert^{p-2} \;.\end{displaymath} (13)
Finally, replacing r(i) by (y - A.x)(i), we find the normal equations (10). This result holds as long as the objective function ${\cal N}_{mix}$ is minimized at a vector x such that none of the residuals $\vert r(i)\vert$ is equal to $\varepsilon$.
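
The link between the gradient (13) and the normal equations (10) can be checked numerically. The sketch below is my own: it uses arbitrary random A, y and x with p=1 and eps=0.5, and compares the analytic gradient $-A^T W_{\varepsilon}\,r$ with centered finite differences; away from the hyperplanes $\vert r(i)\vert=\varepsilon$ the two agree.

\begin{verbatim}
import numpy as np

def n_mix(A, y, x, p, eps):
    """Mixed objective of equation (12)."""
    r = y - A @ x
    small = np.abs(r) <= eps
    return (0.5 * eps ** (p - 2) * np.sum(r[small] ** 2)
            + np.sum(np.abs(r[~small]) ** p) / p)

rng = np.random.default_rng(0)
A, y, x = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
p, eps, h = 1.0, 0.5, 1e-6

r = y - A @ x
w = np.maximum(np.abs(r), eps) ** (p - 2)          # W_eps of equation (11)
analytic = -A.T @ (w * r)                          # gradient (13)
numeric = np.array([(n_mix(A, y, x + h * e, p, eps)
                     - n_mix(A, y, x - h * e, p, eps)) / (2 * h)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True, unless some |r(i)| sits at eps
\end{verbatim}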

So, the IRLS algorithm minimizes what could be called a mixed function. But this objective function ${\cal N}_{mix}$ is not convex: combining two vectors x and x' that are not close to each other can completely change which residuals are larger or smaller than $\varepsilon$. In fact, this function is strictly convex only inside the regions bounded by the $2\,n_y$ hyperplanes (where $n_y$ is the number of data samples):

\begin{displaymath}
r(i)=\varepsilon \mbox{\hspace{1.0cm}and \hspace{1.0cm}} r(i)=-\varepsilon \;,\end{displaymath}

and it is not continuous on these hyperplanes. In Figure [*], I have plotted, as an example, the function:

\begin{displaymath}
{\cal N}_{mix}(x_1,x_2) = {1\over 4}\sum_{i/\vert x_i\vert<2}\vert x_i\vert^2 + \sum_{i/\vert x_i\vert\gt 2}\vert x_i\vert\;,\end{displaymath}

for $\vert x_i\vert<4$, obtained with p=1 and $\varepsilon=2$. This kind of function has one global minimum and no other local minimum. However, if the hyperplanes must be crossed to reach the region containing the global minimum, the computation of the gradient near these hyperplanes may be inaccurate, and the convergence may therefore suffer. This problem does not appear with the IRLS algorithm and the L1 norm, because $\varepsilon$ is small: the central region is smaller, and the discontinuity (whose jump is proportional to $\varepsilon$) becomes negligible. In fact, the fewer hyperplanes are crossed during the algorithm, the more efficient the convergence will be. Thus, it is better to have, from the beginning, as many residuals as possible smaller than $\varepsilon$; that is why the use of a first estimate of the filter enhances the convergence, and too small an $\varepsilon$ hinders it.
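
The size of the jump across a hyperplane can be checked directly on the p=1, $\varepsilon=2$ example above. The short sketch below is my own illustration; it evaluates the function just inside and just outside $\vert x_1\vert=2$ and finds a jump of $\varepsilon/2=1$.

\begin{verbatim}
import numpy as np

def n_mix_example(x):
    """(1/4) * sum of x_i^2 over |x_i| < 2  +  sum of |x_i| over |x_i| > 2."""
    x = np.asarray(x, dtype=float)
    inner, outer = np.abs(x) < 2, np.abs(x) > 2
    return 0.25 * np.sum(x[inner] ** 2) + np.sum(np.abs(x[outer]))

print(n_mix_example([2 - 1e-9, 0.0]))   # ~1.0, quadratic side of the hyperplane
print(n_mix_example([2 + 1e-9, 0.0]))   # ~2.0, linear side: jump of eps/2 = 1
\end{verbatim}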

There is in fact another reason for the slow convergence of the IRLS algorithm when $\varepsilon$ is small. A theoretical n-component L1 solution must produce at least n residuals equal to zero. But during the iterations, as soon as a residual becomes smaller than $\varepsilon$, it receives the weight $\varepsilon^{p-2}$ at the next iteration. If $\varepsilon$ is small, this weight is relatively large, and it forces the same residual to remain smaller than $\varepsilon$ at the next iteration. Thus, the algorithm tries to keep this residual small, whether or not it belongs to the set of theoretical zero residuals, and many iterations may be needed to force the algorithm to ``forget'' it.
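
A tiny numeric illustration of this effect (my own): with p=1, a residual that drops below a small $\varepsilon$ receives the weight $\varepsilon^{-1}$, far larger than the weight of an order-one residual.

\begin{verbatim}
p = 1
for eps in (1e-1, 1e-2, 1e-3):
    w_below = eps ** (p - 2)       # weight given to any residual with |r| <= eps
    w_unit = 1.0 ** (p - 2)        # weight of a residual with |r| = 1
    print(eps, w_below, w_unit)    # e.g. 0.001 1000.0 1.0
\end{verbatim}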

The objective function ${\cal N}_{mix}$ could also be of interest for a mixed L1-L2 problem if $\varepsilon$ were taken large enough, as in the minimization problem:  
 \begin{displaymath}
\left\{ \begin{array}{ll}
 {\rm Min}\mbox{\hspace{0.5cm}} \left( {\cal N}_{mix}(x) + \alpha \displaystyle\sum_{i}\vert x(i)\vert^2 \right) \\
 r(i) = (y - A.x)(i) \;.
 \end{array} \right.\end{displaymath} (14)
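
One possible IRLS step for problem (14), sketched below, simply adds the damping term to the weighted normal equations. The function, and the factor of two coming from the gradient of $\alpha\sum_i\vert x(i)\vert^2$, are my own rendering of the idea, not the author's implementation.

\begin{verbatim}
import numpy as np

def damped_mixed_irls_step(A, y, x, alpha, eps, p=1.0):
    """One IRLS pass for (14): thresholded weights plus an L2 damping on x."""
    r = y - A @ x
    w = np.maximum(np.abs(r), eps) ** (p - 2)     # weights W_eps of equation (11)
    lhs = A.T @ (w[:, None] * A) + 2.0 * alpha * np.eye(A.shape[1])
    rhs = A.T @ (w * y)
    return np.linalg.solve(lhs, rhs)              # (A^T W A + 2 alpha I) x = A^T W y
\end{verbatim}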
The convergence would be ensured by the proximity of this problem to the L2 minimization problem. The difficulties in computing the gradient on the hyperplanes could be alleviated by smoothing the objective function near them. Such a smoothing could be achieved by taking:

\begin{displaymath}
W(i) = (\vert r(i)\vert^{2-p}+\varepsilon)^{-1}\;,\end{displaymath}

but this choice would not actually correspond to the same minimization problem. An intermediate choice would be to ensure only the continuity of the objective function ${\cal N}_{mix}$ by taking:

\begin{displaymath}
{\cal N}_{mix}(x) = {1\over 2}\sum_{i/\vert r(i)\vert\leq\varepsilon} \left( {\vert r(i)\vert^2 \over \varepsilon} + \varepsilon \right) + \sum_{i/\vert r(i)\vert>\varepsilon}\vert r(i)\vert \;.\end{displaymath}
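
A quick check (my own sketch, for p=1) that this corrected objective is indeed continuous at $\vert r(i)\vert=\varepsilon$:

\begin{verbatim}
import numpy as np

def n_mix_continuous(r, eps):
    """Continuity-corrected mixed objective (p = 1) of the last display."""
    r = np.asarray(r, dtype=float)
    small, large = np.abs(r) <= eps, np.abs(r) > eps
    return 0.5 * np.sum(r[small] ** 2 / eps + eps) + np.sum(np.abs(r[large]))

eps = 0.5
print(n_mix_continuous([eps - 1e-9], eps))   # ~0.5 (quadratic side)
print(n_mix_continuous([eps + 1e-9], eps))   # ~0.5 (linear side): no jump
\end{verbatim}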

The mixed function of problem (14) could handle two kinds of noise at the same time: spiky high-amplitude noise with the L1 norm, and Gaussian noise with the L2 norm and the damping factor $\alpha$.

