The Needleman-Wunsch (NW) algorithm Needleman and Wunsch (1970) is a method that was developed for amino acid sequence alignment in proteins. This was the first of many important alignment techniques which now find application in the Human Genome Project.

Human DNA consists of some 30,000 genes which are in turn composed of 20 amino acids represented by letters of a reduced alphabet (ADCEFGHILKMNPQRSTVWY). The total genome is composed of about 3 billion letters, or 100,000 per gene. Finding where a particular string of amino acids fits on a protein is an optimization problem that aims to find the optimal alignment of two character strings with respect to a defined set of rules and parameter values for comparing different alignments.

In terms of dynamic programming, we have two strings of length *n* and *m*
that we would like to align.
The key observation of the algorithm, and what
makes it a computationally attractive process, is that
the strings can be compressed and stretched but must remain in order.
As a result, when comparing the two strings we have only
to evaluate three possible states: a match, a compression, or
a dilation.

To implement
the NW algorithm we start by constructing a `similarity' matrix
of size *n* by *m*.
In the genome example,
the matrix is going to have binary values:
either the amino acid matches (1) or it doesn't ().
We then construct our score matrix.
The optimality of the path is defined
by our similarity measurement . At each point in the scoring
matrix, we evaluate whether the optimal path is
from a point below (*E*), a point to the right (*F*), or a point along the diagonal
(*G*). Our preferred solution is to move along the diagonal (we want to give
preference to finding as much similarity between the two strings as possible)
so we penalize the other two options.
The scoring matrix (*V*) will then be recursively built by
selecting the maximum of *E*, *F*, and *G*. We can write this
in terms of the following equations,

(2) | ||

(3) | ||

(4) | ||

(5) |

The NW algorithm has some nice properties.
The cost is relatively low,
*n***m**3.
In addition, it can handle very discontinuous traces.
This can also be a disadvantage, sometimes we would
like to force a greater level of continuity than
the gap penalties allow.

5/23/2004