The Needleman-Wunsch (NW) algorithm Needleman and Wunsch (1970) is a method that was developed for amino acid sequence alignment in proteins. This was the first of many important alignment techniques which now find application in the Human Genome Project.
Human DNA consists of some 30,000 genes which are in turn composed of 20 amino acids represented by letters of a reduced alphabet (ADCEFGHILKMNPQRSTVWY). The total genome is composed of about 3 billion letters, or 100,000 per gene. Finding where a particular string of amino acids fits on a protein is an optimization problem that aims to find the optimal alignment of two character strings with respect to a defined set of rules and parameter values for comparing different alignments.
In terms of dynamic programming, we have two strings of length n and m that we would like to align. The key observation of the algorithm, and what makes it a computationally attractive process, is that the strings can be compressed and stretched but must remain in order. As a result, when comparing the two strings we have only to evaluate three possible states: a match, a compression, or a dilation.
To implement the NW algorithm we start by constructing a `similarity' matrix of size n by m. In the genome example, the matrix is going to have binary values: either the amino acid matches (1) or it doesn't (). We then construct our score matrix. The optimality of the path is defined by our similarity measurement . At each point in the scoring matrix, we evaluate whether the optimal path is from a point below (E), a point to the right (F), or a point along the diagonal (G). Our preferred solution is to move along the diagonal (we want to give preference to finding as much similarity between the two strings as possible) so we penalize the other two options. The scoring matrix (V) will then be recursively built by selecting the maximum of E, F, and G. We can write this in terms of the following equations,
The NW algorithm has some nice properties. The cost is relatively low, n*m*3. In addition, it can handle very discontinuous traces. This can also be a disadvantage, sometimes we would like to force a greater level of continuity than the gap penalties allow.