Human DNA consists of some 30,000 genes which are in turn composed of 20 amino acids represented by letters of a reduced alphabet (ADCEFGHILKMNPQRSTVWY). The total genome is composed of about 3 billion chemical base pairs, or about 100,000 per gene. Finding where a particular string of amino acids fits is an optimization problem that aims to find the optimal alignment of the two strings with respect to a defined set of rules and parameter values for comparing different alignments.
The algorithm is an iterative method in which all possible pairs of amino acids (one from each string) are set up in a 2D matrix and alignments are represented as pathways through this array. The optimum alignment is the path (or paths) connecting maximum scoring values. This approach is an example of dynamic programming, which has also been applied to seismic modeling Darby and Neidell (1966) and travel time computation Schneider et al. (1992).
It is a global optimization process which yields a solution to the problem of pairwise alignment, meaning that we are interested in finding the best fit between only two strings. If alignment of more than two strings is of interest, the problem can be broken down into a cascade of pairwise alignments and thus solved.
In its simplest form, the Needleman-Wunsch algorithm can be summarized by Figure 1. A matrix is formed by placing the two strings, possibly of different length, along the left column and top row. In this step a one is allocated to a cell in the matrix if the letter in each list at this location is the same. Otherwise no entry is made (which is a defacto zero). It is at this stage that the letter-alignment problem becomes purely numerical. In fact, the original string could just as easily consist of integers as letters. The result of this process is the similarity matrix in Figure 1a.
From the similarity matrix a scoring matrix is formed beginning in the lower right corner. The procedure is to add the score value to the maximum value in a row-column pair whose upper left corner is down and to the right of the current working position. Thus in Figure 1b the similarity value 1 is added to the maximum value in the blackened cells (also 1) to give a score of 2. Figure 1c is a later stage of the computation, which continues up and to the left until every cell has been visited and the scoring matrix is complete, Figure 1d. In this simple form, a final score corresponds to how many character matches exist in the optimum alignment.
The final step (traceback) operates by starting at the highest score value (8 in this case) and determining the maximum score path by moving to the right, down, or diagonally down and to the right, Figure 1e. The fact that more than one 8 score alignment exists (Figure 1f) is an expression of non-uniqueness. An important aspect of the solution is that in the process of finding the best global alignment, we also find the best alignments of any sublength.