Knuth-Morris-Pratt algorithm
|
The Knuth-Morris-Pratt string searching algorithm searches for occurrences of a "pattern" string <math>P<math> within a main string <math>S<math> by employing the simple observation that when a mismatch occurs, we have enough knowledge simply by possessing the pattern to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was invented by Knuth and Pratt and independently by J. H. Morris in 1976.
Contents |
1.1 The search algorithm |
The KMP algorithm
To motivate the technical details, we consider a specific instance. In this article, we will be using zero-based arrays for our strings; thus the 'C' in <math>P<math> will be denoted <math>P[2]<math>. Let <math>m<math> be the start of the currently matched substring within <math>S<math> and <math>i<math> the position within <math>P<math>.
m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE P: ABCDABD i: 0123456
We begin a step by matching each character one after another. Thus in the fourth step, <math>m = 0<math> and <math>i = 3<math>. But <math>S[3]<math> is a space and <math>P[3] = 'D'<math>, so we have a mismatch. Rather than beginning again at <math>m = 1<math>, we note that no 'A' occurs between positions 0 and 3 in <math>P<math> except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting <math>m = 4<math> and <math>i = 0<math>.
m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE P: ABCDABD i: 0123456
We quickly obtain a nearly complete match 'ABCDAB' when, at <math>i = 6<math>, we again have a discrepancy. However, just prior to the end of the current partial match, we passed an 'AB' which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset <math>m = 8<math>, <math>i = 2<math> and continue matching the current character.
m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE P: ABCDABD i: 0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we increment <math>m = 11<math>, reset <math>i = 0<math>, and begin searching anew.
m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE P: ABCDABD i: 0123456
Once again we immediately hit upon a match 'ABCDAB' but the next character, 'C', does not match the final character 'D' of the pattern. Reasoning as before, we set <math>m = 15<math>, to start at the two-character string 'AB' leading up to the current position, set <math>i = 2<math>, and continue matching from the current position.
m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE P: ABCDABD i: 0123456
This time we are able to complete the match, and return the position 15 as its origin.
The search algorithm
The above example is completely instructive in this regard. We assume the existence of a "partial match" table, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. For the moment the table, <math>T<math>, should be taken as a black box with the property that if we have a match starting at <math>S[m]<math> that fails when comparing <math>S[m + i]<math> to <math>P[i]<math>, then the next possible match will start at <math>m + i - T[i - 1]<math>. In particular, <math>T[-1]<math> is defined and equals <math>-1<math>. Knowing this, the algorithm is very simple:
- Let <math>i = m = 0<math>; say <math>P<math> has <math>n<math> characters, and <math>S<math>, <math>l<math> characters.
- If <math>m + i = l<math>, then we exit; there is no match. Otherwise, compare <math>P[i]<math> versus <math>S[m + i]<math>.
- If they are equal, set <math>i = i + 1<math>. If <math>i = n<math> then we have found a full match; terminate the algorithm and return <math>m<math> as the start of the match.
- If they are unequal, let <math>e = T[i - 1]<math>. Set <math>m = m + i - e<math> and if <math>i > 0<math>, set <math>i = e<math>.
- Return to step 2.
This clearly implements the algorithm performed in the example; at each mismatch, we consult the table for the next known possible beginning of a new match and update our counters accordingly. Thus we need never backtrack; in particular, each character is matched only once (though potentially mismatched many times; an analysis of the efficiency of this algorithm is given below).
A code example of the search algorithm
Sample C code for an implementation of this algorithm is given below. In order to satisfy the natural limitations on the bounds of C arrays, <math>T[i]<math> in the code is equivalent to <math>T[i + 1]<math> above.
int kmp_search(char *P, char *S) { extern int T[]; int m = 0; int i = 0; while (S[m + i] != '\0' && P[i] != '\0') { if (S[m + i] == P[i]) { ++i; } else { m += i - T[i]; if (i > 0) i = T[i]; } } if (P[i] == '\0') { return m; } else { return m + i; } }
The efficiency of the search algorithm
Assuming the prior existence of the table <math>T<math>, the search portion of the Knuth-Morris-Pratt algorithm has complexity O<math>(l)<math>, where <math>l<math> is the length of <math>S<math>. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the central loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature of <math>T<math>. By definition it is constructed so that if a match which had begun at <math>S[m]<math> fails while comparing <math>S[m + i]<math> to <math>P[i]<math>, then the next possible match must begin at <math>S[m + (i - T[i])]<math>. In particular the next possible match must occur at a higher index than <math>m<math>, so that <math>T[i] < i<math>.
Using this fact, we will show that the loop can execute at most <math>l<math> times. For in each iteration, it executes one of the two branches of the if statement. The first branch invariably increases <math>i<math> and does not change <math>m<math>, so that the index <math>m + i<math> of the currently scrutinized character of <math>S<math> is increased. The second branch adds <math>i - T[i]<math> to <math>m<math>, and as we have seen, this is always a positive number. Thus the location <math>m<math> of the beginning of the current potential match is increased. Now, the loop ends if <math>S[m + i] = '\backslash 0'<math>, which following the C convention that a null character denotes the end-of-string, means that <math>m + i = l<math>. Therefore each branch of the if statement can be reached at most <math>l<math> times, since they respectively increase either <math>m + i<math> or <math>m<math>, and <math>m \leq m + i<math> so that if <math>m = l<math>, then certainly <math>m + i \geq l<math>, so that since it increases by unit increments at most, we must have had <math>m + i = l<math> at some point in the past.
Thus the loop executes at most <math>2l<math> times, showing that the time complexity of the search algorithm is <math>O(l)<math>.
The "partial match" table
The goal of the table is to allow the algorithm not to check any character of the main string more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
We want to be able to look up, for each pattern position, the length of the longest possible initial match that ends in the current position other than the full match that probably just failed. Hence <math>T[i]<math> is exactly the length of the longest possible initial segment of <math>P<math> which is also a terminal segment of the substring ending at <math>P[i]<math>. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set <math>T[-1] = -1<math>, as discussed above.
We consider the example of 'ABCDABD' first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set <math>T[-1] = 0<math>. Since <math>P[0]<math> is at the end of only the complete initial segment, we set <math>T[0] = 0<math> as well. To find <math>T[1]<math>, we must discover a proper terminal substring of 'AB' which is also an initial substring of the pattern. But the only proper terminal substring of 'AB' is 'B', which is not an initial substring of the pattern, so we set <math>T[1] = 0<math>.
Continuing to 'C', we note that there is a shortcut to checking all terminal substrings: let us say that we discovered a terminal substring ending at 'C' with length 2; then its first character is an initial substring of an initial substring of the pattern, hence an initial substring itself...and it ends at 'B', which we already determined cannot occur. Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so <math>T[2] = 0<math>.
Similarly for 'D', giving <math>T[3] = 0<math>, so we pass to the subsequent 'A'. The same logic shows that the longest substring we need consider has length 1, and in this case 'A' does work; thus <math>T[4] = 1<math>. Considering now the following 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous 'A' yet continuing to the current 'B', then in particular it would itself have a proper initial segment ending at 'A' yet beginning before 'A', which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper subpattern ending at it. Therefore we need not look before that 'A' to find a subpattern for 'B'. In fact, checking it, we find that it continues to 'B' and that 'B' is the second entry of the substring of which 'A' is the first. Therefore the entry for 'B' in 'T' is one more than the entry for 'A', namely <math>T[5] = 2<math>.
Finally, the pattern does not continue from 'B' to 'D'. Similar reasoning as above shows that if we were to find a subpattern longer than 1 at 'D' then it must comprise a subpattern ending at 'B'; since the current one does not work, this one would have to be shorter. But the current one is an initial segment of the pattern ending at the second position, so this potential new subpattern would be as well, and we already decided there were none of those. Since 'D' itself is not a subpattern, <math>T[6] = 0<math>.
Therefore we compile the following table:
<math>i<math> | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|---|
<math>P[i]<math> | A | B | C | D | A | B | D | |
<math>T[i]<math> | -1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
The table-building algorithm
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. Here follows the algorithm; to eliminate special cases we will use the convention that <math>P[-1]<math> is defined and its value is unequal to any possible character in <math>P<math>.
- Set <math>T[-1] = -1<math>, and let <math>P<math> have <math>n<math> characters.
- Set <math>i = 0<math>, <math>j = T[i - 1]<math>.
- If <math>i = n<math> then we are done; terminate the algorithm. Otherwise, compare <math>P[i]<math> with <math>P[j]<math>.
- If they are equal, set <math>T[i] = j + 1<math>, <math>j = j + 1<math>, and <math>i = i + 1<math>.
- If not, and if <math>j > 0<math>, set <math>j = T[j - 1]<math>.
- Otherwise, set <math>T[i] = 0<math>, <math>i = i + 1<math>, <math>j = 0<math>.
- Return to step 3.
A code example of the table-building algorithm
A C code example for the table-building algorithm is given here. As for the search algorithm, the bounds of <math>T<math> have been incremented by 1 to make the C code more natural. The additional variable <math>c<math> is employed to simulate the existence of <math>P[-1]<math>. It is also assumed that both this routine and the search routine are called as subroutines of a wrapper which correctly allocates memory for <math>T<math>.
void kmp_table(char *P) { extern int T[]; int i = 0; int j = -1; char c = '\0'; T[0] = j; while (P[i] != '\0') { if (P[i] == c) { T[i + 1] = j + 1; ++j; ++i; } else if (j > 0) { j = T[j]; } else { T[i + 1] = 0; ++i; j = 0; } c = P[j]; } return; }
The efficiency of the table-building algorithm
The complexity of the table algorithm is <math>O(n)<math>, where <math>n<math> is the length of <math>P<math>. As except for some initialization all the work is done in step 3, it is sufficient to show that step 3 executes in <math>O(n)<math> time, which will be done by simultaneously examining the quantities <math>i<math> and <math>i - j<math>. In the first branch, <math>i - j<math> is preserved, as both <math>i<math> and <math>j<math> are incremented simultaneously, but naturally, <math>i<math> is increased. In the second branch, <math>j<math> is replaced by <math>T[j - 1]<math>, which we saw above is always strictly less than <math>j<math>, thus increasing <math>i - j<math>. In the third branch, <math>i<math> is incremented and <math>j<math> is not, so both <math>i<math> and <math>i - j<math> increase. Since <math>i \geq i - j<math>, this means that at each stage either <math>i<math> or a lower bound for <math>i<math> increases; therefore since the algorithm terminates once <math>i = n<math>, it must terminate after at most <math>n<math> iterations of step 3, since <math>i - j<math> begins at <math>1<math>. Therefore the complexity of the table algorithm is <math>O(n)<math>.
The efficiency of the KMP algorithm
Since the two portions of the algorithm have, respectively, complexities of <math>O(l)<math> and <math>O(n)<math>, the complexity of the overall algorithm is O<math>(n + l)<math>.
External links
- An explanation of the algorithm (http://www.ics.uci.edu/~eppstein/161/960227.html)es:Algoritmo Knuth-Morris-Pratt