Boyer-Moore string search algorithm
|
The Boyer-Moore string search algorithm is a particularly efficient string searching algorithm. It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string that is being searched for, but not the string being searched (unlike some algorithms which preprocess the string to be searched, and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm can actually be sub-linear, it doesn't need to actually check every character of the string to be searched but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that, with each unsuccessful attempt to find a match between the search string and the text it's searching in, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string could not match.
How the algorithm works
What people frequently find surprising about the Boyer-Moore algorithm when they first encounter it is that its verifications -- its attempts to check whether a match exists at a particular position -- work backwards. If it starts a search at the beginning of a text for the word "WIKIPEDIA", for instance, it checks the ninth position of the text to see if it contains an "A". If it finds the "A", it moves to the eighth position to see if that contains the last "I" of the word, and so on until it checks the first position of the text for a "W".
Why Boyer-Moore takes this backward approach is clearer when we consider what happens if the verification fails -- for instance, if instead of an "A" in the ninth position, we find an "X". The "X" doesn't appear anywhere in "WIKIPEDIA", and this means there is no match for the search string at the very start of the text -- or at the next eight positions following it, since those would all fall across the "X" as well. After checking just one character, we're able to skip ahead and start looking for a match starting at the tenth position of the text, just after the "X".
This explains why the best-case performance of the algorithm, for a text of length N and a fixed pattern of length M, is N/M: in the best case, only one in M characters needs to be checked. This also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will be usually able to find it.
The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the identity of the character that caused the match attempt to fail; the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed. (Because these two tables return results indicating how far ahead in the text to "jump", they are sometimes called "jump tables", which should not be confused with the more common meaning of jump tables in computer science.)
The first table is easy to calculate: Start at the last character of the search string with a count of 0, and move towards the first character; each time you move left, increment the count by 1, and if the character you are on is not in the table already, add it along with the current count. All other characters receive a count equal to the length of the search string.
Example: For the string WIKIPEDIA, the first table would be as shown (for clarity, entries are shown in the order they would be added to the table):
Character | Shift |
---|---|
I | 1 |
D | 2 |
E | 3 |
P | 4 |
K | 6 |
W | 8 |
all other characters | 9 |
Readers may question why we didn't add the "A", the character at the very end of the search string. The reason is that the algorithm uses this table to look up a character which causes a mis-match, and the table tells us how many positions ahead in the text to we have to shift before that character could legitimately match in the string; for instance, if we checked the ninth position of the text looking for an "A" and discovered an "I" instead, it would indicate that our next possible match can be found by shifting ahead one, and checking the tenth position of the text for the "A". If we discover an "A", however, we are either discovering it in the last position, in which case it is not a mis-match, or after we have already checked the last position. In the latter case, there is nowhere in the rest of the string the "A" can match, and so no matches are possible until we are entirely past that "A".
The second table is slightly more difficult to calculate: for each value of N less than the length of the search string, we must first calculate the pattern consisting of the last N characters of the search string, preceded by a mis-match for the character before it; then we must find the least number of characters that partial pattern can be shifted left before the two patterns do not mismatch. For instance, for the search string ANPANMAN, the table would be as follows:
N | Pattern | Shift |
---|---|---|
0 | 1 | |
1 | 8 | |
2 | 3 | |
3 | 6 | |
4 | 6 | |
5 | 6 | |
6 | 6 | |
7 | 6 |
It may be easier to see how we derived the figures in the "shift" column if we look at the following table, which actually shows each partial pattern shifted as many places to the left as needed to match against the search string. (The character which must not match is shown here as the lower-case version of that character; thus the pattern "mAN", for instance, can be read as "The string 'AN', preceded by any character except an 'M'".)
ANPANMAN -------- n 1 aN 8 mAN 3 nMAN 6 aNMAN 6 pANMAN 6 nPANMAN 6 aNPANMAN 6
The worst-case performance of the algorithm to find all matches is approximately N*M. This worst case is hit when the string to be searched consists of repetitions of a single character, and the target string consists of M-1 repetitions of that character preceded by a single instance of a different character. In this scenario, the algorithm must check N-M+1 different offsets in the text for a match, and each such check takes M comparisons.
The worst-case to find whether there is a match needs approximately 3*N comparisons. The proof is due to Richard Cole, see R. COLE, Tight bounds on the complexity of the Boyer-Moore algorithm, Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms, (1991) for details.
External link
- Animation of the Boyer-Moore algorithm (http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM1.html)
- An example of the Boyer-Moore algorithm (http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html) - from the homepage of J Strother Moore, co-inventor of the algorithm.