Hash function
|
A hash function is a function that converts an input from a (typically) large domain into an output in a (typically) smaller range (the hash value, often a subset of the integers). Hash functions vary in the domain of their inputs and the range of their outputs and in how patterns and similarities of input data affect output data. Hash functions are used in hash tables, cryptography and data processing. A good hash function is one that experiences few hash collisions in the expected domain of values it will have to deal with; that is, it would be possible to uniquely identify most of these values using this hash.
Formally speaking, a hash function is defined by its domain (typically variable-length strings of bytes), by its range (typically fixed length bit-sequences) and by the defining function (call this function H). The desired characteristic of a hash function in general is that H(x) = H(y) probably implies that x = y. Unfortunately, the use of probably here is precise though indeterminate. If the set of all possible values of H(x) is much smaller than the set of all possible values of x, then this latter condition cannot be true if all sequences are equally likely. Thus, the second condition is often revised in different ways to make it plausible, or workable. For instance, cryptographic hash functions use the second condition in the form that given x, it is computationally very difficult to find y such that H(x) = H(y). Additionally, it is desirable that if we are given x and H(x + s) where + can be bit changing or concatenation, then we can't find s short of exhaustive enumeration.
If, given a message x, it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. On the other hand, a strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).
In practice, for most applications other than error correction, cryptographic hashes serve very nicely. The MD5 and SHA-1 algorithms are two popular but compromised algorithms for generating cryptographic hash functions; the SHA-2 algorithm has no known compromises.
Contents |
Hash tables
Hash tables, a major application for hash functions, enable fast lookup of a data record given its key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or access information.) For example, keys in an English dictionary would be English words, and their associated records would contain definitions. In this case, the hash function must map alphabetic strings to indexes for the hash table's internal array.
The generally impossible/impractical ideal for a hash table's hash function is to map each key to a unique index (see perfect hashing), because this guarantees access to each data record in the first probe into the table. Hash functions that are truly random with uniform output (including most cryptographic hash functions) yield good performance because, on average, only one or two probes will be needed (depending on the load factor). Perhaps as important is that excessive collision rates with random hash functions are highly improbable--if not computationally infeasible for an adversary. However, a small, predictable number of collisions are virtually inevitable (see birthday paradox).
In many cases, a heuristic hash function can yield many fewer collisions than a random hash function. Heuristic functions take advantage of regularities in likely sets of keys. For example, one could design a heuristic hash function such that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc. map to successive indexes of the table, meaning that such sequences will not collide. Beating a random hash function on "good" sets of keys usually means performing much worse on "bad" sets of keys, which can arise naturally--not just through attacks. Bad perfomance of a hash table's hash function means that lookup can degrade to a costly linear search.
One hash function that exhibits properties of both random hash functions and heuristic hash functions is Bob Jenkins' LOOKUP2 (http://burtleburtle.net/bob/c/lookup2.c) hash function, published in an article (http://burtleburtle.net/bob/hash/doobs.html) in Dr. Dobb's Journal. The hash function performs well as long as there is no adversary, for it is trivially reversible and useless as a cryptographic hash function.
Error correction
For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.
Audio identification
For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all identical items. Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.
Rabin-Karp string search algorithm
Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.
Origins of the term
The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later.
In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular addition.
See also
- Cryptography
- Cryptographic hash function
- HMAC
- Geometric hashing
- Message digest
- Perfect hash function
- Rabin-Karp string search algorithm
- Zobrist Hashing
References
- Robust Audio Hashing for Content Identification (http://www.extra.research.philips.com/natlab/download/audiofp/cbmi01audiohashv1.0.pdf)
External links
- Hash Functions for Hash Table Lookup (http://burtleburtle.net/bob/hash/evahash.html) by Bob Jenkins
- Hash Functions (http://www.azillionmonkeys.com/qed/hash.html) by Paul Hsieh
- What is a hash function? (http://www.rsasecurity.com/rsalabs/node.asp?id=2176) from RSA labscs:Hašovací funkce
de:Hash-Funktion es:Funcin hash fr:Fonction de hash ja:ハッシュ関数 lt:Maišos funkcija pl:Funkcja haszująca ru:Хэш-функция