Range encoding
|
Range encoding is a data compression method that is believed to approach the compression ratio of arithmetic coding, without the patent issues of arithmetic coding. Like arithmetic coding, range encoding conceptually encodes all the symbols of the message into one number, unlike Huffman coding which assigns each symbol a bit-pattern and concatenates all the bit-patterns together. Thus range encoding can achieve greater compression ratios than the one-bit-per-symbol upper bound on Huffman encoding and it does not suffer the inefficiencies that Huffman does when dealing with probabilities that are not exact powers of two.
The central concept behind range encoding is this: given a large-enough range of integers, and a probability estimation for the symbols, the initial range can easily be divided into sub-ranges whose sizes are proportional to the probability of the symbol they represent. Each symbol of the message can then be encoded in turn, by reducing the current range down to just that sub-range which corresponds to the next symbol to be encoded. The decoder must have the same probability estimation the encoder used, which can either be sent in advance, derived from already transferred data or be part of the compressor and decompressor.
When all symbols have been encoded, merely identifying the sub-range is enough to communicate the entire message (presuming of course that the decoder is somehow notified when it has extracted the entire message). A single integer is actually sufficient to identify the sub-range, and it may not even be necessary to transmit the entire integer; if there is a sequence of digits such that every integer beginning with that prefix falls within the sub-range, then the prefix alone is all that's needed to identify the sub-range and thus transmit the message.
Example
Suppose we want to encode the message "AABA<EOM>", where <EOM> is the end-of-message symbol. For this example it is assumed that the decoder knows that we intend to use base 10 number system, an initial range of [0000000000, 1000000000) and the frequency algorithm {A: .60; B: .20; <EOM>: .20}. The first symbol breaks down the range [0000000000, 1000000000) into three subranges:
A: [0000000000, 0600000000) B: [0600000000, 0800000000) <EOM>: [0800000000, 1000000000)
Since our first symbol is an A, it reduces our initial range down to [0000000000, 0600000000). The second symbol choice leaves us with three sub-ranges of this range, we show them following the already-encoded 'A':
AA: [0000000000, 0360000000) AB: [0360000000, 0480000000) A<EOM>: [0480000000, 0600000000)
With two symbols encoded, our range is now [0000000000, 0360000000) and our third symbols leads to the following choices:
AAA: [0000000000, 0216000000) AAB: [0216000000, 0288000000) AA<EOM>: [0288000000, 0360000000)
This time it is the second of our three choices that represent the message we want to encode, and our range becomes [0216000000, 0288000000). It looks harder to determine our sub-ranges in this case, but it is actually not: we can merely subtract the lower bound from the upper bound to determine that there are 72,000,000 numbers in our range; that the first 43,200,000 of them represent .60 of the total, the next 14,400,000 represent the next .20, and the remaining 14,400,000 represent the remaining .20 of the total. Adding back the lower bound gives us our ranges:
AABA: [0216000000, 0259200000) AABB: [0259200000, 0273600000) AAB<EOM>: [0273600000, 0288000000)
Finally, with our range narrowed down to [0216000000, 0259200000), we have just one more symbol to encode. Using the same technique as before for dividing up the range between the lower and upper bound, we find the three sub-ranges are:
AABAA: [0216000000, 0241920000) AABAB: [0241920000, 0250560000) AABA<EOM>: [0250560000, 0259200000)
And since <EOM> is our final symbol, our final range is [250560000, 259200000). The leading zero is omitted, since 1,000,000,000 was the excluded upper bound of our initial range. Because all nine-digit integers starting with "251" fall within our final range, it is one of the three-digit prefixes we could transmit that would unambiguously convey our original message. (The fact that there are actually eight such prefixes in all implies we still have inefficiencies. They have been introduced by our use of base 10 rather than base 2.)
The central problem may appear to be selecting an initial range large enough that no matter how many symbols we have to encode, we will always have a current range large enough to divide into non-zero sub-ranges. In practice, however, this is not a problem, because instead of starting with a very large range and gradually narrowing it down, the encoder works with a smaller range of numbers at any given time. After some number of digits have been encoded, the leftmost digits will not change. In the example after encoding just three symbols, we already knew that our final result would start with "2". Like arithmetic coding, more digits are shifted in on the right as digits on the left are sent off. Arithmetic coding can be thought of as a form of range encoding with the range starting at zero and extending to one.
See also
External links
- Range Encoder (http://www.compressconsult.com/rangecoder/)
- "Range coder" by Arturo Campos (http://www.arturocampos.com/ac_range.html)