Talk:Burrows-Wheeler transform
|
The text says that EOF is ignored when sorting, but the example seems to suggest that EOF is considered to come after all normal letters. AxelBoldt 14:14 Aug 26, 2002 (PDT)
The dot wasn't an EOF. It was just a dot that was there to make the rotations stand out visually. Hopefully it's clearer now. --LC 08:06 Aug 27, 2002 (PDT)
Obviously sufficiently many repeats of the transform must eventually restore the original text, and obviously that provides another (usually inefficient?) way of inverting a transform (just store the inputs temporarily, and give back the last input before the text to be inverted turns up again). But one interesting thing is this: since the original is itself the result of a transform on some text, clearly the transform does not necessarily always produce something better suited for compression. It must be something that is statistically likely, based on properties that are common in many inputs. What are these, and what are the odds that a transform won't actually help and might even hurt? If we do get one of these cases, how can we tell (cheaply)? And so on. PML.
How can repeating the transform restore the original text? The transform is a not a bijection between strings of length n. It is a one-to-one function from strings of length n into the set of ordered pairs of a string of length n and an integer less than n. Thus, different inputs can produce the same output string with a different index. For example .BANANA. and ANA..BAN produce the same output string.
Bad example
This article gives a really bad example for this algorithm. Take the text:
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
and replace all repeated substrings (as deflate would detect them) with a special control character which I'll represent as _
:
SIX.M_ED.P_IES._FT_XTY_.DUS_BOX_
Now take the transformed text:
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
and do the same:
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Clearly, the original string compresses better. And indeed gzip and bzip2 confirm this: gzip compresses the above to 67 bytes, bzip2 to 69 bytes. (Both can hardly be called compression: the original was only 44 bytes ...) — Timwi 14:14, 9 Nov 2004 (UTC)
That is unfortunately the curse of writing about data compression: in a substantial number of cases, it can be almost impossible to find an example small enough to be practical, and clear enough to show the principle involved, but large enough so that the savings achieved with the method can be immediately seen to be significant. Add in a requirement about 'no other method can do this example better' and I am afraid you're asking for the impossible. IMHO, it's much better to use an example that shows the principle clearly and note in the text if needed that 'in this example, the saving from the run-length encoding doesn't overcome the overhead of the encoding; in a larger example which produced longer runs, it would.' After all, generations of textbooks have used recursive Fibonacci algorithms to demonstrate recursion when that's not the most efficient way to calculate Fibonacci numbers.
By the way, have you noticed that your deflate example is incorrect? Deflate operates on strings of minimum length 3. Even with your simplification of all replaced substrings taking as much space as one literal, here's how it would actually look:
SIX.MIXED.PIXIES.SIFT._TY_.DUST.BOXES
vs.
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Not a stellar improvement. -- Antaeus Feldspar 15:11, 9 Nov 2004 (UTC)
- It doesn't improve deflate, but I think it does improve RLE. My impression was that the point is to create runs. Deco 20:42, 9 Nov 2004 (UTC)
- Timwi's point was that in the example given, deflate outperforms BWT. However, his comparison was flawed because he showed deflate getting LZ77 compression from repetition of two-byte sequences. It doesn't; BWT gets compression from repetition of two-byte sequences, but deflate's implementation of LZ77 compression doesn't work on sequences less than three bytes. -- Antaeus Feldspar 02:17, 10 Nov 2004 (UTC)
- If not deflate, then what does bzip2 apply after the BWT? — Timwi 11:31, 10 Nov 2004 (UTC)
- To the best of my knowledge, it used to apply RLE and then arithmetic coding, but because of patent problems, it switched to RLE and then Huffman coding. It wouldn't make sense to apply deflate after the BWT, because while deflate can get compression from repeated byte sequences, long runs of the same character, and relative frequency of some characters over others, the BWT is going to remove almost every instance of repeated byte sequences by re-ordering the characters so that they lengthen runs, instead. Deflate would still be able to work on the runs, but it wouldn't do so as effectively as an encoding step that works only on runs.
- It may help to remember that lossless compression can't create compression from nothing; it can only remove redundancy that already exists. The following table shows which compression methods exploit which kind of redundancy. (Remember that Deflate is LZ77 coding followed by Huffman coding.)
LZ77 | BWT | RLE | Huffman/Arithmetic | |
---|---|---|---|---|
repeated multi-byte sequences | yes, at cost of two parameters (offset, length) | converts into single-character runs | no | no |
single-character runs | yes, at cost of two parameters (offset, length) | no | yes, at cost of single parameter (length of run) | no |
non-flat frequency distribution of characters | no | no | no | yes, exploits optimally |
- As you see, BWT destroys one of the two kinds of redundancy LZ77 can work on, and converts it to the other kind; LZ77 can still work on it, but at twice the cost of RLE. So it wouldn't make sense to use deflate after BWT instead of RLE and Huffman. -- Antaeus Feldspar 16:55, 10 Nov 2004 (UTC)
C implementation
Is there any reason that the C implementation given in the Polish Wikipedia shouldn't be included here? --Shutranm 21:47, 20 May 2005 (UTC)