Design,
Code,
and HCI

Burrows-Wheeler Data Compression

Java
Spring 2018

Overview

The Burrows-Wheeler data compression algorithm forms the basis of Unix compression utility bzip2. It outcompresses gzip and PKZIP and is not protected by any patents. The algorithm comprises of three algorithmic components, which are applied in succession:

Burrows–Wheeler transform: transform a typical English text file into a text file in which sequences of the same character occur near each other many times.

Move-to-front encoding: converts a text file in which sequences of the same character occur near each other many times into a text file in which certain characters appear more frequently than others. This order makes for more efficient compression.

Huffman compression: compress a text file in which certain characters appear more frequently than others by encoding frequently occurring characters with short codewords and infrequently occurring characters with long codewords.

I implemented the Burrows-Wheeler data compression algorithm by using the Huffman algorithm provided by Algorithms, 4th Edition by Sedgewick and Wayne and writing the move-to-front encoding/decoding as well as Burrows-Wheeler transform and inverse transform components.

Approach

To implement move-to-front encoding and decoding, I kept an ordered sequence of the 256 extended ASCII characters in an array, where the Ith index in the array corresponded to the Ith extended ASCII character. Next, I read each 8-bit character from the input string one at a time. For encoding, the index corresponding to the read character was written to output, and the character was moved to the front of the sequence. For decoding, the character was casted to its integer representation, and character indexed at that integer was written to output while the character itself was moved to the front of the sequence.

Next, to implement the Burrows-Wheeler transform, I created a circular array that is an abstraction of a sorted array of n circular prefixes of a string of length n (it’s an abstraction to avoid storing all n sorted prefixes to optimize memory and performance). The circular array was then used as the main mechanism of the Burrows-Wheeler transform to write the correct characters to output using the results of move-to-front encoding.

I then implemented an inverse Burrows-Wheeler transform, which was a little trickier and more interesting to think about. Since I had the sorted prefix array I created earlier, I can map the rows in which the non-sorted prefixes appear to the sorted rows. Once I figured out these mappings, it was possible to establish a connection between the transformed and original strings. Upon storing the mappings and then using key-indexed counting to sort the mappings, I obtained an array of characters that matched the original string as a by-product of the sorting operation.

Conclusion

After running some timing tests and comparing the compression times and ratios of gzip and Burrows-Wheeler, I concluded that generally speaking, the encoding and decoding time in gzip is much faster but the compression ratio is inferior to Burrows-Wheeler on very small texts and very large texts. For anything in the middle, the compression ratio between the two algorithms are comparable.

This assignment makes use of standard libraries from the course. All code from this assignment is currently stored on a private Github repository due to course policies and may be available upon special request.