Arithmetic Coding and Huffman Coding in MATLAB.

Introduction

This page contains MatLab functions, m-files, which do Huffman coding and arithmetic coding of integer (symbol) sequences. Complete coding may be done by calling an easy to use main program (or main function), where input argument is the sequences you want to compress and the output is the compressed bit stream, as a vector of bytes. Decoding is done just by switching the arguments. These main programs are: the Huffman coder, Huff06, and two versions of the arithmetic coder, Arith06.m and Arith07.m.

My interest for lossless coding started when I took a course in Information Theory which included the principles both for Huffman coding and arithmetic coding. My first projects as a Ph.D. student was in signal representation and signal compression. Most of this work was done in MatLab, so I searched for an entropy coder, i.e. lossless coding of a symbol (integer) sequence, that was easy to use from MatLab. What I found was mostly programs written in C, so I tried to make a Huffman Coder as a function (m-file) in MatLab. The function (m-file) was especially made for compression of quantized coefficient values or for values after run-length coding. It is assumed that the histogram of the values, i.e. probabilities, is symmetric about zero (or only non-negative numbers), and that probability decrease as distance from zero increase. The program works ok also if this it not the case, but the compression ratio is not as good as it will be if the sequences are as expected. For example this Huffman coder is not the best to use if you want to compress a text file, where the symbols are the ASCII-codes (bytes), or a general file where you just have a sequence of bytes.

An entropy coder usually only exploits the symbol probabilities independent of previous symbols, this is optimal for uncorrelated sequences. For signal compression a decorrelation process usually precede the entropy coding, but often the decorrelation is not perfect. The Huffman coder was made such that it can exploit some of these remaining dependencies; this was done by manipulating the input sequence. The Huffman coder worked quite well, and we wrote an article on it. The article, "Improved Huffman coding using recursive splitting", norsig99.pdf (172 kB), was presented at the NORSIG-99 conference 9-11 sep. 1999. At the end of this page the abstract of the article is included.

Since then I have updated the program several times, the discovered errors have been corrected, the interface (arguments passed and returned) were changed. Now almost any kind of integer sequences can be compressed, but it will still work best if the initial assumptions are met or nearly met. I have also made an arithmetic coder in MatLab, two versions of this are also available on this page.

MATLAB files

comp.zip, 37 kB. This file contains the Huffman coder, Huff06, and two versions of the arithmetic coder, Arith06.m and Arith07.m. A description of the m-files is in comp.pdf, 54 kB.
ecg.zip, 192 kB. This file contains a part of the MIT100 ECG signal. This is the part of the signal that was used in the project.

Abstract of NORSIG 1999 article.

Lossless compression of a sequence of symbols is an important part of data and signal compression. Huffman coding is lossless, it is also often used in lossy compression as the final step after decomposition and quantization of a signal. In signal compression, the decomposition/quantization part seldom manages to produce a sequence of completely independent symbols. Here we present a scheme giving better results than straightforward Huffman coding by utilizing this fact. We split the original symbol sequence into two sequences in such a way that the symbol statistics are, hopefully, different for the two sequences. Individual Huffman coding for each of these sequences will reduce the average bit rate. This split is done recursively for each sub-sequence until the cost associated with the split is larger than the gain.

Experiments were done on different signals. They were decomposed with the Discrete Cosine Transform, quantized, and End Of Block coded, to give the input symbol sequence. The results using the split scheme was a bit rate reduction of usually more than 10% compared to straightforward Huffman coding, and 0-15% better than JPEG-like Huffman coding, best at low bit rates.

This WWW page is maintained by Karl Skretting at University of Stavanger.
Last update: Oct. 22. 2010.