[Home]
[Edit this page]
[Recent Changes]
[Special Pages]
[Help]
Art_Huffman_p1
(Computer science) Huffman Trees for Data Compression
Data Compression is one of the most renowned branches of Computer Science. Over the years, a lot of research has been done in this field, to compress data in numerous ways and many standards have been developed. Data Compression can be defined as reducing of the amount of storage space required to store a given amount of data. Data compression comes with a lot of advantages, it saves storage space, bandwidth, cost and TIME required to transmit data from one place to another. In this article we will throw light on Huffman Coding for data compression.
Table of Contents:
Huffman compression is a Lossless compression technique. By the word lossless I mean that there is no loss of information after the data is compressed i.e. after we decompress the data, we get back the original information. As opposed to this, in Lossy compression as we compress the data there may be some loss of information and we may not get the original information after the decompression process. Lossless compression is desired while compressing text documents, bank records etc. Lossy compression is used in image compression, sound compression etc.
Data encoding schemes broadly fall into two categories, fixed length encoding and variable length encoding. In fixed length encoding all the symbols are encoded using the same number of bits. Suppose we have 8 symbols to be encoded, a fixed length encoding approach will be to use three bits to encode 8 symbols ( 23= 8). An example of fixed length encoding is ASCII code which uses 7 bits to encode a total of 128 different symbols, 8th bit is sometimes used as a parity bit. Problem with fixed length codes is that they never consider the probability of occurrence of the symbols to be encoded. A symbol that occurs 1000 times will be encoded with the same number of bits as a symbol which comes only 10 times. This drawback makes fixed length encoding inefficient for data compression. Variable length encoding overcomes this problem by assigning less number of bits to symbols which come more often and more number of bits to symbols for which frequency of occurrence is less.
The Huffman Encoding scheme falls in the category of variable length encoding i.e. code for a symbol depends on the frequency of occurrence of that symbol. Huffman coding is again classified into two different groups Static Huffman coding and Adaptive Huffman coding.

Static Huffman coding uses statistical Symbol Frequency tables, in which symbol frequencies are known in advance i.e. before the actual coding takes place, we know which symbol will occur how many times. Adaptive Huffman compression gains over its static counterpart in the sense that symbol frequencies need not be known in advance. Symbols are encoded as they are encountered. Arm yourself; we will now attack both of these coding schemes one by one.
In static Huffman coding, symbol frequencies are known in advance i.e. we already know which symbol will occur how many times. Compressing a file can be a good use of static Huffman coding. Before actually compressing the file; we may scan through the file and determine how many times a particular symbol occurs. Once we have the Symbol Frequency table we can now apply coding algorithm and get the compressed codes for the symbols. Table 1 shows a typical example of the Symbol Frequency table of a text file containing a total of 100 characters.
Sometimes a universal statistical symbol frequency table is prepared, which is based on previous experiences and analysis of various input streams, e.g. in most text messages the letter ‘e’ appears more than the letter ‘z’; you may verify this by counting the number of times 'e' and 'z' come in this article
Once we have symbol frequency table, we can proceed with the algorithm to compress the data. To get the codes for the given set of symbols we need to create Huffman tree from them.

Huffman trees are strictly binary trees. Strictly binary tree is a special type of binary tree in which all nodes except the leaf nodes have both right and left children. Symbols to be encoded form the leaf nodes of the tree. Figure on top shows a typical node of a Huffman tree. Strictly binary trees and hence Huffman trees have a property that if there are n leaf nodes (symbols to be encoded) then there will be a total of (2n-1) nodes in the tree. The following algorithm will create the Huffman tree from the given set of symbols and will assign codes to the tree created:
Algorithm for creating Huffman tree
Let us take an example that will make things clearer; we will create a Huffman tree for the symbols in table given earlier.

First we create the leaf nodes from the symbol frequency table and sort them. Symbols ‘f’ and ‘e’ have the least frequencies; 4 and 6 respectively; these 2 nodes are combined to make a node ‘fe’ having frequency 4+6=10. This new node ‘fe’ is now the parent node of the nodes ‘f’ and ‘e’, and ‘fe’ replaces ‘f’ and ‘e’. Again we sort the nodes; now ‘fe’ and ‘c’ have least frequencies i.e. 10 each. This time we combine ‘fe’ and ‘c’ to create a new node ‘fec’ having frequency 20. Nodes ‘fe’ and ‘c’ are replaced be their parent ‘fec’. After sorting again; we now combine ‘d’ and ‘fec’ to create ‘dfec’ having frequency 30, 10 of ‘d’ and 20 of ‘fec’. This ‘dfec’ becomes parent of ‘d’ and ‘fec’ and replaces both of them. It is now time to combine the nodes ‘dfec’ and ‘b’ as they have least frequency. New node will be ‘dfecb’ having frequency of 60; again ‘dfec’ and ‘b’ will be replaced by ‘dfecb’. Now only two nodes are left namely ‘dfecb’ and ‘a’. We again sort them and combine both of them to form ‘adfecb’ which has frequency count of 100. After making ‘adfecb’ parent of ‘a’ and ‘dfecb’ and replacing them with ‘adfecb’; we have created the Huffman tree for the symbols in Table 1. Node ‘adfecb’ is the root of the tree.
N.B.: Frequency count of root of a Huffman tree is same as sum of frequency counts of the leaf nodes.
We still have to assign codes to the tree created. To assign the codes start from the root of tree and go on assigning 1 to every left branch and 0 to every right branch. Tree creation part is over now.
Now the obvious question is; how to find the code for a particular symbol? The answer is, to find the code for a particular symbol; start from that symbol and traverse in up direction towards the root. As soon as you encounter a branch output the code assigned to that branch (1/0); when you reach the root you will have the code for that symbol from LSB to MSB. Suppose we need to find Huffman code for ‘f’; we start from ‘f’ and move toward root i.e. f -> fe -> fec -> dfec -> dfecb -> adfecb; we get code 1, 1, 0, 1, 0 respectively (i.e. 1 for f->fe , 1 for fe -> fec, 0 for fec -> dfec and so on). Note that this code that we have is from LSB to MSB so the final code for ‘f’ will be “01011”. Following table shows Huffman code for all the symbols.
Can you find some relation in the Huffman codes in the table? Well, if you notice carefully you will see that no codeword is also a prefix of another codeword. E.g. codeword for b is 00; now there is no other codeword which begins with 00. Codes having this property are called as Prefix codes. In prefix codes no codeword in the set is a prefix to another codeword. Huffman codes are Prefix codes. This property makes Huffman codes easy to decode as we will find later.
Now suppose we want to transmit “abdceabedf”; we can directly send Huffman code for the each of the symbols from Table 2 i.e. 1 for ‘a’, 011 for ‘d’ and so on. Following table shows what will be transmitted for each of the symbols:
So the transmitted code will look like “1000110100010101000101001101011”. There are 10 characters and it takes only 31 bits to transmit them. If we use normal ASCII code i.e. 7 bit fixed length code then it will take 7*10=70 bits to transmit the same string. So in our case, use of Huffman codes saved 39 bits which is around 55%. In a similar fashion Huffman codes can save from around 20% to a whopping 90% depending on the pattern of data being compressed.
A C implementation of Static Huffman Algorithm described above will be described later. The program will accept the Symbol Frequency Table and produce the Huffman codes for them.
Decoding Huffman codes
We already know how to create Huffman codes for a given set of symbols; it is now time to reverse the process. How to recover the original data from the given Huffman code? As mentioned earlier Huffman compression is Lossless compression, we must get the original data back after decompressing. To decode the Huffman codes we need the Symbol Frequency Table which was used while compressing the data. Without the symbol frequency table we can not recover the data back.
First step involved in decoding the Huffman codes is same as the one used in finding the Huffman codes viz Creation of Huffman Tree from the symbol frequency table. We get the same tree as we got while compressing the data. For decoding Huffman codes we make use of the aforementioned property of the Huffman codes i.e. Huffman codes are Prefix codes. Once we have the Huffman tree as in the above figure, we can easily recover the data back.
Use following algorithm for decoding the Huffman codes:
Algorithm for decoding Huffman Codes
[Edit this page] [Page history] [What links here] [Discuss this topic] [Printer Friendly]
Art_Huffman_p1
Next Page
(Computer science) Huffman Trees for Data Compression
Data Compression is one of the most renowned branches of Computer Science. Over the years, a lot of research has been done in this field, to compress data in numerous ways and many standards have been developed. Data Compression can be defined as reducing of the amount of storage space required to store a given amount of data. Data compression comes with a lot of advantages, it saves storage space, bandwidth, cost and TIME required to transmit data from one place to another. In this article we will throw light on Huffman Coding for data compression.
Table of Contents:
- Introduction
- Lossy v/s Lossless coding.
- Fixed length v/s Variable length coding.
- Huffman Algorithm.
- Decoding Huffman Codes.
- Dynamic/Adaptive Huffman codes.
- Applications of Huffman Compression.
- Program in C to implement static Huffman encoding algorithm.
- Readers' Exercise.
- Disclaimer
Huffman compression is a Lossless compression technique. By the word lossless I mean that there is no loss of information after the data is compressed i.e. after we decompress the data, we get back the original information. As opposed to this, in Lossy compression as we compress the data there may be some loss of information and we may not get the original information after the decompression process. Lossless compression is desired while compressing text documents, bank records etc. Lossy compression is used in image compression, sound compression etc.
Data encoding schemes broadly fall into two categories, fixed length encoding and variable length encoding. In fixed length encoding all the symbols are encoded using the same number of bits. Suppose we have 8 symbols to be encoded, a fixed length encoding approach will be to use three bits to encode 8 symbols ( 23= 8). An example of fixed length encoding is ASCII code which uses 7 bits to encode a total of 128 different symbols, 8th bit is sometimes used as a parity bit. Problem with fixed length codes is that they never consider the probability of occurrence of the symbols to be encoded. A symbol that occurs 1000 times will be encoded with the same number of bits as a symbol which comes only 10 times. This drawback makes fixed length encoding inefficient for data compression. Variable length encoding overcomes this problem by assigning less number of bits to symbols which come more often and more number of bits to symbols for which frequency of occurrence is less.
The Huffman Encoding scheme falls in the category of variable length encoding i.e. code for a symbol depends on the frequency of occurrence of that symbol. Huffman coding is again classified into two different groups Static Huffman coding and Adaptive Huffman coding.

Static Huffman coding uses statistical Symbol Frequency tables, in which symbol frequencies are known in advance i.e. before the actual coding takes place, we know which symbol will occur how many times. Adaptive Huffman compression gains over its static counterpart in the sense that symbol frequencies need not be known in advance. Symbols are encoded as they are encountered. Arm yourself; we will now attack both of these coding schemes one by one.
In static Huffman coding, symbol frequencies are known in advance i.e. we already know which symbol will occur how many times. Compressing a file can be a good use of static Huffman coding. Before actually compressing the file; we may scan through the file and determine how many times a particular symbol occurs. Once we have the Symbol Frequency table we can now apply coding algorithm and get the compressed codes for the symbols. Table 1 shows a typical example of the Symbol Frequency table of a text file containing a total of 100 characters.
Symbol Frequency of occurence
a 40
b 30
c 10
d 10
e 6
f 4Sometimes a universal statistical symbol frequency table is prepared, which is based on previous experiences and analysis of various input streams, e.g. in most text messages the letter ‘e’ appears more than the letter ‘z’; you may verify this by counting the number of times 'e' and 'z' come in this article
Once we have symbol frequency table, we can proceed with the algorithm to compress the data. To get the codes for the given set of symbols we need to create Huffman tree from them.

Huffman trees are strictly binary trees. Strictly binary tree is a special type of binary tree in which all nodes except the leaf nodes have both right and left children. Symbols to be encoded form the leaf nodes of the tree. Figure on top shows a typical node of a Huffman tree. Strictly binary trees and hence Huffman trees have a property that if there are n leaf nodes (symbols to be encoded) then there will be a total of (2n-1) nodes in the tree. The following algorithm will create the Huffman tree from the given set of symbols and will assign codes to the tree created:
Algorithm for creating Huffman tree
- Input all symbols (characters) along with their respective frequencies.
- Create leaf (Right and Left links will be NULL) nodes representing the symbols scanned.
- Let S be set containing all the nodes created in step 2.
- [Create the Huffman Tree] While there is only one node in S
- Sort the nodes (symbols) in S with respect to their frequencies.
- Create a new node to combine the 2 nodes (symbols) with least frequencies.
- Frequency of this new combined node (symbol) will be equal to sum of frequencies of nodes (symbols) which were combined.
- Replace, in S, the 2 nodes (symbols) which were combined with the new combined node (symbol).
- [Assigning the codes to the tree]
Let us take an example that will make things clearer; we will create a Huffman tree for the symbols in table given earlier.

First we create the leaf nodes from the symbol frequency table and sort them. Symbols ‘f’ and ‘e’ have the least frequencies; 4 and 6 respectively; these 2 nodes are combined to make a node ‘fe’ having frequency 4+6=10. This new node ‘fe’ is now the parent node of the nodes ‘f’ and ‘e’, and ‘fe’ replaces ‘f’ and ‘e’. Again we sort the nodes; now ‘fe’ and ‘c’ have least frequencies i.e. 10 each. This time we combine ‘fe’ and ‘c’ to create a new node ‘fec’ having frequency 20. Nodes ‘fe’ and ‘c’ are replaced be their parent ‘fec’. After sorting again; we now combine ‘d’ and ‘fec’ to create ‘dfec’ having frequency 30, 10 of ‘d’ and 20 of ‘fec’. This ‘dfec’ becomes parent of ‘d’ and ‘fec’ and replaces both of them. It is now time to combine the nodes ‘dfec’ and ‘b’ as they have least frequency. New node will be ‘dfecb’ having frequency of 60; again ‘dfec’ and ‘b’ will be replaced by ‘dfecb’. Now only two nodes are left namely ‘dfecb’ and ‘a’. We again sort them and combine both of them to form ‘adfecb’ which has frequency count of 100. After making ‘adfecb’ parent of ‘a’ and ‘dfecb’ and replacing them with ‘adfecb’; we have created the Huffman tree for the symbols in Table 1. Node ‘adfecb’ is the root of the tree.
N.B.: Frequency count of root of a Huffman tree is same as sum of frequency counts of the leaf nodes.
We still have to assign codes to the tree created. To assign the codes start from the root of tree and go on assigning 1 to every left branch and 0 to every right branch. Tree creation part is over now.
Now the obvious question is; how to find the code for a particular symbol? The answer is, to find the code for a particular symbol; start from that symbol and traverse in up direction towards the root. As soon as you encounter a branch output the code assigned to that branch (1/0); when you reach the root you will have the code for that symbol from LSB to MSB. Suppose we need to find Huffman code for ‘f’; we start from ‘f’ and move toward root i.e. f -> fe -> fec -> dfec -> dfecb -> adfecb; we get code 1, 1, 0, 1, 0 respectively (i.e. 1 for f->fe , 1 for fe -> fec, 0 for fec -> dfec and so on). Note that this code that we have is from LSB to MSB so the final code for ‘f’ will be “01011”. Following table shows Huffman code for all the symbols.
Symbol Frequency of occurence Huffman code
a 40 1
b 30 00
c 10 0100
d 10 011
e 6 01010
f 4 01011
Can you find some relation in the Huffman codes in the table? Well, if you notice carefully you will see that no codeword is also a prefix of another codeword. E.g. codeword for b is 00; now there is no other codeword which begins with 00. Codes having this property are called as Prefix codes. In prefix codes no codeword in the set is a prefix to another codeword. Huffman codes are Prefix codes. This property makes Huffman codes easy to decode as we will find later.
Now suppose we want to transmit “abdceabedf”; we can directly send Huffman code for the each of the symbols from Table 2 i.e. 1 for ‘a’, 011 for ‘d’ and so on. Following table shows what will be transmitted for each of the symbols:
a b d c e a b e d f
1 00 011 0100 01010 1 00 01010 011 01011So the transmitted code will look like “1000110100010101000101001101011”. There are 10 characters and it takes only 31 bits to transmit them. If we use normal ASCII code i.e. 7 bit fixed length code then it will take 7*10=70 bits to transmit the same string. So in our case, use of Huffman codes saved 39 bits which is around 55%. In a similar fashion Huffman codes can save from around 20% to a whopping 90% depending on the pattern of data being compressed.
A C implementation of Static Huffman Algorithm described above will be described later. The program will accept the Symbol Frequency Table and produce the Huffman codes for them.
Decoding Huffman codes
We already know how to create Huffman codes for a given set of symbols; it is now time to reverse the process. How to recover the original data from the given Huffman code? As mentioned earlier Huffman compression is Lossless compression, we must get the original data back after decompressing. To decode the Huffman codes we need the Symbol Frequency Table which was used while compressing the data. Without the symbol frequency table we can not recover the data back.
First step involved in decoding the Huffman codes is same as the one used in finding the Huffman codes viz Creation of Huffman Tree from the symbol frequency table. We get the same tree as we got while compressing the data. For decoding Huffman codes we make use of the aforementioned property of the Huffman codes i.e. Huffman codes are Prefix codes. Once we have the Huffman tree as in the above figure, we can easily recover the data back.
Use following algorithm for decoding the Huffman codes:
Algorithm for decoding Huffman Codes
- Start from the root of the tree
- Examine the next element in the input
- If it is 1, move to the left child.
- If it is 0, move to the right child.
- If it is leaf node
- Output its symbol
- Go to step1
- If it is not a leaf node then go to step2
1 00 011 0100 01010 1 00 01010 011 01011
a b d c e a b e d f
So the final output symbols string is “abdceabedf”.
Implementations of Static Huffman encoding faces some problems; a few of them are mentioned below:
- As discussed earlier symbol frequencies must be known in advance. Now this can be a real problem, consider a network in which Huffman compression is applied on the data, at some lower layer e.g. physical layer, before sending them on the communication channel. We can not build the Symbol Frequency table in advance as in most cases we do not know the nature of data being transmitted i.e. we do not know whether next character that will be transmitted is an ‘a’ or ‘z’ or something else. The compression must be applied on the fly, which is not possible with the static Huffman encoding.
- Symbol frequency table must be stored /transmitted along with the code words. This table is needed while decoding the Huffman codes. Various implementations of the Huffman Algorithm use different and tricky ways to store/ transmit the symbol frequency table.
- If we use the universal symbol frequency table we have an advantage that this table may not be stored/ transmitted along with Huffman codes as this table is known in advance to parties involved in the process. Disadvantages of this approach is that in an specific situation input stream may totally mismatch with what is there in universal frequency table, e.g. a text message in which there are more ‘z’ than ‘e’. In that situation we do not get good compression ratio.
- While compressing a file using static Huffman encoding we need to do two passes on the file; in the first pass we make the Symbol Frequency table and in the second pass we actually compress the data.
- For large files containing a large number of characters, building a symbol frequency table is time consuming as disk access is quite slow.
Next Page
[Edit this page] [Page history] [What links here] [Discuss this topic] [Printer Friendly]
