Implementation of Huffman Coding for lossless text file compression

Vedant Bhagwan Wani
6 min read · Jul 12, 2021

Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters; the lengths of the assigned codes are based on the frequencies of the corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.

The variable-length codes assigned to input characters are Prefix Codes, meaning the codes (bit sequences) are assigned in such a way that the code assigned to one character is never the prefix of the code assigned to any other character. This is how Huffman Coding makes sure that there is no ambiguity when decoding the generated bitstream.

Let us understand prefix codes with a counterexample. Let there be four characters a, b, c and d, and let their corresponding variable-length codes be 00, 01, 0 and 1. This coding leads to ambiguity because the code assigned to c is the prefix of the codes assigned to a and b. If the compressed bit stream is 0001, the decompressed output may be “cccd” or “ccb” or “acd” or “ab”.

There are two major parts in Huffman Coding (a small worked example follows the list):

  1. Build a Huffman Tree from input characters.
  2. Traverse the Huffman Tree and assign codes to characters.
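
For instance, take the string “aabbbc”: the frequencies are a:2, b:3, c:1. Building the tree, we first merge the two least frequent symbols, c (1) and a (2), into a node of weight 3, and then merge that node with b (3) into the root of weight 6. Reading off the root-to-leaf paths, one valid assignment is b = 0, c = 10, a = 11, so “aabbbc” encodes to 11 11 0 0 0 10, just 9 bits instead of the 48 (6 characters × 8 bits) of the raw text.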


My complete program for file compression is on GitHub -

myCompressor.cpp

Now, assuming you’ve gotten familiar with the working principle of the algorithm, let’s proceed with the entirety of my code:

Let’s start from int main(), as any program execution starts at main(), and ends at main().

First, we input the name of the file (along with its extension) to be compressed. The file stream object fin is opened in input mode for the said file.

Then we call the function fillFreqArray(fin); which, as the name suggests, reads from the input file and maps each distinct character present in it to its frequency. We also add a pseudo EndOfFile character (which I chose to be the character with byte value 129, just outside the 7-bit ASCII range, so it cannot occur in ordinary text) with frequency exactly 1, the huff-code of which is to be appended at the end of our compressed file. We use the following global variables:

int freq[256];
int size = 0; // actual number of elements stored in the freq array
char arr[256];
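
To make this concrete, here is a minimal sketch of what fillFreqArray() might look like using these globals (the exact code is in the repo; totchar is my stand-in name for the running character count mentioned below):

#include <fstream>

long totchar = 0;                      // total characters in the input file

void fillFreqArray(std::ifstream &fin) {
    int count[256] = {0};
    char ch;
    while (fin.get(ch)) {              // read the file one character at a time
        ++count[(unsigned char)ch];
        ++totchar;
    }
    count[129] = 1;                    // the pseudo EndOfFile character
    for (int i = 0; i < 256; ++i) {    // keep only the characters that occur
        if (count[i] > 0) {
            arr[size] = (char)i;
            freq[size] = count[i];
            ++size;
        }
    }
}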

Meanwhile, we also calculate the total number of characters in it (the use of which is discussed later). Once the frequency array is generated, we give a function call to the soul of our program,

void generateHuffmanCodes(char data[], int freq[], int size)

It builds a Huffman Tree from the freq array and generates codes by traversing the built Huffman Tree. This dictionary of codes is what will be used for encoding, and later decoding, the files. So we save this code mapping in a “dictionary.txt” file, to be used later for decoding the compressed file as and when required.
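
In outline, the function follows the standard priority-queue construction; this is a sketch rather than my exact code, and the codes map is my illustrative container for the dictionary:

#include <queue>
#include <vector>
#include <string>
#include <map>

struct Node {
    char data; int freq;
    Node *left, *right;
    Node(char d, int f, Node *l = nullptr, Node *r = nullptr)
        : data(d), freq(f), left(l), right(r) {}
};
struct Compare {                       // orders the heap by lowest frequency
    bool operator()(Node *a, Node *b) const { return a->freq > b->freq; }
};
std::map<char, std::string> codes;     // character -> huff-code dictionary

void storeCodes(Node *root, const std::string &code) {
    if (!root) return;
    if (!root->left && !root->right) { codes[root->data] = code; return; }
    storeCodes(root->left,  code + "0");   // a left edge appends a 0
    storeCodes(root->right, code + "1");   // a right edge appends a 1
}

void generateHuffmanCodes(char data[], int freq[], int size) {
    std::priority_queue<Node*, std::vector<Node*>, Compare> pq;
    for (int i = 0; i < size; ++i)
        pq.push(new Node(data[i], freq[i]));
    while (pq.size() > 1) {            // repeatedly merge the two rarest nodes
        Node *l = pq.top(); pq.pop();
        Node *r = pq.top(); pq.pop();
        pq.push(new Node('\0', l->freq + r->freq, l, r));
    }
    storeCodes(pq.top(), "");          // root-to-leaf paths become the codes
}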

The next call, to writeToDictionary(), does precisely that. As the output is a .txt file, it is human-readable. This, however, will not be the case with our (about to be) generated compressed file. As the dictionary contains no more than 256 mappings, its maximum size is a constant (a few kBs). But our main file (and so the compressed one) has no limit on its size, so we would like to optimize the memory used in that case as much as possible.
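
A sketch of how such a dictionary could be written, one character-code pair per line (the real format in the repo may differ; I write the numeric byte value to sidestep unprintable characters such as newlines):

void writeToDictionary() {
    std::ofstream dict("dictionary.txt");
    for (const auto &entry : codes)    // codes built by generateHuffmanCodes
        dict << (int)(unsigned char)entry.first << " " << entry.second << "\n";
}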

One important point to take note of is that in C++ a char is the smallest addressable data type, one byte in size. This is what enabled us to so conveniently read and write one char at a time in our text files. It also means, however, that a chunk of one byte (the size of a char) is the smallest unit we can use while interacting with our file. But since our output compressed file consists only of zeros and ones, it would be wasteful to spend 8 bits of memory on representing a single 0 or 1.

A workaround for this is implemented in the BitwiseWrite class. The principal idea behind it is that it takes the bits to be written, one by one, from a stream of strings (the character-wise huff-codes of the original file) and packs them into an unsigned char, 8 bits at a time.

As an unsigned char has 8 bits, we store the encoding of our compressed file into these bit positions (in chunks of 8 at a time), write each full chunk byte by byte into the output file, and clear the buffer. This keeps the memory requirement independent of the size of the file being compressed.
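
Here is a minimal version of such a class; the name matches my BitwiseWrite, but these internals are my own sketch:

class BitwiseWrite {
    std::ofstream &out;
    unsigned char buf;                 // the 8-bit chunk currently being filled
    int nbits;                         // how many bits of buf are in use
public:
    explicit BitwiseWrite(std::ofstream &o) : out(o), buf(0), nbits(0) {}
    void writeBit(int bit) {
        buf = (buf << 1) | (bit & 1);  // push the bit in, MSB first
        if (++nbits == 8) flush();     // a full byte: write it and reset
    }
    void flush() {                     // pad a partial final byte with 0s
        if (nbits == 0) return;
        buf <<= (8 - nbits);
        out.put((char)buf);
        buf = 0; nbits = 0;
    }
};

Because the pseudo EndOfFile code is written before the final flush(), the zero padding after it can never be misread as real data.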

Now the size of the compressed file is about 8 times smaller than it would be if we had stored the bit-wise information as characters ‘0’ and ‘1’, in a way readable by normal human eyes (you can still train your brain to read an encoded binary file if you want!).

Anyway, coming back to our main(), our next call, writeToFile(fin, fout), does exactly what is described above. As this reading and writing of files takes the most time in our entire program (depending on the file size it can even take several minutes!), I decided to add a classic progress bar of sorts while we encode into the compressed file one char conversion at a time (we can compute the instantaneous progress since we already counted the total characters in the original file).
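
Put together, the encoding loop with its progress read-out might look like the following; writeBit()/flush() come from the BitwiseWrite sketch above and totchar from the counting pass:

#include <iostream>

void writeToFile(std::ifstream &fin, std::ofstream &fout) {
    fin.clear();                       // rewind: the stream hit EOF during
    fin.seekg(0);                      // the frequency-counting pass
    BitwiseWrite bw(fout);
    char ch;
    long done = 0;
    while (fin.get(ch)) {
        for (char bit : codes[ch])     // emit this character's huff-code
            bw.writeBit(bit == '1');
        if (++done % 10000 == 0)       // occasional progress update
            std::cout << "\rProgress: " << (100 * done / totchar) << "%" << std::flush;
    }
    for (char bit : codes[(char)129])  // append the pseudo EndOfFile code
        bw.writeBit(bit == '1');
    bw.flush();                        // pad and write out the final byte
}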

After our file-writing process is done, we give a final call to cal_compressionRatio(), computed from the size of the output file (tracked while writing into it) and the size of the input file. With this, our program for compression reaches its completion!
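
The ratio itself is plain arithmetic; a sketch assuming we tracked the byte counts while reading and writing (inBytes and outBytes are my hypothetical parameters):

void cal_compressionRatio(long inBytes, long outBytes) {
    std::cout << "Original: " << inBytes << " bytes, compressed: "
              << outBytes << " bytes\n";
    std::cout << "Compression ratio: " << (double)inBytes / outBytes << "\n";
}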

myDecompressor.cpp

Just when I thought all the work was done, I realised we’d also have to write a program that would losslessly retrieve the original text file from the dictionary and the compressed file we created earlier!

So, once again, diving headfirst into int main().

First, we ask the user to input the filename of the Dictionary which stores the character-code mapping.

Then we associate an object codestream of the ifstream class with this file and build the Huffman decoding tree (the same as the coding tree) with a call to buildDecodingTree(codestream) of the user-defined class Huffman, in which the necessary utility functions for building the tree and decoding have been defined.
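
Rebuilding the tree just means re-inserting every code bit by bit, creating nodes along the way. A sketch, assuming the dictionary format from the compressor above (Node is the same structure as in the encoding sketch, and in my actual code this lives inside the Huffman class):

Node *root = nullptr;                  // root of the decoding tree

void buildDecodingTree(std::ifstream &codestream) {
    root = new Node('\0', 0);
    int byteval;
    std::string code;
    while (codestream >> byteval >> code) {   // one "byte code" pair per line
        Node *cur = root;
        for (char bit : code) {               // walk the path, creating nodes
            Node *&next = (bit == '0') ? cur->left : cur->right;
            if (!next) next = new Node('\0', 0);
            cur = next;
        }
        cur->data = (char)byteval;            // the leaf stores the character
    }
}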

Once the tree is built, we take the compressed file from the user and use the class BitwiseRead to read chunks of 8 bits at a time from it. The concept is complementary to that of the BitwiseWrite class which we discussed earlier.

We create a single output file, “Decompressed.txt”, and write the decoded characters one by one from the stream of zeros and ones coming from the compressed file. We continue to stream the input until the decoded character is our personal pseudo EndOfFile character, which we had appended while compressing the file to avoid ambiguity with the eof character(s) used by the compiler and the OS.
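
The decoding loop walks the tree one bit at a time and restarts from the root at every leaf, stopping as soon as the decoded leaf is the pseudo EndOfFile. A sketch (the real code pulls its bits through the BitwiseRead class instead of unpacking bytes inline):

void decompress(std::ifstream &fin, std::ofstream &fout) {
    Node *cur = root;
    char byte;
    while (fin.get(byte)) {
        for (int i = 7; i >= 0; --i) {        // most significant bit first
            int bit = ((unsigned char)byte >> i) & 1;
            cur = bit ? cur->right : cur->left;
            if (!cur->left && !cur->right) {  // reached a leaf
                if (cur->data == (char)129)   // pseudo EndOfFile: stop here
                    return;
                fout.put(cur->data);          // write the decoded character
                cur = root;
            }
        }
    }
}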

Every decoded character is flushed into the output file as soon as it is produced (except our EndOfFile) to keep the buffer’s memory requirement independent of the file size.

This brings us to the complete end of the file compression and decompression algorithm. I had a lot of fun doing this self-project and learnt a lot of interesting new stuff and its applications. I especially had the most fun working on the bit-stream packing class. Making it work with arbitrarily sized buffers and writing the mechanism to avoid buffer overflows helped me learn so much about the good old C++: its various idiosyncrasies, its weaknesses and strengths, memory management, code organisation and better use of test programs. I know I still have a long way to go, but this project has been an important step towards that destination.
