CPS 100 FALL 1996: Assignment #6

Due: Wednesday, Dec. 11, by 8am

Last Date to Turn in!: Friday, Dec. 13 by 8am

40 points

For this assignment only, you are allowed and STRONGLY urged to work closely with someone else in CPS 100 and turn in ONE submission with both names on it.

Introduction

This assignment involves implementing a compression/uncompression program based on the greedy Huffman algorithm discussed in Weiss (pp. 377-384) and in class.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as the standard Unix program compress or zip which use different algorithms permitting a higher degree of compression than Huffman coding. You should take care to develop the program in an incremental fashion. You may need to modify the program to meet new specifications, so a good design will prove beneficial. Your program will also need to accept command-line parameters that specify the filename (rather than prompting the user or reading from cin). More information on command-line parameters is given below.

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, an iterative enhancement development of the program is suggested below. You are by no means constrained to adopt this approach, but if your final program doesn't work, it may help to be able to show useful, working programs from along the way.


Assignment Details

The files for this assignment can be found in ~rodger/cps100/assign6. These files are itemized below

You should copy these files to get started. In particular, the file globals.h provides declarations needed by both huff and unhuff programs (compression/uncompression programs). The file filenames.cc shows (very rudimentarily) how to parse command line arguments.

You are to write two programs:

huff.cc
This program compresses a file
unhuff.cc
This program uncompresses a (previously compressed) file

Both programs may eventually read from a file specified by the user on the command line, but initially the user can be prompted for the file name, or you can read input from cin and use I/O redirection to read from a file (if this doesn't make sense, ask). The program huff.cc should write its (compressed) output to a file also specified in some way (by user, on command line, or to cout). The line below might compress the file foo.cc, storing the compressed version in foo.cc.H.

               huff foo.cc foo.cc.H
If your program reads/writes from cin/cout this could be done using: huff < foo.cc > foo.cc.H

As described below in the development section, you MUST use a class-based approach to this problem. This means, for example, that you should implement at least one class. Some suggestions are provided in the development section.


The program huff.cc

You must write a program huff.cc that will compress a file so that it can be uncompressed by unhuff.cc. In writing huff.cc you'll implement at least one class; this class will also provide support for the decompression program unhuff.cc. If compressing a file would result in a file larger than the file being compressed (this is always possible), then no compressed file should be created and a message should be printed indicating that this is the case. The user should have the option of invoking huff with an argument -f, which forces compression even when the compressed file is bigger. For example:

    teer4> huff myprog.c myprog.c.H
    no compression yielded no output file written
    teer4> huff -f myprog.c myprog.c.H

where the second time huff is executed the compression is forced because of the -f argument. Only the first argument to the program can be the -f argument. We'll talk about processing command-line arguments in class, but see filenames.cc too. You may need to use C-style strings to process command-line arguments.

Determining Compression Factor/Savings

To determine if compression results in a smaller file, you'll need to determine the number of characters in the original file (your program will compute this by determining character counts). The size of the compressed file can be calculated from the same character counts: multiply each character's count by the number of bits in its code and add up the results. You must also remember to include the file-header information stored in the compressed file. Don't spend time worrying about how to process command-line arguments until you're sure the program works; this isn't the most important part of the program.
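The size comparison can be sketched as follows. This is only an illustration under stated assumptions: the names CompressedBits and WorthCompressing are hypothetical, codeLength[i] is assumed to hold the number of bits in the code for character i, and headerBits is whatever your own magic number and header format require.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: total size in bits of the compressed file.
// counts[i]     = occurrences of "character" i (pseudo-EOF included)
// codeLength[i] = number of bits in the Huffman code for character i
// headerBits    = bits used by magic number and header (depends on your format)
long CompressedBits(const std::vector<long>& counts,
                    const std::vector<int>& codeLength,
                    long headerBits)
{
    assert(counts.size() == codeLength.size());
    long bits = headerBits;
    for (std::size_t i = 0; i < counts.size(); i++) {
        bits += counts[i] * codeLength[i];
    }
    // assumption: output is written in 8-bit chunks, so round up to a whole byte
    return (bits + 7) / 8 * 8;
}

// Hypothetical helper: true if the compressed file is smaller than the original.
// originalChars = number of 8-bit characters in the original file
bool WorthCompressing(long originalChars, long compressedBits)
{
    return compressedBits < originalChars * 8;
}
```

With the -f argument, huff would simply skip the WorthCompressing check.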

The program unhuff.cc

The program should be robust enough not to crash if it is given a file to uncompress that wasn't compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program. There are a variety of methods that you can use to ensure this works, but it will most likely always be possible to ``fool'' the program.

One easy way to ensure that compression/decompression work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a Huffman-compressed file (for some N). The corresponding uncompression program first reads N bits. If the bits don't represent the "magic number", then the compressed file is not properly formed.
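A minimal sketch of the magic-number idea is shown below. It uses ordinary C++ streams rather than the bit-stream classes from bitops.h, and the value MAGIC, the choice N = 32, and all function names are assumptions made up for the illustration.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical magic value; any fixed pattern that huff and unhuff agree on works.
const unsigned long MAGIC = 0xFACE8200UL;

// Write the magic number as four bytes, most significant byte first.
void WriteMagic(std::ostream& out)
{
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.put(static_cast<char>((MAGIC >> shift) & 0xFF));
    }
}

// Return true only if the next four bytes of the input equal the magic number.
bool CheckMagic(std::istream& in)
{
    unsigned long value = 0;
    for (int k = 0; k < 4; k++) {
        char ch;
        if (!in.get(ch)) {
            return false;   // file too short to be well-formed
        }
        value = (value << 8) | static_cast<unsigned char>(ch);
    }
    return value == MAGIC;
}

// Quick in-memory check that the two functions agree with each other.
void MagicSelfCheck()
{
    std::stringstream buffer;
    WriteMagic(buffer);
    assert(CheckMagic(buffer));
}
```

A file that happens to start with the same four bytes will still pass the check, which is one way the program can be ``fooled''.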


Developing the programs

You may want to develop the programs in stages to ease debugging and decrease program development time. One approach is suggested below, but modifications of this approach are certainly possible. You should also use a class to encapsulate functions that are common to both compression and decompression programs. One idea is to write a class HuffStuff that includes routines used in both programs. You'll need to think of what member functions to implement; three suggestions are given below to help you get started.

Class Member functions

CountCharFreqs
Count the number of times each "character" (a character is described below) occurs in a file that will be compressed.

FreqsToTree
Use frequency counts to build a forest of trees and then a single tree that will be used to establish the coding of each character. You may decide that this function and the function TreeToTable below should be private, and called by some public member function like FreqsToCodes. Don't let your design and imagination be limited by these suggestions.

TreeToTable
Use the tree, and some kind of tree-traversal, to determine the coding for each character. You'll need to determine the bit-pattern and the number of bits for each character. The number of bits is needed since the bit patterns 00101 and 0101 are different, but both are represented as numbers by 5.

You'll need to provide more functions and decide on what private data members are needed.
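One possible shape for FreqsToTree and TreeToTable is sketched below as free functions. Several things here are assumptions rather than requirements: std::priority_queue stands in for the course's tpqueue class, the Node struct is invented for the example, and codes are stored as strings of '0' and '1' for readability instead of as a bit-pattern plus a length. The sketch never frees its nodes, which a real class should do in its destructor.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical tree node: ch is the character value, or -1 for an internal node.
struct Node {
    int   ch;
    long  weight;
    Node* left;
    Node* right;
};

// Comparison that puts the node with the smallest weight on top of the queue.
struct HeavierThan {
    bool operator()(const Node* a, const Node* b) const
    {
        return a->weight > b->weight;
    }
};

// Build one tree from the forest of one-node trees; zero counts get no leaf.
// Returns nullptr if every count is zero.
Node* FreqsToTree(const std::vector<long>& counts)
{
    std::priority_queue<Node*, std::vector<Node*>, HeavierThan> forest;
    for (std::size_t i = 0; i < counts.size(); i++) {
        if (counts[i] > 0) {
            forest.push(new Node{static_cast<int>(i), counts[i], nullptr, nullptr});
        }
    }
    if (forest.empty()) return nullptr;
    while (forest.size() > 1) {
        // greedy step: join the two lightest trees under a new root
        Node* a = forest.top(); forest.pop();
        Node* b = forest.top(); forest.pop();
        forest.push(new Node{-1, a->weight + b->weight, a, b});
    }
    return forest.top();
}

// Traverse the tree, appending '0' when going left and '1' when going right.
void TreeToTable(const Node* t, const std::string& path,
                 std::vector<std::string>& table)
{
    if (t == nullptr) return;
    if (t->left == nullptr && t->right == nullptr) {
        assert(t->ch >= 0 && static_cast<std::size_t>(t->ch) < table.size());
        table[t->ch] = path;   // leaf: the path from the root is the code
        return;
    }
    TreeToTable(t->left, path + "0", table);
    TreeToTable(t->right, path + "1", table);
}
```

Which of two equal-weight trees is removed first is not specified, so the exact bit patterns may differ between implementations even though the code lengths are equally good.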

Iterative Enhancement

Some steps that may be useful in developing the program are described below. It's important to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you'll find it easier to isolate bugs and you'll be more likely to get a program working faster.


Coding Details

You will need to create an array to store ``character'' counts (character is in quotes because what is really counted is 8-bit sequences; a C/C++ char happens to hold 8 bits). Each element of this array will be initialized to zero. The final counts will be the weights of the one-node trees in the initial forest of trees that's used to construct the Huffman tree.

In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character.

Every time a file is compressed the count of the number of times that pseudo-EOF occurs should be one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
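The counting step might look something like the sketch below. It reads from an ordinary istream one char at a time instead of using ibstream, and it assumes that the pseudo-EOF character occupies the last slot of the array (index 256); check globals.h for the names and values actually provided.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

const int HUFF_ALPH_SIZE = 257;              // value given in the handout
const int PSEUDO_EOF = HUFF_ALPH_SIZE - 1;   // assumption: the extra, last slot

// Count how often each 8-bit value occurs, then add the pseudo-EOF explicitly.
std::vector<long> CountCharFreqs(std::istream& in)
{
    std::vector<long> counts(HUFF_ALPH_SIZE, 0);   // every count starts at zero
    char ch;
    while (in.get(ch)) {
        // go through unsigned char so values 128..255 index correctly
        counts[static_cast<unsigned char>(ch)]++;
    }
    counts[PSEUDO_EOF] = 1;   // never read from the file, so set by hand
    return counts;
}

// Number of real characters counted (pseudo-EOF excluded).
long TotalChars(const std::vector<long>& counts)
{
    assert(counts.size() == static_cast<std::size_t>(HUFF_ALPH_SIZE));
    long total = 0;
    for (int i = 0; i < PSEUDO_EOF; i++) {
        total += counts[i];
    }
    return total;
}
```

A std::istringstream makes it easy to try the function on a short string before pointing it at a real file.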

Pseudo-EOF character

The operating system will buffer output; that is, output to disk actually occurs only when some internal buffer is full. In particular, it is not possible to write just one single bit to a file. All output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can't be sure precisely when they're written during the execution of your program.

Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for these "extra" bits, you cannot read all bits until there are no more, since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

    int bits;
    while (input.ReadBits(1,bits))
    {
        // process bits
    }

To avoid this problem, you'll use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file.

    int bits;
    while (....)
    {
        if (! input.ReadBits(1,bits))
        {
            cerr << "should not happen! trouble reading bits" << endl;
        }
        // use rightmost bit of bits to traverse Huffman coding tree
        // if leaf is reached, decode the character
        // if character is pseudo-EOF, then decompression done
        // otherwise output character and reset to top of tree for
        // reading more bits
    }

When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You'll have to write these bits explicitly. These bits will be recognized by the program unhuff.cc and used in the decompression process. In particular, when using unhuff a well-formed compressed file will be terminated with the encoded form of the pseudo-EOF char (see code above). This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (think about this). In other words, when decompressing (unhuffing) you'll read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate. If reading a bit fails because there are no more bits (the bitreading function returns false) the compressed file is not well-formed.
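The decoding loop described above can be tried out without any file I/O. In the sketch below the bits come from a vector of ints instead of an ibstream, and the Node struct, the value 256 for PSEUDO_EOF, and the function name Decode are all assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const int PSEUDO_EOF = 256;   // assumption: the value just past the 8-bit range

// Hypothetical tree node: leaves have no children and hold a character value.
struct Node {
    int   ch;
    Node* left;
    Node* right;
};

// Walk the tree one bit at a time (0 = left, 1 = right).
// Returns true when the pseudo-EOF leaf is reached, false if the bits run
// out first, which means the input was not well-formed.
bool Decode(const Node* root, const std::vector<int>& bits, std::string& out)
{
    assert(root != nullptr);
    const Node* cur = root;
    for (std::size_t i = 0; i < bits.size(); i++) {
        cur = (bits[i] & 1) ? cur->right : cur->left;
        if (cur == nullptr) {
            return false;                     // malformed tree
        }
        if (cur->left == nullptr && cur->right == nullptr) {
            if (cur->ch == PSEUDO_EOF) {
                return true;                  // done; later padding bits are ignored
            }
            out += static_cast<char>(cur->ch);
            cur = root;                       // back to the top for the next code
        }
    }
    return false;                             // ran out of bits before pseudo-EOF
}
```

Because the loop returns as soon as the pseudo-EOF leaf is reached, any extra bits added by buffering are never interpreted as characters.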


Compressed Header

For decompression to work with Huffman coding, information must be stored in the compressed file that allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, it's possible to store just counts and recreate the codes. It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable. If you use a successful space-saving technique you can earn several points of extra credit (up to 5). There are some suggestions in Weiss for this.
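The simplest of these options, storing the counts themselves, is sketched below. It uses ordinary streams and writes each count as four bytes; the function names are hypothetical and nothing here is a required format. Note the cost: 256 counts at 32 bits each is 8192 bits of header, which is why small files often fail to shrink.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Write each count as four bytes, most significant byte first.
void WriteCounts(std::ostream& out, const std::vector<unsigned long>& counts)
{
    for (std::size_t i = 0; i < counts.size(); i++) {
        assert(counts[i] <= 0xFFFFFFFFUL);   // each count must fit in 32 bits
        for (int shift = 24; shift >= 0; shift -= 8) {
            out.put(static_cast<char>((counts[i] >> shift) & 0xFF));
        }
    }
}

// Read n counts back; returns false if the input ends too early.
bool ReadCounts(std::istream& in, std::size_t n,
                std::vector<unsigned long>& counts)
{
    counts.assign(n, 0);
    for (std::size_t i = 0; i < n; i++) {
        for (int k = 0; k < 4; k++) {
            char ch;
            if (!in.get(ch)) {
                return false;
            }
            counts[i] = (counts[i] << 8) | static_cast<unsigned char>(ch);
        }
    }
    return true;
}

// In-memory check: what is written can be read back unchanged.
bool CountsRoundTrip(const std::vector<unsigned long>& counts)
{
    std::stringstream buffer;
    WriteCounts(buffer, counts);
    std::vector<unsigned long> back;
    return ReadCounts(buffer, counts.size(), back) && back == counts;
}
```

Since unhuff rebuilds the tree from these counts, it must build it with exactly the same procedure huff used, including how ties between equal weights are broken.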


Reading Files Twice

Finally, because Huffman coding requires two passes over the input file you will need to use the method Rewind that works with object variables of type ibstream --- see the documentation in bitops.h . The effect of ib.Rewind() is to allow the input bit stream ib to be ``read again'', i.e., a loop that checks for EOF will be able to read the entire file again --- see the example file bitread.cc.

Priority Queues

You should use a heap-based, templated priority queue class given in tpqueue.h and tpqueue.cc. Alternatively, you can use (or modify) the templated priority queue class given in Weiss (page 643 in Chapter 20). All code from the book can be found in ~rodger/weiss. The difference is that the Weiss priority queue requires that the template parameter represent an ordered type (so that min can be determined). The code provided in tpqueue.cc and tpqueue.h has an int priority that is not "part" of the class/type being inserted into the priority queue.

In either case, you'll need to create and use a template.cc file with the proper class instantiation for each templated class used in your program (similar to what was done in the word-ladder assignment).


Using bitops.cc

In order to read and write in a bit-at-a-time manner, the file bitops.cc and the corresponding header file bitops.h must be used. Two classes are specified for reading/writing bits-at-a-time: ibstream and obstream, respectively.

Bit read/write subprograms

To see how the Readbits routine works, note that the code segment below is functionally equivalent to the Unix cat foo command --- it reads BITS_PER_WORD bits at a time (which is 8 bits as shown in bitops.h) and echoes what is read.

    int inbits;
    ibstream ibs("foo");
    while (ibs.Readbits(BITS_PER_WORD,inbits))
    {
        cout.put(inbits);
    }
This code is similar to the loop shown below, which uses the standard function to read a char from cin.
    char inbits;
    while (cin.get(inbits))
    {
        cout.put(inbits);
    }

Note that although Readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:

    int inbits;
    ibstream ibs("filename");
    ibs.Readbits(1,inbits);
    if ( (inbits & 1) == 1)
    {
        // read the bit 1
    }
    else
    {
        // read the bit 0
    }

Alternatively, the function KthBit can be used to extract a specific bit from an int --- see the specification in bitops.h and note that the right-most bit is the first bit. You may find it useful in creating Huffman codes to use the shiftleft operator << and the bit-wise or operator |. Be careful in using shiftleft (and shiftright) because of potential confusion with stream operators. If you fully parenthesize expressions, trouble can usually be avoided. An example program using the bitreading classes is provided for you to study; it is called bitread.cc.
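As a small illustration of the shift and or operators, the sketch below builds a code one bit at a time and keeps track of its length. AppendBit and BitAt are hypothetical names; in particular BitAt is not the KthBit function from bitops.h, it only follows the same convention that the right-most bit is bit 1.

```cpp
#include <cassert>

// Append one bit (0 or 1) to the right end of a code and update its length.
void AppendBit(int& code, int& length, int bit)
{
    assert(bit == 0 || bit == 1);
    code = (code << 1) | bit;   // parenthesized to keep << clearly a shift
    length++;
}

// Return bit k of value, where the right-most bit is bit 1.
int BitAt(int value, int k)
{
    assert(k >= 1);
    return (value >> (k - 1)) & 1;
}
```

Appending the bits 0, 0, 1, 0, 1 and appending the bits 0, 1, 0, 1 both leave code equal to 5; only the lengths (5 and 4) tell the two patterns apart.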

Writing Everything

Executing the C++ statement cout.put(7) results in 8 bits being written because a C/C++ char uses 8 bits (the 8 bits are the character whose code is 7, not the digit '7'). Executing cout << 7 also results in 8 bits being written: the int is formatted as text, so the single character '7' is written rather than the 32 bits that a C/C++ int uses in memory. Executing obs.Writebits(7,3) results in 3 bits being written (to the output bit-stream obs) --- all the bits are 1 because the number 7 is represented in base two by 111.

When using Writebits to write a specified number of bits, some bits may not be written because of some buffering that takes place. To ensure that all bits are written you should call Flushbits. Note that some buffering is done with Readbits as well, but you shouldn't need to worry about this. The function Flushbits only needs to be called once. If you call Flushbits more than once when writing the compressed file you will likely get erroneous bits when reading them back in.
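To see why buffering produces extra bits, it can help to look at a toy bit writer. The class below is NOT the obstream class from bitops.cc; BitWriter, Append, Flush, and Bytes are hypothetical names, and the writer collects its output in a vector instead of a file. It only illustrates the idea that bits are held back until eight have accumulated.

```cpp
#include <cassert>
#include <vector>

class BitWriter {
public:
    BitWriter() : myBuffer(0), myCount(0) {}

    // Append the right-most numBits bits of value, left-most of those first.
    void Append(int value, int numBits)
    {
        assert(numBits >= 0 && numBits <= 31);
        for (int k = numBits - 1; k >= 0; k--) {
            myBuffer = (myBuffer << 1) | ((value >> k) & 1);
            myCount++;
            if (myCount == 8) {            // a full chunk: now it is "written"
                myBytes.push_back(static_cast<unsigned char>(myBuffer));
                myBuffer = 0;
                myCount = 0;
            }
        }
    }

    // Pad a partial chunk with 0 bits on the right and write it out.
    void Flush()
    {
        if (myCount > 0) {
            myBytes.push_back(
                static_cast<unsigned char>(myBuffer << (8 - myCount)));
            myBuffer = 0;
            myCount = 0;
        }
    }

    const std::vector<unsigned char>& Bytes() const { return myBytes; }

private:
    int myBuffer;                          // bits waiting to be written
    int myCount;                           // how many bits are waiting
    std::vector<unsigned char> myBytes;    // stands in for the output file
};
```

After 61 single-bit calls to Append, only 7 bytes (56 bits) have been stored; the call to Flush adds the remaining 5 bits plus 3 padding bits, for 64 bits in all.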


Command-line Arguments

When a C/C++ program is invoked, arguments are often given as command-line arguments/parameters, i.e., options entered on the same line as the program. For example, g++ -g -o foo foo.cc invokes the g++ compiler (program) with four command-line arguments: -g, -o, foo, and foo.cc. Command-line arguments are passed to the program in a C-style array of strings along with a count of the number of command-line arguments passed. The header for main must be altered in a program that processes command-line arguments.

    int main(int argc, char *argv[])
    {
        if (argc == 1)
        {
            cout << "program has NO command-line arguments" << endl;
            cout << "name of program is " << argv[0] << endl;
        }
        else
        {
            int k;
            cout << "program " << argv[0] << " has arguments:" << endl;
            for(k=1; k < argc; k++)
            {
                cout << "arg: " << k << " = " << argv[k] << endl;
            }
        }
        return 0;
    }

As shown in the code fragment above, and in the sample file filenames.cc, all programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero).

The array argv is an array of C-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the end of a C-style string. You do NOT need to know this to manipulate these char *, C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in filenames.cc. Then you can use "standard" C++ string functions to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style, char * strings.
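For huff, argument processing might be organized along the lines of the sketch below. It uses the standard library std::string (the string class used in filenames.cc may differ), and the Options struct and ParseArgs function are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical record of what was found on the command line.
struct Options {
    bool        force;     // true if -f was given as the first argument
    std::string inName;    // file to compress
    std::string outName;   // file to hold the compressed version
    bool        ok;        // false if the arguments were not understood
};

// Accepts "huff in out" or "huff -f in out"; -f is recognized only first.
Options ParseArgs(int argc, char* argv[])
{
    assert(argv != nullptr);
    Options opts = {false, "", "", false};
    int next = 1;                              // argv[0] is the program name
    if (next < argc && std::string(argv[next]) == "-f") {
        opts.force = true;
        next++;
    }
    if (argc - next != 2) {
        return opts;                           // expect exactly two file names
    }
    opts.inName = argv[next];                  // char * converts to a C++ string
    opts.outName = argv[next + 1];
    opts.ok = true;
    return opts;
}
```

In main you would call ParseArgs(argc, argv) and print a usage message when the ok field is false.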


Submitting Programs

When your programs compile and produce the correct output, create a "README" file (please use all capital letters). Include your name, the date, and an estimate of how long you worked on the assignment in the "README" file. You must also include a list of names of all those people (students, prof, tas, tutor) with whom you consulted on the assignment. See the rules for collaboration in the CPS 100 syllabus.

For this assignment only, you are allowed and STRONGLY urged to work closely with someone else in CPS 100 and turn in ONE submission with both names on it. Make sure both names are listed as authors in a comment at the top of each file and also in the README file.

To submit your programs electronically type (where file1 file2 ... are all the .cc and .h files needed by your program):

   submit100 assign6 huff.cc unhuff.cc README Makefile file1 file2 ...

The Makefile should be able to make executables huff and unhuff. You need to submit all .h and .cc files that are modified versions of files given to you, plus any others you created.

You should receive a message telling you that the program was submitted correctly. If it doesn't work try typing ~rodger/bin/submit100 in place of submit100 above.