Due: Wednesday, Dec. 11, by 8am
Last Date to Turn in!: Friday, Dec. 13 by 8am
40 points
For this assignment only, you are allowed and STRONGLY urged to work closely with someone else in CPS 100 and turn in ONE submission with both names on it.
The resulting program will be a complete and useful compression program, although perhaps not as powerful as the standard Unix programs compress or zip, which use different algorithms that permit a higher degree of compression than Huffman coding. You should take care to develop the program in an incremental fashion. You may need to modify the program to meet new specifications, so a good design will prove beneficial. Your program will also need to accept command-line parameters that specify the filename (rather than prompting the user or reading from cin). More information on command-line parameters is given below.
Because the compression scheme involves reading and writing bits at a time rather than chars at a time, the program can be hard to debug. To facilitate the design/code/debug cycle, an iterative enhancement development of the program is suggested below. You are by no means constrained to adopt this approach, but if your final program doesn't work, it may help to be able to show useful, working programs from earlier stages.
You should copy these files to get started. In particular, the file globals.h provides declarations needed by both huff and unhuff programs (compression/uncompression programs). The file filenames.cc shows (very rudimentarily) how to parse command line arguments.
You are to write two programs:
Both programs may eventually read from a file specified by the user on the command line, but initially the user can be prompted for the file name, or you can read input from cin and use I/O redirection to read from a file (if this doesn't make sense, ask). The program huff.cc should write its (compressed) output to a file also specified in some way (by the user, on the command line, or to cout). The line below might compress the file foo.cc, storing the compressed version in foo.cc.H.
huff foo.cc foo.cc.H
If your program reads/writes from cin/cout this could be done using:
As described below in the development section, you MUST use a class-based approach to this problem. This means, for example, that you should implement at least one class. Some suggestions are provided in the development section.
You must write a program huff.cc that will compress a file so that it can be uncompressed by unhuff.cc. In writing huff.cc you'll implement at least one class; this class will also provide support for the decompression program unhuff.cc. If compressing a file would result in a file larger than the file being compressed (this is always possible), then no compressed file should be created and a message should be printed indicating that this is the case. The user should have the option of invoking huff with an argument -f, which forces compression even when the compressed file is bigger. For example:
where the second time huff is executed the compression is forced because of the -f argument. Only the first argument to the program can be the -f argument. We'll talk about processing command-line arguments in class, but see filenames.cc too. You may need to use C-style strings to process command-line arguments.
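One way to handle the -f rule is sketched below. The names HuffOptions and parseHuffArgs are hypothetical and are not part of the provided files. The sketch assumes that the arguments after the program name have already been copied into C++ strings and that both an input and an output file name are given on the command line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the parsed command line (not from globals.h).
struct HuffOptions {
    bool force = false;       // true when -f was the first argument
    std::string inputName;    // file to compress
    std::string outputName;   // file that receives the compressed bits
    bool valid = false;       // false when the argument count is wrong
};

// args holds argv[1..argc-1] as C++ strings.
// Only the first argument is ever treated as -f.
HuffOptions parseHuffArgs(const std::vector<std::string>& args) {
    HuffOptions opts;
    std::size_t next = 0;
    if (!args.empty() && args[0] == "-f") {
        opts.force = true;
        next = 1;
    }
    // Exactly two file names must remain: input, then output.
    if (args.size() - next == 2) {
        opts.inputName = args[next];
        opts.outputName = args[next + 1];
        opts.valid = true;
    }
    return opts;
}
```

Because -f is only recognized in the first position, a command line such as huff foo.cc -f foo.cc.H is rejected by this sketch rather than silently forcing compression.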
The program should be robust enough not to crash if it is given a file to uncompress that wasn't compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program. There are a variety of methods that you can use to ensure this works, but it will most likely always be possible to "fool" the program.
One easy way to ensure that compression/decompression work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a Huffman-compressed file (for some N). The corresponding uncompression program first reads N bits; if the bits don't represent the "magic number" then the compressed file is not properly formed.
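The magic-number check might look roughly like the sketch below. It uses standard C++ streams in place of the course's bit-stream classes, and the constant MAGIC_NUMBER, its value, and the function names writeMagic and hasMagic are all assumptions made for illustration. Here N is 32.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical 32-bit tag; any fixed value shared by huff and unhuff works.
const std::uint32_t MAGIC_NUMBER = 0xFACE8200u;

// Write the tag as four bytes, most significant byte first.
void writeMagic(std::ostream& out) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.put(static_cast<char>((MAGIC_NUMBER >> shift) & 0xFF));
    }
}

// Read four bytes and report whether they match the tag.
// Returns false for a file that is too short as well as for a mismatch.
bool hasMagic(std::istream& in) {
    std::uint32_t seen = 0;
    for (int i = 0; i < 4; ++i) {
        int byte = in.get();
        if (!in) return false;  // ran out of input before reading the tag
        seen = (seen << 8) | static_cast<std::uint32_t>(byte & 0xFF);
    }
    return seen == MAGIC_NUMBER;
}
```

A file that happens to begin with the same four bytes would still pass this check, which is why the tag alone cannot make the program impossible to fool.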
You'll need to provide more functions and decide on what private data members are needed.
Iterative Enhancement
Some steps that may be useful in developing the program are described below. It's important to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you'll find it easier to isolate bugs and you'll be more likely to get a program working faster.
In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character.
Every time a file is compressed, the count of the number of times that pseudo-EOF occurs should be one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
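A minimal sketch of the counting step follows. HUFF_ALPH_SIZE is redefined here only so the example stands alone (in your program it comes from globals.h). The name PSEUDO_EOF, the choice of index 256 for it, and the function countFrequencies are assumptions.

```cpp
#include <istream>
#include <sstream>
#include <vector>

const int HUFF_ALPH_SIZE = 257;             // value stated for globals.h
const int PSEUDO_EOF = HUFF_ALPH_SIZE - 1;  // assumed index for pseudo-EOF

// Count how often each 8-bit value occurs, then add the pseudo-EOF entry.
std::vector<int> countFrequencies(std::istream& in) {
    std::vector<int> counts(HUFF_ALPH_SIZE, 0);
    char ch;
    while (in.get(ch)) {
        // Convert through unsigned char so bytes above 127 index correctly.
        counts[static_cast<unsigned char>(ch)]++;
    }
    // Pseudo-EOF never appears in the input, so its count is set explicitly.
    counts[PSEUDO_EOF] = 1;
    return counts;
}
```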
When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You'll have to write these bits explicitly. These bits will be recognized by the program unhuff.cc and used in the decompression process. In particular, when using unhuff a well-formed compressed file will be terminated with the encoded form of the pseudo-EOF char (see code above). This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (think about this). In other words, when decompressing (unhuffing) you'll read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate. If reading a bit fails because there are no more bits (the bitreading function returns false) the compressed file is not well-formed.
In either case, you'll need to create and use a template.cc file with the proper class instantiation for each templated class used in your program (similar to what was done in the word-ladder assignment).
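The decoding loop described above can be sketched as follows. The Node struct, the decode function, and the value 256 for PSEUDO_EOF are assumptions, and a vector of bool stands in for repeated one-bit reads from the compressed file. The sketch also assumes the tree has at least two leaves, so the root is never itself a leaf.

```cpp
#include <string>
#include <vector>

const int PSEUDO_EOF = 256;  // assumed value for the pseudo-EOF character

// Hypothetical tree node: leaves hold a value, internal nodes have two children.
struct Node {
    int value;
    const Node* zero;  // child followed on a 0 bit
    const Node* one;   // child followed on a 1 bit
    bool isLeaf() const { return zero == nullptr && one == nullptr; }
};

// Decode bits until the pseudo-EOF leaf is reached.
// Returns false when the bits run out first, meaning the input is not well-formed.
bool decode(const Node* root, const std::vector<bool>& bits, std::string& out) {
    const Node* current = root;
    for (bool bit : bits) {
        current = bit ? current->one : current->zero;
        if (current->isLeaf()) {
            if (current->value == PSEUDO_EOF) return true;  // normal termination
            out += static_cast<char>(current->value);
            current = root;  // start over for the next character
        }
    }
    return false;  // ran out of bits before seeing pseudo-EOF
}
```

Any padding bits that follow the pseudo-EOF code are never examined, because the function returns as soon as that leaf is found.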
To see how the Readbits routine works, note that the code segment below is functionally equivalent to the Unix cat foo command --- it reads BITS_PER_WORD bits at a time (which is 8 bits as shown in bitops.h) and echoes what is read.
int inbits;
ibstream ibs("foo");
while (ibs.Readbits(BITS_PER_WORD, inbits)) {
    cout.put(inbits);
}

This code is similar to the loop shown below, which uses the standard function to read a char from cin.

char inbits;
while (cin.get(inbits)) {
    cout.put(inbits);
}
Note that although Readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:
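A minimal sketch of that mask, wrapped in a hypothetical helper named rightmostBit so it can be tried on its own (the handout's original snippet may differ):

```cpp
// Hypothetical helper: returns 1 if the right-most bit of value is set, else 0.
int rightmostBit(int value) {
    return value & 1;  // every bit except the right-most is masked off
}
```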
Alternatively, the function KthBit can be used to extract a specific bit from an int --- see the specification in bitops.h and note that the right-most bit is the first bit. You may find it useful in creating Huffman codes to use the shiftleft operator << and the bitwise or operator |. Be careful in using shiftleft (and shiftright) because of potential confusion with stream operators. If you fully parenthesize expressions, trouble can usually be avoided. An example program using the bitreading classes is provided for you to study; it is called bitread.cc.
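The two operators can be combined as in the sketch below. Both function names are hypothetical. In particular, bitAt only illustrates the right-most-bit-is-first numbering and is not the KthBit from bitops.h, whose exact signature is not reproduced here.

```cpp
// Append one bit (0 or 1) to the right end of a code being built.
// The parentheses keep << and | from being misread next to stream operators.
int appendBit(int code, int bit) {
    return (code << 1) | (bit & 1);
}

// Illustrative only: k == 1 selects the right-most bit of value.
int bitAt(int value, int k) {
    return (value >> (k - 1)) & 1;
}
```

Since leading zero bits do not change the value of an int, a code built this way needs its length stored alongside it; the codes 1 and 001 are otherwise indistinguishable.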
When using Writebits to write a specified number of bits, some bits may not be written because of some buffering that takes place. To ensure that all bits are written you should call Flushbits. Note that some buffering is done with Readbits as well, but you shouldn't need to worry about this. The function Flushbits only needs to be called once. If you call Flushbits more than once when writing the compressed file you will likely get erroneous bits when reading them back in.
The array argv is an array of C-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these char *, C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in filenames.cc. Then you can use "standard" C++ string functions to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style, char * strings.
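A small sketch of that conversion is given below. It is not the code from filenames.cc, and the names argsToStrings and compressedName are hypothetical.

```cpp
#include <string>
#include <vector>

// Copy argv[1..argc-1] into C++ strings; argv[0] (the program name) is skipped.
std::vector<std::string> argsToStrings(int argc, char* argv[]) {
    std::vector<std::string> args;
    for (int i = 1; i < argc; ++i) {
        args.push_back(std::string(argv[i]));
    }
    return args;
}

// Hypothetical helper: foo.cc becomes foo.cc.H, using + on C++ strings.
std::string compressedName(const std::string& inputName) {
    return inputName + ".H";
}
```

Once the arguments are C++ strings, comparisons such as args[0] == "-f" compare contents, whereas the same expression on two char * values would compare pointers.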
When your programs compile and produce the correct output, create a "README" file (please use all capital letters). Include your name, the date, and an estimate of how long you worked on the assignment in the "README" file. You must also include a list of names of all those people (students, prof, tas, tutor) with whom you consulted on the assignment. See the rules for collaboration in the CPS 100 syllabus.
Make sure both names are listed as authors in a comment at the top of each file and also in the README file.
To submit your programs electronically type (where file1 file2 ... are all the .cc and .h files needed by your program):
submit100 assign6 huff.cc unhuff.cc README Makefile file1 file2 ...
The Makefile should be able to make the executables huff and unhuff. You need to submit all .h and .cc files that are modified versions of any files given to you, plus any others you created.
You should receive a message telling you that the program was submitted correctly. If it doesn't work try typing ~rodger/bin/submit100 in place of submit100 above.