Lab 8: CPS100e: Unique Words

Change into your cps100e directory and create a lab8 subdirectory. Then change into this directory and copy files by typing what's below (don't forget the trailing dot).

      cp ~ola/cps100e/lab8/* .

Make a link to several text data files. To do this type

    ln -s ~ola/data data

Then you'll be able to use the directory data to access several big text and data files.

If you type "ls" you should see the directory data and the files Makefile, allwords.cc, and test.

Introduction

In this lab you will count the number of times each word occurs in a file, and also the number of unique words in the file. Words are converted to lowercase letters so the word Here and here will be counted as the same word. A word is consecutive non-white-space characters. So the word more. is considered a different word than the word more because the "." is considered part of the first word. For example, if the file test contains the following:

Here are a few words.
and here are a few more.
The output of the program is a list of words and the number of occurences of each word, followed by the number of unique words, the total number of words, and the time to insert all the unique words from the file into the data structure. Output for the file test would be:
enter name of file: test
here		2
are		2
a		2
few		2
words.		1
more.		1

# of unique words = 6
total words (non-unique): 10
total time for update = 0
For this lab, unique words will be stored in a noncircular doubly linked list. Each of the parts in this assignment will look at different algorithms for inserting and updating the linked list. The Node structure is given below.
struct Node
{
    string word;
    int count;       // number of occurences of a word
    Node * next;
    Node * prev;

    Node(string wrd, int cnt, Node * n = 0, Node * p = 0) // constructor
    {
         word = wrd;
         count = cnt;
         next = n;
         prev = p;
    } 
};

Part 1 - Insert at the end

Compile and run the program allwords.cc. This program inserts words into a doubly linked list in the following manner. If the word is already in the list, it increments the count field of the node containing the word. If the word is not in the list, the word is inserted at the end of the list.

For example, if the current list is (not showing counts or myLastNode):

              --------      -------      -------
   myList --> | here | <--> | are | <--> |  a  | --> ||
              --------      -------      -------

Then the word few is inserted at the end of the list.

              --------      -------      -------      -------
   myList --> | here | <--> | are | <--> |  a  | <--> | few | --> ||
              --------      -------      -------      -------
Run the program allwords on the three data files poe.txt, melville.txt, and hamlet.txt (takes a long time). For these three, type "data/" in front of the file name, e.g. "data/poe.txt". The output prints the first 30 words in the linked list, the number of occurences for each of these words, and some additional information. Put the number of unique words, total number of (non-unique) words, and the total time for update for each of these files in a README file.

Part 2 - Insert at the front

Make a copy of allwords.cc to allwords2.cc by typing:

            cp allwords.cc allwords2.cc

(You might want to remove the allwords executable file by typing rm allwords).

In allwords2.cc modify the Update function to the following algorithm. If the word is not in the list, the word is inserted at the front of the list. If the word is already in the list, increment the count field of the node containing the word.

For example, if the current list is:

              --------      -------      -------      
   myList --> | here | <--> | are | <--> | bug  | --> ||
              --------      -------      -------      
and the next word processed is few, then few is inserted at the front of the list.
              --------      --------      -------        -------
   myList --> | few  | <--> | here | <--> | are  | <--> | bug | --> ||
              --------      --------      -------        -------

Compile and run your program on the simple data file test to make sure it works correctly. Then run your program on the three data files poe.txt, melville.txt, and hamlet.txt (takes a long time). For these three, type "data/" in front of the file name, e.g. "data/poe.txt". The output prints the first 30 words in the linked list, the number of occurences for each of these words, and some additional information. Put the number of unique words, total number of (non-unique) words, and the total time for update for each of these files in a README file.

Part 3 - Insert at the end and Move to the front

Make a copy of allwords.cc to allwords3.cc by typing:

            cp allwords.cc allwords3.cc

(You might want to remove the allwords2 executable file by typing rm allwords2).

Modify the Update function in allwords3.cc to the following algorithm. If the word is not in the list, the word is inserted at the end of the list. If the word is already in the list, increment the count field of the node containing the word, and move the word to the front of the list. (Note: if the node is already the first node in the list, it does not need to move.) For example, if the current list is:

              --------      -------      -------        -------
   myList --> | here | <--> | are | <--> |  bug  | <--> | few | --> ||
              --------      -------      -------        -------
and the next word processed is bug , then bug is moved to the front of the list.
              -------      -------       -------      -------
   myList --> | bug | <--> | here | <--> | are | <--> | few | --> ||
              -------      -------       -------      -------
If the next word processed is the , then the is placed at the end of the list.
              -------      -------       -------      -------      -------
   myList --> | bug | <--> | here | <--> | are | <--> | few | <--> | the | --> ||
              -------      -------       -------      -------      -------

Compile and run your program on the simple data file test to make sure it works correctly. Then run your program on the three data files poe.txt, melville.txt, and hamlet.txt (takes a long time). For these three, type "data/" in front of the file name, e.g. "data/poe.txt". The output prints the first 30 words in the linked list, the number of occurences for each of these words, and some additional information. Put the number of unique words, total number of (non-unique) words, and the total time for update for each of these files in a README file.

Part 4 (2pts Extra Credit) - Creep to the front

Make a copy of allwords.cc to allwords4.cc by typing:

            cp allwords.cc allwords4.cc

(You might want to remove the allwords3 executable file by typing rm allwords).

In allwords4.cc modify the Update function to the following algorithm. If the word is not in the list, the word is inserted at the end of the list. If the word is already in the list, increment the count field of the node containing the word, and move the node one node closer to the front of the list.

For example, if the current list is:

              --------      --------      -------  
   myList --> | few  | <--> | here | <--> |  are  | --> ||
              --------      --------      -------  
and the next word processed is bug, then bug is inserted at the end of the list.
              --------      --------      -------        -------
   myList --> | few  | <--> | here | <--> |  are  | <--> | bug | --> ||
              --------      --------      -------        -------
If the next word processed is are, then are is moved one closer to the front of the list.
              --------      --------      -------        -------
   myList --> | few  | <--> | are  | <--> | here  | <--> | bug | --> ||
              --------      --------      -------        -------

Compile and run your program on the simple data file test to make sure it works correctly. Then run your program on the three data files poe.txt, melville.txt, and hamlet.txt (takes a long time). For these three, type "data/" in front of the file name, e.g. "data/poe.txt". The output prints the first 30 words in the linked list, the number of occurences for each of these words, and some additional information. Put the number of unique words, total number of (non-unique) words, and the total time for update for each of these files in a README file.

Submission

To turn in programs: When your modified programs work, you can submit them, along with a README describing how long it took you and what your impressions of the lab are. Type (where N is your section number: 1, 2, or 3):

      submit100e lab8secN README allwords2.cc allwords3.cc allwords4.cc
(leave allwords4.cc off if you did not do the extra credit)

The README file should include your name, how long you worked on the program, and anyone you received significant help from. You should also include any comments you have about the lab.

You should receive a message telling you that the programs were submitted correctly.