Assignment 1 -- grepple


Your goal is to write a program that
  1. reads files into some sort of data structure
  2. interactively accepts commands to query this data structure


Your program should exhibit the same behavior that this man page describes:

grepple - interactive search of file(s) for words

grepple [ -v ] [ -r ] file1 [ file2 ... ]

grepple reads the given files/directories and allows the user to interactively query the resulting database using any of the commands listed in USAGE below. For example (grepple! is the grepple prompt):
> grepple jabberwocky.txt alice.txt
grepple! find brillig
grepple! quit

-v (verbose) Indicates that the program should also print the line where the word occurred. In the above example:
> grepple -v jabberwocky.txt alice.txt
grepple! find brillig
jabberwocky.txt,5:'Twas brillig and the slithy toves
jabberwocky.txt,35:'Twas brillig and the slithy toves
grepple! quit

-r (recursive) Indicates that any directory named on the command line should be searched recursively for subdirectories, subsubdirectories, and so on. The default behavior is for directories to be searched for files, but not recursively searched.

Here are the commands grepple accepts:
find word
Find word in the database, and print out the results. A word is defined as any whitespace-delimited sequence of characters, and there is no substring matching. Therefore, "find they" will not match the word "they'll".
Multiple occurrences of a word in a line will be printed only once.
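
To make the matching rule concrete, here is a minimal sketch of one possible layout, in Python: a single table with one entry per word, mapping each file to the set of line numbers where the word appears. The class name WordIndex and its methods are hypothetical, line numbers are assumed to start at 1, and this is only one of many designs you might choose.

```python
from collections import defaultdict


class WordIndex:
    """Hypothetical in-memory index: word -> {filename -> set of line numbers}."""

    def __init__(self):
        self._postings = defaultdict(dict)

    def add_lines(self, filename, lines):
        # Assumption: line numbers are 1-based.
        for lineno, line in enumerate(lines, start=1):
            # str.split() with no argument splits on any run of whitespace,
            # which matches the definition of a word given above.
            for word in line.split():
                # A set records each line once, even if the word repeats on it.
                self._postings[word].setdefault(filename, set()).add(lineno)

    def find(self, word):
        # Exact lookup only, so "they" never matches "they'll".
        hits = self._postings.get(word, {})
        return [(name, n) for name in sorted(hits) for n in sorted(hits[name])]
```

Because lookup is by exact key, the "no substring matching" rule falls out of the data structure rather than needing a special case.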

read file1 [ file2 ... ]
Read the given files, and insert them into the database. Directories are flat-searched, i.e., only the files directly within the directory are inserted into the database.

recread file1 [ file2 ... ]
Read the given files, and insert them into the database. Directories are recursively searched (as with the -r option).

unread file1 [ file2 ... ]
Delete the given files' info from the database. Directories are flat-unread: only the files directly within the directory are deleted, and subdirectories are left untouched.

recunread file1 [ file2 ... ]
Delete the given files' info from the database. Directories are recursively unread.
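
The flat and recursive variants of read and unread differ only in how a directory argument is turned into a list of files. A minimal sketch of that step, in Python, might look like the following. The function name expand_paths and its recursive parameter are hypothetical, and the sketch deliberately does no error checking on names that do not exist.

```python
import os


def expand_paths(paths, recursive=False):
    """Return the files named by paths, expanding directories flat or recursively."""
    files = []
    for path in paths:
        if not os.path.isdir(path):
            # Plain file name (or a name that does not exist): pass it through
            # and let the caller report any problem when opening it.
            files.append(path)
        elif recursive:
            # os.walk visits the directory and every directory below it.
            for root, _dirs, names in os.walk(path):
                files.extend(os.path.join(root, name) for name in sorted(names))
        else:
            # Flat search: only regular files directly inside the directory.
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                if os.path.isfile(full):
                    files.append(full)
    return files
```

With a helper like this, read and recread (and unread and recunread) can share everything except the value of one flag.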

set option
Set the given option. The only currently supported option is verbose.

unset option
Unset the given option. See the set command for supported options.
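
Since set and unset are mirror images, one small routine can serve both. The sketch below is in Python; the name change_option and the wording of the error message are assumptions, not part of the specification.

```python
SUPPORTED_OPTIONS = {"verbose"}  # the only option the man page lists


def change_option(options, name, enabled):
    """Turn one option on or off in the set `options`.

    Returns None on success, or a short error message (hypothetical wording)
    when the option is not supported.
    """
    if name not in SUPPORTED_OPTIONS:
        return f"{name} is not a supported option"
    if enabled:
        options.add(name)
    else:
        # discard() does nothing if the option was already off.
        options.discard(name)
    return None
```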

quit
Quit grepple.
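
One way to keep the command set easy to extend is a table that maps command names to handlers. The following Python sketch shows the idea under some assumptions: parse_command and run_session are hypothetical names, handlers are assumed to take a list of arguments and return a list of output lines, and the "unknown command" message is not specified by the man page.

```python
def parse_command(line):
    """Split one input line into (command, arguments); (None, []) if blank."""
    parts = line.split()
    if not parts:
        return None, []
    return parts[0], parts[1:]


def run_session(lines, handlers, out):
    """Feed each input line to its handler until 'quit' is seen."""
    for line in lines:
        name, args = parse_command(line)
        if name is None:
            continue  # ignore blank input
        if name == "quit":
            break
        handler = handlers.get(name)
        if handler is None:
            # Assumed wording; the specification does not define this message.
            out.append(f"unknown command: {name}")
        else:
            out.extend(handler(args))
    return out
```

Adding a new command then means adding one entry to the table, with no change to the loop itself.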

In addition to following the above specification, your program should be robust, i.e. deal with errors in a calm and reasonable manner. For example, if a file that the user specifies doesn't exist, acceptable behavior does not include dumping garbage to the screen, segfaulting, or berating the user mercilessly. Gentle admonishment of the form "jabberwock.txt does not exist" is sufficient.
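
As an illustration of calm error handling, the Python sketch below turns a failed open into a one-line message instead of a crash. The function name read_lines, the (lines, error) return convention, the choice of UTF-8 with replacement of undecodable bytes, and the "cannot be read" wording are all assumptions. Only the "does not exist" message comes from the paragraph above.

```python
def read_lines(filename):
    """Return (lines, error); exactly one of the two is None."""
    try:
        # errors="replace" keeps odd bytes from aborting the read.
        with open(filename, encoding="utf-8", errors="replace") as handle:
            return handle.read().splitlines(), None
    except FileNotFoundError:
        return None, f"{filename} does not exist"
    except OSError:
        # Covers directories, permission problems, and other I/O failures.
        return None, f"{filename} cannot be read"
```

The caller can print the message and carry on with the next file or the next command.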


You will be asked to extend this program in a later assignment, and perhaps add a graphical user interface, or port it to some other language. With this in mind, design will be a large part of the grade for this assignment. You should think carefully about the classes you will need. The design should be able to accommodate requirements for fast queries or for minimal memory usage. This might require re-implementing classes, but not a redesign of the program.

You will want to think carefully about what types of data structures you will use to hold the information that you need.

For example, will you use one hash table to hold all the word information? Will each word have one entry in the hash table, or will each occurrence of a word have an entry? Or will there be one entry per word per file? Or do you want a separate hash table per file?

Or ... More questions: Do you store the text of the lines in your database, or do you go search the files on disk on the occasions that the user asks for verbose output? There are a number of tradeoffs that may influence your decisions. First of all, certain schemes will use more memory than others, and this may affect the maximum number of words/files you can handle. On the other hand, certain schemes will provide faster lookups at the expense of higher memory usage or bigger startup costs (i.e. when reading in a file). One method may make it easy to perform the "unread" operation, but make it harder to do word lookups.
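
The question of whether to keep line text in memory or re-read it from disk is a good example of a choice that can hide behind an interface. The Python sketch below shows two interchangeable classes. All names here (LineSource, InMemoryLines, OnDiskLines) are hypothetical, and the sketch is meant to illustrate the tradeoff, not to recommend either side.

```python
from abc import ABC, abstractmethod


class LineSource(ABC):
    """Hypothetical interface: supplies the text of a line for verbose output."""

    @abstractmethod
    def add_file(self, filename, lines):
        """Called once when a file is read into the database."""

    @abstractmethod
    def text(self, filename, lineno):
        """Return the text of the given 1-based line."""


class InMemoryLines(LineSource):
    """Fast lookups, but every line of every file stays in memory."""

    def __init__(self):
        self._lines = {}

    def add_file(self, filename, lines):
        self._lines[filename] = list(lines)

    def text(self, filename, lineno):
        return self._lines[filename][lineno - 1]


class OnDiskLines(LineSource):
    """Stores nothing; re-reads the file each time a line is requested."""

    def add_file(self, filename, lines):
        pass  # nothing to remember

    def text(self, filename, lineno):
        with open(filename, encoding="utf-8", errors="replace") as handle:
            for current, line in enumerate(handle, start=1):
                if current == lineno:
                    return line.rstrip("\n")
        raise IndexError(f"{filename} has no line {lineno}")
```

The rest of the program would talk only to LineSource, so switching from one strategy to the other means re-implementing a class, not redesigning the program. Note that the on-disk version gives stale or wrong answers if the file changes after it was indexed.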

To further influence your decision, here are some example extensions that you may be asked to implement in the future:

save indexfile
Save the current database to file indexfile.

load indexfile
Load a grepple database from the file indexfile.
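
If your database can be expressed as plain dictionaries and lists, save and load can be quite short. The Python sketch below assumes a layout of word -> {filename -> list of line numbers} and uses JSON as the index file format. Both the layout and the format are assumptions made for illustration; the assignment does not prescribe either.

```python
import json


def save_index(postings, indexfile):
    """Write the index (word -> {filename -> [line numbers]}) as JSON."""
    with open(indexfile, "w", encoding="utf-8") as handle:
        json.dump(postings, handle)


def load_index(indexfile):
    """Read an index previously written by save_index."""
    with open(indexfile, encoding="utf-8") as handle:
        return json.load(handle)
```

A design that stores sets or custom objects would need a conversion step before saving, which is one more thing to weigh when choosing your data structures.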

List the names of files in the current database.

Give some statistics about the current database.

grep pattern
Find all words that match pattern (where pattern is a regular expression).
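
A grep extension would have to examine every distinct word in the database, which is worth bearing in mind when you pick a layout. A minimal Python sketch follows. The function name grep_words is hypothetical, and it assumes that the pattern must match the whole word (in keeping with the "no substring matching" rule for find) and that a malformed pattern simply yields no results.

```python
import re


def grep_words(words, pattern):
    """Return the words that match pattern in full, in sorted order."""
    try:
        regex = re.compile(pattern)
    except re.error:
        # Assumed behavior: a bad pattern matches nothing rather than crashing.
        return []
    # fullmatch requires the entire word to match, not just a substring.
    return sorted(word for word in words if regex.fullmatch(word))
```

If words are the keys of a single table, this is one pass over the keys. A design with one entry per occurrence would have to do more work to avoid testing the same word many times.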

There is no one right answer. Obviously, it is impossible to address all of the above concerns in the most satisfying manner. The idea is to design your code in such a way that it is easily maintainable, easily extensible, and readable.

Things to keep in mind

The output of this program is specified in the man page above. Your program should act exactly as described. For example, the output of a find command should be a series of lines, each containing the filename, followed by a comma, followed by the line number (no spaces). If the -v option was given on the command line, or if the "verbose" option was set, the line number should be followed by a colon, followed by the text of the line itself. The only thing you may modify is the appearance of the grepple prompt.
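
The output format described above is simple enough to capture in one function. The Python sketch below uses a hypothetical name, format_hit, and follows the format shown in the man page examples.

```python
def format_hit(filename, lineno, text="", verbose=False):
    """Format one result line of a find command."""
    if verbose:
        # Verbose: filename, comma, line number, colon, then the line itself.
        return f"{filename},{lineno}:{text}"
    # Default: filename, comma, line number, with no spaces.
    return f"{filename},{lineno}"
```

Keeping the formatting in one place makes it easy to check your output against the specification character for character.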

Grading criteria for this assignment will be based on:

If there is anything unclear about the specification above, no matter how trivial, post to the newsgroup, and request clarification. When in doubt, ask.

Timeline, Due Dates

Design due Monday, September 9
Prototype Friday, September 13
Final Project due Monday, September 16

An explanation of what to turn in for the final deliverables is available.

Simple Grepple Code Example

A simple example using just one file to track line numbers is accessible (on Duke see the class newsgroup and the class directories for information). The code is also accessible here: *