Advance Usage

Command-Line Options

RA provides a number of command-line options. To see a complete list, run RA with command-line flag -h (for help). Some of the most useful ones are listed below.

  • Use -c (--configfile) config_file to specify a RA user configuration file. Among other things, the user configuration file is useful for telling RA how to connect to your own database server. See Configuration File for more details.

  • Use -v (--verbose) to have RA show more information. This flag is useful for debugging, as RA will show you a tree representation of the RA query you entered. Use -d (--debug) if you want to see even more information, e.g., the SQL translation of your RA query.

  • Use -i (--inputfile) input_file to specify a file of statements for RA to run. In this case, RA runs in non-interactive mode (you won’t get the RA prompt), and simply print out the output from executing the statements. Here, to get a more complete record of the execution, you can additionally use -e (--echo) to tell RA to echo all statements it reads to the output as well.

  • Use -o (--outputfile) output_file to specify a file to which RA will write its output (in addition to printing that out). Together with -e, -o will give you a complete record of your RA session, which is especially useful for recording what you did in an interactive session.

  • The optional source argument (without a flag) specifies the data source you want to connect to. It can be the name of a section in your user configuration file (under which RA can find detailed connection information), or the name of the database you are connecting to (assuming default for other connection information). See Configuration File for details.

Data Types and Operators

RA knows about five basic value types: boolean, number, string, date, and datetime. It doesn’t distinguish integers vs. floats, or fixed- vs. variable-length strings.

Note

RA also uses two special types unknown and any; see Type Checking below for how they are useful.

Literals

As with SQL, string literals are enclosed in single quotes, e.g., 'Amy'. If you need to include the single-quote character itself in a string literal, use two in a row, e.g., 'Alivia''s Durham Bistro'.

You can enter date and datetime literals as strings (e.g., '1975-01-01' and '2017-06-01 21:00:00'); RA relies on the underlying database system to convert them implicitly into dates and datetimes (see Type Checking below).

Operators and Functions

  • Comparison operators <, <=, = (equal), <> (not equal), >=, and > work for pairs of numbers, strings, dates, or datetimes, and return a boolean.

  • Arithmetic operators +, -, *, and / work on a pair of numbers, and return a number.

  • Concatenation operator || works on two strings and returns a string.

  • Boolean operators and, or, and not work on boolean values and return a boolean value.

  • As in SQL, operator like matches a string against a pattern (represented as a string). For example, bar.name like 'A%' returns true if bar.name starts with “A” (% in a pattern matches any sequence of 0 or more characters).

The list above is not exhaustive. For more details, please refer to the declarations in the default_functions setting in the default configuration file (this file is not meant to be modified by users); see Specification of Built-in Functions for the format of these declarations.

In addition to the operators above, RA also supports functions with the standard syntax of func_name(arg_1, arg_2,). For example, RA on top of PostgreSQL understands now() (which returns the current datetime) and ceil(1.618) (which returns the ceiling of 1.618). Despite the SQL standard, different database vendors support different SQL functions. For the most part, RA simply translates these function calls verbatim to SQL. However, you can declare a list of functions supported by a specific database vendor in a configuration file (again, see Specification of Built-in Functions for details) so RA can provides additional type-checking (see Type Checking below).

Type Checking

RA assumes that an operator or function takes a number of values as input arguments, and returns a single value as output. Required arguments (possibly none) always come first, followed by optional ones (possibly none). Finally, a function can be declared to take an arbitrary number of additional arguments beyond those listed; these additional arguments all must have the same type as the last one listed.

Here are some examples (see Specification of Built-in Functions for exact syntax and where to specify them):

now() -> datetime

This function takes no input arguments and returns a single datetime value.

substr(string, number, number?) -> string

This function takes a string and a number as its required arguments, and another number as its optional third argument, as indicated by the ? suffix; the function returns a string.

greatest(number, number, number*) -> number

This function takes two or more numbers as input arguments, and returns a number. Here, * indicates zero or more occurrences.

foo(date, string?, number?, any*) -> number

This toy example specifies a function with a required date input argument, optionally followed by a string, a number, and then any number of additional parameters with any type. Here, any is special type to which all data types conform to.

On a high level, RA type-checks each use of operator/function in a query against known declarations. Several rules are worth noting.

Implicit type conversions

RA assumes that the underlying database system can automatically convert datetime to date values, and convert string to date and datetime values. Therefore, RA is able to deduce that drinker.dob <= '1975-01-01' (where attribute drinker.dob has type date) is a valid comparison, because '1975-01-01' can be converted to a date. If the string value doesn’t translate to a valid date, RA relies on the underlying database system to catch the error at run time.

Positional arguments

RA assumes that arguments are “positional.” In other words, for each argument type listed in a declaration, RA expects an argument of a conforming type at the same position. Consequently, if you want to supply a optional argument, then all arguments preceding it in the declaration (including optional arguments) must be supplied. For example, foo('1975-01-01', 'Watergate', 1) conforms to the declaration of foo above, while foo('1975-01-01', 1) does not.

Polymorphic operators/functions

In general, RA allows each operator/function to have multiple signatures. Based on the order in which they appear in configuration files (see Specification of Built-in Functions for details), these declarations are organized as a list. To check an invocation, RA walks down this list. As soon as a RA encounters a declaration that this invocation’s input arguments conform to, RA considers the invocation to be correct and assumes that its return type is the one associated with that declaration.

Warning

An implication of this overly simplistic rule is that the order of declaration matters, unfortunately, sometimes in subtle way. For example, in the sample configuration file for PostgreSQL, declarations for + are listed in the following order:

PLUS(number, number) -> number
PLUS(date, number) -> date      # here number means the number of days
PLUS(number, date) -> date      # ditto

Suppose, for the sake of argument, that an implicit conversion from number to date is possible (it is not in RA). Then, had we instead listed PLUS(date, number) before PLUS(number, number), RA would infer that 1+2 returns a date!

A perhaps better alternative for a future version of RA would be a picking the declaration that “best fits” the given invocation (i.e., requiring the least amount of implicit type conversion).

Unrecognized functions/SQL types

If RA encounters a function f that it doesn’t know about, it assumes that f can take any number of arguments of any type, and that f returns a value of special type unknown. Type unknown is also used when RA encounters an exotic SQL type that it doesn’t know (e.g., a user-defined type for some database column). RA assumes Type unknown can be implicitly converted to any other type. This rule is extremely lenient by design, because different database vendors support many, many different built-in functions that would be impossible for RA to track. This rule essentially allows RA to pass on such expressions to the underlying database system, which will eventually check them in SQL.

Writing Queries

Relation Schema and Attribute References

Note

Before delving into details, let’s start with a bit of motivation. The relational model in its purest form requires distinct attribute names within each relation (because a tuple is considered as a function from attribute names to values), including result relations returned by relational algebra expressions; therefore, each attribute can be referenced uniquely by its name. In practice, however, to simplify expressions, RA (as well as most textbooks and courses that cover relational algebra) allows duplicate attribute names and, in case of confusion, allows them to be distinguished by prefixing them with names of relations where they come from, e.g.:

Drinker \join_{Drinker.name=Frequents.drinker} Frequents;

\select_{Drinker.name=Frequents.drinker}
  (Drinker \cross Frequents);

However, the notion of “relations where they come from” needs clarification when inputs to an operator are themselves complex queries. Moreover, it is possible for attributes to become indistinguishable even if we qualify them with relation names, e.g., in the output of Drinker \cross Drinker;.

Furthermore, the projection operator in relational algebra is often extended to allow output attributes computed by expressions. It is not always obvious how to name such output attributes; as a result, it may become impossible to refer to refer to such an attribute by name.

Therefore, we need to clarify how relation schema and attribute references work in RA.

In RA, a relation can a stored database relation, or the result of executing a relational algebra query (or subquery). The schema of the relation is a list of attribute specifications, while each attribute specification is a triple (rel_name, attr_name, attr_type). Intuitively, rel_name is the name of the relation where this attribute originally comes from. Both rel_name and attr_name can be optional. Within a relation, there is no requirement that attr_names, or even (rel_name, attr_name) pairs, are unique.

Some relational operators allow you to write expressions involving references to attributes in input relations. An attribute reference must be in one of the following forms:

  • attr_name (just the attribute name by itself);

  • rel_name.attr_name (attribute name prefixed by relational name).

For an attribute reference in an operator to be valid, there must be exactly one attribute specification in the input relation schema(s) matching the reference. If there are more than one matching attribute specifications (possibly from different input relation schemas), the attribute reference is ambiguous.

For a stored database relation, RA uses its column names and types for the attribute specifications (mapping SQL data types to RA ones as appropriate; see Data Types and Operators); the rel_names get the name of the database table.

The section on Relational Algebra Operators will further spell out how each relational operator generates its output relation schema from the input relation schema(s). In general, attribute names may no longer stay unique, and you may lose the ability to distinguish or reference certain attributes.

If you turn on the verbose option (-v; see Command-Line Options) when running RA, RA will print out an operator tree for your query and show the schema for each result relation (intermediate or final).

Relational Algebra Operators

This section will not define what these operators do on data—see Basic Usage or refer to standard textbooks. Here, we instead focus on issues more specific to RA.

Selection: \select_{condition} input_relation

Here, condition must have type boolean. The output relation schema is the same as the input one.

Projection: \project_{attr_list} input_relation

Here, attr_list is a comma-separated list of expressions that specifies the output attributes. The output relational schema is the list of output attribute specifications, ordered according to attr_list. For an output attribute, if its expression is simply an attribute reference, its specification will be the same as the specification of the input attribute being referenced. Otherwise, the output attribute specification will have no relation name or attribute name, and its type will be what RA infers for the expression.

Theta-Join: input_relation_1 \join_{cond} input_relation_2

Here, condition must have type boolean. The output relation schema is the concatenation of the two input relation schemas. RA will warn if some attribute from the first input can be potentially confused with some attribute from the second input, which may create a problem when you want to refer to them later in the join output. (Here, RA only warns about ambiguity caused by the join; if either input already contains ambiguously named attributes by itself, they would have been caught and reported earlier.)

Natural join: input_relation_1 \join input_relation_2

RA looks for matches of pairs of attributes—one from each input—with identical (and non-optional) names. RA will generate an error if an attribute is involved in more than one match, or the two attributes involved in a match have types that cannot be equated. If no matches are found, natural join degenerates into cross product, and RA will generate a warning.

For each pair of matching attributes, RA equates them in the join. The output relation schema consists of all attribute specifications from input_relation_1, followed by those from input_relation_2 that are not involved in any match. Consequently, you can still refer to a join attribute by name in the output relation, but prefixing it with the name of input_relation_2 won’t work any more (incidentally, prefixing it with the name of input_relation_1 would still work, but it is not recommended).

Cross product: input_relation_1 \cross input_relation_2

The output relation schema is the concatenation of the two input relation schemas. Again, as with the case of theta-join, RA will warn if some attribute from the first input can be potentially confused with some attribute from the second input.

Set union, difference, and intersection:

input_relation_1 \union input_relation_2

input_relation_1 \diff input_relation_2

input_relation_1 \intersect input_relation_2

Two input relations must have the same number of attributes and every pair of corresponding attributes must have identical types (which is a stronger condition than being able to equate them). RA will warn if some pair of corresponding attributes have different names.

The output relation schema consists of all attribute specifications from input_relation_1. Consequently, you can still refer to attributes by their names from the first input relation, but you lose the ability of referring them by names from the second input relation. (Allowing the later would necessitate remembering multiple possible names for each output attribute, which can get quite confusing; for this reason, RA adopts the convention of always going with the first input relation.)

Rename:

\rename_{new_attr_names} input_relation

\rename_{new_rel_name : *} input_relation

\rename_{ new_rel_name : new_attr_names } input_relation

The output relation schema consists of all attribute specifications from input_relation. If new_rel_name is given, the relation name for all attribute specifications will set to new_rel_name; otherwise the relation name will be unset (the relation names from the input schema will be forgotten). If new_attr_names is given, it will be used to set the attribute names in the output attribute specifications; otherwise the attribute names from the input relation will be retained.

Aggregation and grouping:

\aggr_{aggr_attr_list} input_relation

\aggr_{group_by_attrs: aggr_attr_list} input_relation

Here, group_by_attrs is a comma-separated list of expressions used for grouping tuples in input_relation. Most commonly, they are simply attributes from input_relation. All tuples that agree on the values of group_by_attrs go into the same group. In the form without group_by_attrs, all tuples in input_relation will be in one single group.

aggr_attr_list is a comma-separated list of expressions, each of which must evaluate unambiguously to one value for each group. If an expression in aggr_attr_list references a non-group-by attribute, that attribute must (eventually) be input to an aggregate function, which will compute a single value over all tuples in each group.

The output attributes include group_by_attrs followed by aggr_attr_list, in the order of specification. For an output attribute, if its expression is simply an attribute reference, its specification will be the same as the specification of the input attribute being referenced. Otherwise, the output attribute specification will have no relation name or attribute name, and its type will be what RA infers for the expression.

Views

RA lets you define “views,” which may be thought of as temporary, relation-valued variables holding the results of relational algebra expressions. To define a view, use the syntax:

view_name :- view_definition_query;

Once you define a view, it will show up in \list, and you can use it later in queries (including queries that define other views) as if it’s a database relation. Whenever RA evaluates a query, it basically “expands” any view reference in the query by replacing the view reference with the corresponding view definition query. If that definition query itself contains other view references, RA will expand them recursively. In the end, RA always evaluate a query starting from database relations. The output relation schema from a view reference is the output relation schema for the corresponding view definition query, but with all relation names in all attribute specifications set to the view name.

To undefine a view (i.e., to forget its definition), use the command \clear view_name;. RA won’t let you undefine a view if it is used to define other views. However, you can add ! after \clear to undefine view_name as well as every view dependent on it (directly or indirectly) “by force.” You can also use \clear * to undefine all views.

RA allows you to redefine a view. It will even let you redefine a view v when there are other views defined using v, provided that all (directly or indirectly) dependent view definitions still make sense with the redefined v, and that this redefinition doesn’t lead to any circular definition.

Finally, you can save view definitions to files and load them back into an RA session. The command \save view_name 'file_name'; saves the definition of the given view as well as those of all prerequisite views in the specified file; note that file_name need be surrounded in single quotes. If you do not specify the file name, the default file name will be view_name.ra. You can also use * in place of view_name; in this case RA will save all your views in a file whose name defaults to views.ra. RA will not let you overwrite an existing file, unless you add ! after \save.

Warning

Once you exit from RA, you lose all views you have defined in the session. So make sure you save what you want before exiting!

Other Features

This section describes other useful RA features.

Comments:

RA supports C/C++/Java-style comments. // starts a single-line comment; text following // is considered comment and will be ignored. /* and */ start and end (respectively) a possibly multi-line comment; text between them is considered comment and will be ignored.

Command-line history editing:

On systems that support the GNU Readline, RA provides command-line input history and editing using arrow keys. For example, Up/Down recall previous/next lines, and Left/Right move within the current line.

Executing a script: \source 'ra_file';

This command makes RA read statements from the specified file and execute them. Note that ra_file must be enclosed in single quotes. The file should be just a simple text file containing RA statements and comments. This file can be prepared manually with a text editor, or it can be the result of a \save command.

Executing SQL: \sqlexec_{ sql_statement };

With this command, you can send a SQL statement to the underlying database. RA (and relational algebra) doesn’t have its own language constructs for data definition (such as CREATE TABLE in SQL) or data modification (such as INSERT, DELETE, and UPDATE in SQL), this feature conveniently allows you to do all that without leaving RA. Note that sql_statement should NOT be terminated by ; (instead, ; should terminate the RA command, following }). Some examples of using \sqlexec can be found in the .ra file for creating the sample database.

Configuration File

RA relies on a system configuration file, which you should not modify normally. Then, RA looks for a user configuration file named .radb.ini in your home directory (the definition of “home directory” varies across operating systems). Alternatively, you can specify a user configuration file using the -c option (see Command-Line Options).

The most important use of your user configuration file is to tell RA how to connect to your own database server. Some examples are given in files with .ini suffix in the sample directory, e.g., postgresql.ini. You can copy one of these files to the default user configuration file location and modify it to your liking. See comments in these files for instructions on how to set appropriate values for your database system.

The user configuration file is divided into sections. The first section should be named [DEFAULT]. Subsequent sections can provide additional customization that overrides the default. For example, with the sample postgresql.ini, you will by default connect to PostgreSQL server running locally on your computer, and you will need to give the name of the database to connect to when you run RA (see Command-Line Options).

Note

RA only installs the SQLite driver by default, so if you want to use it for other database systems, you will first need to install the appropriate drivers yourself. For example, to install the PostgreSQL driver, simple use the command pip install psycopg2 on your system (or pip3 if Python 3 is not the default Python version on your system).

With the configuration file, you can do more. Suppose you set db.database under [DEFAULT], and add two more sections as follows:

[DEFAULT]
# ... other settings remain the same...
db.database=beers
# ... other settings remain the same...

[play]
db.database=pokemongo

[production]
db.username=me
db.password=buzzword
db.host=my.server.com
db.port=5432
db.database=mydb

Then, running RA without a source argument will connect you to the beers database on the local server by default; running RA with play as the source argument will connect you to the local pokemongo database (in this case, using pokemongo as source would achieve the same effect); finally, running RA with production as source will connect you to the mydb database running on the remote server, using the credentials supplied under the [production] section.

Another use of the configuration file to declare to RA useful built-in functions support by your database system. This use is discussed further below.

Specification of Built-in Functions

Specifying a built-in function in the configuration file makes RA aware of its signature and allows RA to check its input and out output types (see also Type Checking), which helps catch errors in queries. In fact, RA specifies all operators and functions that are known to work across database vendors in its system configuration file under the setting for default_functions. Additionally, RA looks for a setting for functions in the user configuration file, which can supplement or override specifications in default_functions.

In both settings, each function is declared with a single line of text, with the following format:

fname (req_t1, req_t2,, opt_t1?, opt_t2?,, opt_tlast*) -> return_t

Here, fname is the name of the function; upper-case function names in the system configuration file are reserved for special operators. The (possibly empty) sequence req_t1, req_t2, … specifies the types of required arguments. The (possibly empty) sequence opt_t1, opt_t2, …, each suffixed by ?, specifies the types of optional arguments. The optional opt_tlast suffixed by * specifies that the function can take an arbitrary number of additional arguments of the given type at the end. Finally, return_t specifies the result type of the function. These types can be any of the basic value types supported by RA, as well as any (see Data Types and Operators).

You can also specify a function as an aggregate, by prefixing the declaration line with keyword aggregate: (before fname). An aggregate function takes a (multi)set of input tuples, each with attribute(s) matching the declared argument specification, and computes a single output value of the declared return type. These functions can be used in the aggregation operator \aggr (see Basic Usage).

Comments can be added with #; text following # will be ignored. There cannot be empty lines between function declarations. The order of declarations is important; see Type Checking for details.

Warning

Recall from discussion in Type Checking that in general, a function can have multiple signatures. Suppose your user configuration file gives (at least) one declaration for an operator/function f already declared in the system configuration file. In this case, RA assumes that you intend to wipe out all system default declarations for f, and would like to use only your own declarations for f. (This behavior is necessitated by RA’s simple, order-based resolution discussed in Type Checking, because we need a way for user configuration to customize the order of declarations.) An implication of this rule is that you must remember to declare all alternative signatures for f in your user configuration file.