Overview of the relational algebra and SQL [c r e s c e n t]

Overview of the relational algebra and SQL

Last Modified: 2017-06-17

Most of this page is just notes on Data Manipulation at Scale: Systems and Algorithms.
Databases: SQLite, MariaDB, MySQL, PostgreSQL
Tutorials: SQL tutorial, SQL Reference, Tutorial, PostgreSQL Tutorial
This overview is obviously incomplete.

Relational Algebra and SQL

Relational Algebra is an algebra on tables and consists of the following operators

Union $\cup$ , Intersection $\cap$ , Difference $-$ ,
Selection $\sigma$ , Projection $\Pi$ ,
(Natural) Join $\bowtie$ .

Note that the algebra is closed under the above operators, i.e. the output is always a table.

The relational algebra equipped with the following operators is called extended relational algebra.

Duplicate Elimination $d$ ,
Grouping and Aggregation $g$ ,
Sorting $t$ .

Union, Intersection and Difference

R1 =

A	B
a1	b1
a2	b1

R2 =

A	B
a1	b1
a3	b4

R1 ∪ R2 =

A	B
a1	b1
a2	b1
a1	b1
a3	b4

R1 ∩ R2 =

A	B
a1	b1

R1 − R2 =

A	B
a2	b1

SELECT * FROM R1 UNION ALL SELECT * FROM R2; : the union R1 $\cup$ R2
SELECT * FROM R1 INTERSECT SELECT * FROM R2; : the intersection R1 $\cap$ R2
SELECT * FROM R1 EXCEPT SELECT * FROM R2; : the difference R1 $-$ R2

Claim: R1 $\cap$ R2 = R1 $-$ (R1 $-$ R2). The RHS is obtained by

SELECT * FROM R1 EXCEPT SELECT * FROM (SELECT * FROM R1 EXCEPT SELECT * FROM R2);

Selection and Projection

S =

id	name	score
100	John	70
101	Smith	50
102	Fred	90

σ_score≥60(S) =

id	name	score
100	John	70
102	Fred	90

∏_name,score(S) =

name	score
John	70
Smith	50
Fred	90

Selection

Pick rows satisfying a certain condition $c$. The result is written as $\sigma_c(R)$. Here $c$ is a map sending a row into a boolean value.

SELECT * FROM S WHERE score >= 60;

We use WHERE c for the selection $\sigma_c$ , not SELECT.

= is used for an equation. Not ==.
The condition score >= 50 AND score <= 70 can be described simpler with the BETWEEN operator.
```
SELECT * FROM S WHERE score BETWEEN 50 AND 70;
```
For a condition $c$ we can use LIKE operator to find a pattern of a strings.
```
SELECT name FROM S WHERE name LIKE '%h_';
```
Here % means one or more characters and _ means a single character.
Instead of LIKE we can use the regular expression with REGEXP operator.
```
SELECT name FROM S WHERE name REGEXP '^[aeiouAEIOU]';
```
For a condition $c$ we can use IN operator to specify multiple values in a column.
```
SELECT * FROM S WHERE score IN ('50','90');
```
To obtain the first n rows in a table, use LIMIT.
```
SELECT * FROM S LIMIT 50;
```
Use HAVING instead of WHERE when using an aggregate function.
```
SELECT * FROM S 
```

Projection

Restrict a table to the specified columns, say A1, ... , An. The result is written as $\Pi$_A1,...,An(R).

SELECT name,score from S;

We use SELECT name,score for the projection $\Pi$_name,score.

Join and Cross Product

S =

id	name	score
100	John	70
101	Smith	50
102	Fred	90

S2 =

name	hight
Fred	165.0
John	181.0
Smith	172.0

S × S2 =

id	name	score	name	hight
100	John	70	Fred	165.0
100	John	70	John	181.0
100	John	70	Smith	172.0
101	Smith	50	Fred	165.0
101	Smith	50	John	181.0
101	Smith	50	Smith	172.0
102	Fred	90	Fred	165.0
102	Fred	90	John	181.0
102	Fred	90	Smith	172.0

The cross product R1$\times$R2 (a.k.a. the cross join) gives the all possible pair of a row of R1 and a row of R2. The cross proect is obtained by SELECT * FROM S,S2;.

The equi-join R1 $\bowtie$_A=B R2 is defined by $\sigma$_A=B(R1$\times$R2). Therefore S $\bowtie$ _{S.name=S2.name} S2 is the following table

id	name	score	name	hight
100	John	70	John	181.0
101	Smith	50	Smith	172.0
102	Fred	90	Fred	165.0

and we obtain it by SELECT * FROM S,S2 WHERE S.name = S2.name; by definition, or

SELECT * FROM S JOIN S2 ON S.name = S2.name;

The theta-join R1 $\bowtie_\theta$ R2 is an easy extension of the equi-join and defined by $\sigma_\theta$(R1$\times$R2).

Outer joins: left outer join, right outer join, full outer join

Duplicate Elimination and Sort

S =

id	name	score
100	John	70
101	Smith	50
102	Fred	90
103	Anna	90
104	Maria	50

d∏_score(S) =

score
70
50
90

t_score(S) =

id	name	score
101	Smith	50
104	Maria	50
100	John	70
102	Fred	90
103	Anna	90

Use DISTINCT to eliminate duplicates of rows. The second table is obtained by

SELECT DISTINCT score FROM S;

Use ORDER BY to sort the rows by one or more columns. The third table is obtained by

SELECT * FROM S ORDER BY score;

Note that the default order is the ascending order. Use the DESC keyword to sort the rows in a descending order. For example, SELECT * FROM S ORDER BY score DESC, id DESC; sorts the rows by the columns score and id in descending order .

Groupby

T =

class	name	score
a	Fred	55
a	Maria	99
b	John	60
b	Julia	82
c	Britta	50
c	Smith	50

T1 =

class	average
a	77.0
b	71.0
c	50

T2 =

class
a
b

The GROUP BY statement divides a table by a certain column. Applying it together with an aggregate function, we obtain a "summary" table.

Table T1 is the table of the averages of scores by class and obtained by

SELECT class, AVG(score) AS average FROM T GROUP BY class;

Table T2 is the table of the classes in which there is a person whose score > 60. It is obtained by

SELECT class FROM T GROUP BY class HAVING MAX(score) > 60;

Note the HAVING keyword.

Aggregate functions

AGV() gives the mean value.
COUNT() gives the number of rows. A typical usage is COUNT(*).
MAX()/MIN() gives the maximum/minimum value in the column.
SUM() gives the sum of the values in the column.
POWER(,) is used instead of **.
CEIL()/FLOOR()
REPLACE(col,'a','A') : replace substring with a new one

Some examples:

SELECT *, MAX(score) AS max FROM T GROUP BY class; finds the top score by class.
SELECT class, COUNT(*) AS count FROM T GROUP BY class; counts the number of rows in each class.

Comparison between RA, SQL and R (with dplyr)

RA	SQL	R/dplyr
union	UNION	rbind()
intersection	INTERSECT	intersect()
difference	EXCEPT	anti_join()
selection	WHERE	filter()
projection	SELECT	select()
join	JOIN	merge()
duplicate elimination	DISTINCT	distinct()
sorting	ORDER BY	arrange()
grouping	GROUP BY	group_by()

Notes for R

The sqldf package allows us to work with SQL commands on R. Manual, Tutorial 1, Tutorial 2.
The plyr package provides join() function. Manual.
merge(df,dg,by=NULL) gives a cross join.

Modifying a Table

Add a new row

INSERT INTO Table VALUES (val1,val2,val3,...);
INSERT INTO Table (col1,col2,col3,...) VALUES (val1,val2,val3,...);

Modify an exsisting row

UPDATE Table SET col1=val1,col2=val2,... WHERE coln=val;

Delete a row
```
DELETE FROM Table WHERE col=val;
```

Managing Tables

Create a table
```
CREATE TABLE Table (A TEXT, B TEXT);
```
Creates a virtual table.
```
CREATE VIEW NewTable AS SELECT * FROM Table WHERE score>60;
```
A virtual table always gives up-to-date data.
Deletes a table
```
DROP TABLE Table;
```

Alter TABLE statement

Tutorial1, Tutorial2.
According to SQLite FAQ: (11) How do I add or delete columns from an existing table in SQLite.

SQLite has limited ALTER TABLE support that you can use to add a column to the end of a table or to change the name of a table. If you want to make more complex changes in the structure of a table, you will have to recreate the table. You can save existing data to a temporary table, drop the old table, create the new table, then copy the data back in from the temporary table.

Datatype (schema)

SQLite: NULL, INTEGER, REAL, TEXT, BLOB.
MariaDB, MySQL, Overview.
- Integer: TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT
- Text: TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
- Blob: TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB
- Floating-Point: FLOAT, DOUBLE
- Fixed-Point Types: DECIMAL( , )
- Bit Value Types: BIT(N)
- Date and Time Types: DATETIME, DATE, TIMESTAMP
- CHAR and VARCHAR Types: CHAR, VARCHAR
- BINARY and VARBINARY Types: BINARY, VARBINARY

Show the information of a table

SQLite: pragma table_info(Table);

sqlite> pragma table_info(S);
cid         name        type        notnull     dflt_value  pk        
----------  ----------  ----------  ----------  ----------  ----------
0           id          int         0                       0         
1           name        text        0                       0         
2           score       int         0                       0

.schema shows the commands which are used to create all tables.

MySQL/MariaDB: DESCRIBE Table. Manual. This is equivalent to SHOW COLUMNS FROM Table;. More details including character encoding can be obtained by SHOW FULL COLUMNS FROM Table;. About UTF-8 on MySQL (German).

Window functions

Window functions are not available on SQLite and MySQL.
Tutorial 1, Tutorial 2, Tutorial 3, window functions in dplyr

A fundamental syntax is:

<window function> OVER (
    [PARTITION BY <expression list>]
    [ORDER BY <expression [ASC|DESC] list>]
    [ROWS|RANGE <window frame>]
)

The usage of PARTITION BY is the same as one of GROUP BY.
When using ORDER BY, the window is going to be
```
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
```
Assume we use PARTITION BY with ORDER BY. To make the range the whole of the corresponding group, use the following range
```
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
```
Such a frame could be needed when we use a "order sensitive" function such as first_value().
OVER () means all rows

We can create an alias of a window by WINDOW AS w (ORDER BY col).

SELECT col, LAG(col,1) OVER w FROM generate_series(1,15,2) as T(col) WINDOW w AS (ORDER BY col);

Window functions

Every aggregate function (including a user defined aggregate function) can be used as a window function.
The build-in aggregate functions / window functions of PostgreSQL.
row_number() gives the row index to each row.
Some functions requires ORDER BY <exp> instead of an expression.
- rank() / dense_rank() : give the rank according to <exp>
- percent_rank() percenteil function
- cume_dist() gives the cumulative distribution.

In the following example, a table q contains only one column x with values 4, 2, 5, 1, 9, 4, 0, 6.

select x,
    row_number() over (),
    rank() over w,
    dense_rank() over w,
    percent_rank() over w,
    cume_dist() over w
from q window w as (order by x);

The above statement creates the following table. Note that the order of rows are changed because of the window w.

 x | row_number | rank | dense_rank |   percent_rank    | cume_dist 
---+------------+------+------------+-------------------+-----------
 0 |          1 |    1 |          1 |                 0 |     0.125
 1 |          2 |    2 |          2 | 0.142857142857143 |      0.25
 2 |          3 |    3 |          3 | 0.285714285714286 |     0.375
 4 |          4 |    4 |          4 | 0.428571428571429 |     0.625
 4 |          5 |    4 |          4 | 0.428571428571429 |     0.625
 5 |          6 |    6 |          5 | 0.714285714285714 |      0.75
 6 |          7 |    7 |          6 | 0.857142857142857 |     0.875
 9 |          8 |    8 |          7 |                 1 |         1

Share this page on