Overview of the relational algebra and SQL

Relational Algebra and SQL

Relational Algebra is an algebra on tables and consists of the following operators

• Union $\cup$ , Intersection $\cap$ , Difference $-$ ,
• Selection $\sigma$ , Projection $\Pi$ ,
• (Natural) Join $\bowtie$ .

Note that the algebra is closed under the above operators, i.e. the output is always a table.

The relational algebra equipped with the following operators is called extended relational algebra.

• Duplicate Elimination $d$ ,
• Grouping and Aggregation $g$ ,
• Sorting $t$ .

Union, Intersection and Difference

R1 =
 A B a1 b1 a2 b1
R2 =
 A B a1 b1 a3 b4
R1 ∪ R2 =
 A B a1 b1 a2 b1 a1 b1 a3 b4
R1 ∩ R2 =
 A B a1 b1
R1 − R2 =
 A B a2 b1
• SELECT * FROM R1 UNION ALL SELECT * FROM R2; : the union R1 $\cup$ R2
• SELECT * FROM R1 INTERSECT SELECT * FROM R2; : the intersection R1 $\cap$ R2
• SELECT * FROM R1 EXCEPT SELECT * FROM R2; : the difference R1 $-$ R2

Claim: R1 $\cap$ R2 = R1 $-$ (R1 $-$ R2). The RHS is obtained by

SELECT * FROM R1 EXCEPT SELECT * FROM (SELECT * FROM R1 EXCEPT SELECT * FROM R2);


Selection and Projection

S =
 id name score 100 John 70 101 Smith 50 102 Fred 90
σscore≥60(S) =
 id name score 100 John 70 102 Fred 90
∏name,score(S) =
 name score John 70 Smith 50 Fred 90

Selection

Pick rows satisfying a certain condition $c$. The result is written as $\sigma_c(R)$. Here $c$ is a map sending a row into a boolean value.

SELECT * FROM S WHERE score >= 60;


We use WHERE c for the selection $\sigma_c$ , not SELECT.

• = is used for an equation. Not ==.
• The condition score >= 50 AND score <= 70 can be described simpler with the BETWEEN operator.

SELECT * FROM S WHERE score BETWEEN 50 AND 70;

• For a condition $c$ we can use LIKE operator to find a pattern of a strings.

SELECT name FROM S WHERE name LIKE '%h_';


Here % means one or more characters and _ means a single character.

• Instead of LIKE we can use the regular expression with REGEXP operator.

SELECT name FROM S WHERE name REGEXP '^[aeiouAEIOU]';

• For a condition $c$ we can use IN operator to specify multiple values in a column.

SELECT * FROM S WHERE score IN ('50','90');

• To obtain the first n rows in a table, use LIMIT.

SELECT * FROM S LIMIT 50;

• Use HAVING instead of WHERE when using an aggregate function.

SELECT * FROM S


Projection

Restrict a table to the specified columns, say A1, ... , An. The result is written as $\Pi$A1,...,An(R).

SELECT name,score from S;


We use SELECT name,score for the projection $\Pi$name,score.

Join and Cross Product

S =
 id name score 100 John 70 101 Smith 50 102 Fred 90
S2 =
 name hight Fred 165.0 John 181.0 Smith 172.0
S × S2 =
 id name score name hight 100 John 70 Fred 165.0 100 John 70 John 181.0 100 John 70 Smith 172.0 101 Smith 50 Fred 165.0 101 Smith 50 John 181.0 101 Smith 50 Smith 172.0 102 Fred 90 Fred 165.0 102 Fred 90 John 181.0 102 Fred 90 Smith 172.0

The cross product R1$\times$R2 (a.k.a. the cross join) gives the all possible pair of a row of R1 and a row of R2. The cross proect is obtained by SELECT * FROM S,S2;.

The equi-join R1 $\bowtie$A=B R2 is defined by $\sigma$A=B(R1$\times$R2). Therefore S $\bowtie$ S.name=S2.name S2 is the following table

 id name score name hight 100 John 70 John 181.0 101 Smith 50 Smith 172.0 102 Fred 90 Fred 165.0

and we obtain it by SELECT * FROM S,S2 WHERE S.name = S2.name; by definition, or

SELECT * FROM S JOIN S2 ON S.name = S2.name;


The theta-join R1 $\bowtie_\theta$ R2 is an easy extension of the equi-join and defined by $\sigma_\theta$(R1$\times$R2).

Outer joins: left outer join, right outer join, full outer join

Duplicate Elimination and Sort

S =
 id name score 100 John 70 101 Smith 50 102 Fred 90 103 Anna 90 104 Maria 50
d∏score(S) =
 score 70 50 90
tscore(S) =
 id name score 101 Smith 50 104 Maria 50 100 John 70 102 Fred 90 103 Anna 90

Use DISTINCT to eliminate duplicates of rows. The second table is obtained by

SELECT DISTINCT score FROM S;


Use ORDER BY to sort the rows by one or more columns. The third table is obtained by

SELECT * FROM S ORDER BY score;


Note that the default order is the ascending order. Use the DESC keyword to sort the rows in a descending order. For example, SELECT * FROM S ORDER BY score DESC, id DESC; sorts the rows by the columns score and id in descending order .

Groupby

T =
 class name score a Fred 55 a Maria 99 b John 60 b Julia 82 c Britta 50 c Smith 50
T1 =
 class average a 77.0 b 71.0 c 50
T2 =
 class a b

The GROUP BY statement divides a table by a certain column. Applying it together with an aggregate function, we obtain a "summary" table.

Table T1 is the table of the averages of scores by class and obtained by

SELECT class, AVG(score) AS average FROM T GROUP BY class;


Table T2 is the table of the classes in which there is a person whose score > 60. It is obtained by

SELECT class FROM T GROUP BY class HAVING MAX(score) > 60;


Note the HAVING keyword.

Aggregate functions

• AGV() gives the mean value.
• COUNT() gives the number of rows. A typical usage is COUNT(*).
• MAX()/MIN() gives the maximum/minimum value in the column.
• SUM() gives the sum of the values in the column.
• POWER(,) is used instead of **.
• CEIL()/FLOOR()
• REPLACE(col,'a','A') : replace substring with a new one

Some examples:

• SELECT *, MAX(score) AS max FROM T GROUP BY class; finds the top score by class.
• SELECT class, COUNT(*) AS count FROM T GROUP BY class; counts the number of rows in each class.

Comparison between RA, SQL and R (with dplyr)

 RA SQL R/dplyr union UNION rbind() intersection INTERSECT intersect() difference EXCEPT anti_join() selection WHERE filter() projection SELECT select() join JOIN merge() duplicate elimination DISTINCT distinct() sorting ORDER BY arrange() grouping GROUP BY group_by()

Notes for R

• The sqldf package allows us to work with SQL commands on R. Manual, Tutorial 1, Tutorial 2.
• The plyr package provides join() function. Manual.
• merge(df,dg,by=NULL) gives a cross join.

Modifying a Table

INSERT INTO Table VALUES (val1,val2,val3,...);
INSERT INTO Table (col1,col2,col3,...) VALUES (val1,val2,val3,...);

• Modify an exsisting row

UPDATE Table SET col1=val1,col2=val2,... WHERE coln=val;

• Delete a row

DELETE FROM Table WHERE col=val;


Managing Tables

• Create a table

CREATE TABLE Table (A TEXT, B TEXT);

• Creates a virtual table.

CREATE VIEW NewTable AS SELECT * FROM Table WHERE score>60;


A virtual table always gives up-to-date data.

• Deletes a table

DROP TABLE Table;


Alter TABLE statement

• Tutorial1, Tutorial2.
• According to SQLite FAQ: (11) How do I add or delete columns from an existing table in SQLite.

SQLite has limited ALTER TABLE support that you can use to add a column to the end of a table or to change the name of a table. If you want to make more complex changes in the structure of a table, you will have to recreate the table. You can save existing data to a temporary table, drop the old table, create the new table, then copy the data back in from the temporary table.

Datatype (schema)

• SQLite: NULL, INTEGER, REAL, TEXT, BLOB.
• Integer: TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT
• Text: TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
• Blob: TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB
• Floating-Point: FLOAT, DOUBLE
• Fixed-Point Types: DECIMAL( , )
• Bit Value Types: BIT(N)
• Date and Time Types: DATETIME, DATE, TIMESTAMP
• CHAR and VARCHAR Types: CHAR, VARCHAR
• BINARY and VARBINARY Types: BINARY, VARBINARY

Show the information of a table

• SQLite: pragma table_info(Table);

sqlite> pragma table_info(S);
cid         name        type        notnull     dflt_value  pk
----------  ----------  ----------  ----------  ----------  ----------
0           id          int         0                       0
1           name        text        0                       0
2           score       int         0                       0


.schema shows the commands which are used to create all tables.

• MySQL/MariaDB: DESCRIBE Table. Manual. This is equivalent to SHOW COLUMNS FROM Table;. More details including character encoding can be obtained by SHOW FULL COLUMNS FROM Table;. About UTF-8 on MySQL (German).

Window functions

A fundamental syntax is:

<window function> OVER (
[PARTITION BY <expression list>]
[ORDER BY <expression [ASC|DESC] list>]
[ROWS|RANGE <window frame>]
)

• The usage of PARTITION BY is the same as one of GROUP BY.
• When using ORDER BY, the window is going to be

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW


Assume we use PARTITION BY with ORDER BY. To make the range the whole of the corresponding group, use the following range

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING


Such a frame could be needed when we use a "order sensitive" function such as first_value().

• OVER () means all rows
• We can create an alias of a window by WINDOW AS w (ORDER BY col).

SELECT col, LAG(col,1) OVER w FROM generate_series(1,15,2) as T(col) WINDOW w AS (ORDER BY col);


Window functions

• Every aggregate function (including a user defined aggregate function) can be used as a window function.
• The build-in aggregate functions / window functions of PostgreSQL.
• row_number() gives the row index to each row.
• Some functions requires ORDER BY <exp> instead of an expression.
• rank() / dense_rank() : give the rank according to <exp>
• percent_rank() percenteil function
• cume_dist() gives the cumulative distribution.

In the following example, a table q contains only one column x with values 4, 2, 5, 1, 9, 4, 0, 6.

select x,
row_number() over (),
rank() over w,
dense_rank() over w,
percent_rank() over w,
cume_dist() over w
from q window w as (order by x);


The above statement creates the following table. Note that the order of rows are changed because of the window w.

 x | row_number | rank | dense_rank |   percent_rank    | cume_dist
---+------------+------+------------+-------------------+-----------
0 |          1 |    1 |          1 |                 0 |     0.125
1 |          2 |    2 |          2 | 0.142857142857143 |      0.25
2 |          3 |    3 |          3 | 0.285714285714286 |     0.375
4 |          4 |    4 |          4 | 0.428571428571429 |     0.625
4 |          5 |    4 |          4 | 0.428571428571429 |     0.625
5 |          6 |    6 |          5 | 0.714285714285714 |      0.75
6 |          7 |    7 |          6 | 0.857142857142857 |     0.875
9 |          8 |    8 |          7 |                 1 |         1