- Most of this page is just notes on Data Manipulation at Scale: Systems and Algorithms.
- Databases: SQLite, MariaDB, MySQL, PostgreSQL
- Tutorials: SQL tutorial, SQL Reference, Tutorial, PostgreSQL Tutorial
- This overview is obviously incomplete.
Relational Algebra and SQL
Relational Algebra is an algebra on tables and consists of the following operators
- Union $\cup$ , Intersection $\cap$ , Difference $-$ ,
- Selection $\sigma$ , Projection $\Pi$ ,
- (Natural) Join $\bowtie$ .
Note that the algebra is closed under the above operators, i.e. the output is always a table.
The relational algebra equipped with the following operators is called extended relational algebra.
- Duplicate Elimination $d$ ,
- Grouping and Aggregation $g$ ,
- Sorting $t$ .
Union, Intersection and Difference
R1 = |
| R2 = |
| R1 ∪ R2 = |
| R1 ∩ R2 = |
| R1 − R2 = |
|
SELECT * FROM R1 UNION ALL SELECT * FROM R2;
: the union R1 $\cup$ R2SELECT * FROM R1 INTERSECT SELECT * FROM R2;
: the intersection R1 $\cap$ R2SELECT * FROM R1 EXCEPT SELECT * FROM R2;
: the difference R1 $-$ R2
Claim: R1 $\cap$ R2 = R1 $-$ (R1 $-$ R2). The RHS is obtained by
SELECT * FROM R1 EXCEPT SELECT * FROM (SELECT * FROM R1 EXCEPT SELECT * FROM R2);
Selection and Projection
S = |
| σscore≥60(S) = |
| ∏name,score(S) = |
|
Selection
Pick rows satisfying a certain condition $c$. The result is written as $\sigma_c(R)$. Here $c$ is a map sending a row into a boolean value.
SELECT * FROM S WHERE score >= 60;
We use WHERE c
for the selection $\sigma_c$ , not SELECT
.
=
is used for an equation. Not==
.The condition
score >= 50 AND score <= 70
can be described simpler with theBETWEEN
operator.SELECT * FROM S WHERE score BETWEEN 50 AND 70;
For a condition $c$ we can use
LIKE
operator to find a pattern of a strings.SELECT name FROM S WHERE name LIKE '%h_';
Here
%
means one or more characters and_
means a single character.Instead of
LIKE
we can use the regular expression withREGEXP
operator.SELECT name FROM S WHERE name REGEXP '^[aeiouAEIOU]';
For a condition $c$ we can use
IN
operator to specify multiple values in a column.SELECT * FROM S WHERE score IN ('50','90');
To obtain the first n rows in a table, use
LIMIT
.SELECT * FROM S LIMIT 50;
Use
HAVING
instead ofWHERE
when using an aggregate function.SELECT * FROM S
Projection
Restrict a table to the specified columns, say A1, ... , An. The result is written as $\Pi$A1,...,An(R).
SELECT name,score from S;
We use SELECT name,score
for the projection $\Pi$name,score.
Join and Cross Product
S = |
| S2 = |
| S × S2 = |
|
The cross product R1$\times$R2 (a.k.a. the cross join) gives the all possible pair of a row of R1 and a row of R2. The cross proect is obtained by SELECT * FROM S,S2;
.
The equi-join R1 $\bowtie$A=B R2 is defined by $\sigma$A=B(R1$\times$R2). Therefore S $\bowtie$ S.name=S2.name S2 is the following table
id | name | score | name | hight |
100 | John | 70 | John | 181.0 |
101 | Smith | 50 | Smith | 172.0 |
102 | Fred | 90 | Fred | 165.0 |
and we obtain it by SELECT * FROM S,S2 WHERE S.name = S2.name;
by definition, or
SELECT * FROM S JOIN S2 ON S.name = S2.name;
The theta-join R1 $\bowtie_\theta$ R2 is an easy extension of the equi-join and defined by $\sigma_\theta$(R1$\times$R2).
Outer joins: left outer join, right outer join, full outer join
Duplicate Elimination and Sort
S = |
| d∏score(S) = |
| tscore(S) = |
|
Use DISTINCT
to eliminate duplicates of rows. The second table is obtained by
SELECT DISTINCT score FROM S;
Use ORDER BY
to sort the rows by one or more columns. The third table is obtained by
SELECT * FROM S ORDER BY score;
Note that the default order is the ascending order. Use the DESC
keyword to sort the rows in a descending order. For example, SELECT * FROM S ORDER BY score DESC, id DESC;
sorts the rows by the columns score and id in descending order .
Groupby
T = |
| T1 = |
| T2 = |
|
The GROUP BY
statement divides a table by a certain column. Applying it together with an aggregate function, we obtain a "summary" table.
Table T1 is the table of the averages of scores by class and obtained by
SELECT class, AVG(score) AS average FROM T GROUP BY class;
Table T2 is the table of the classes in which there is a person whose score > 60. It is obtained by
SELECT class FROM T GROUP BY class HAVING MAX(score) > 60;
Note the HAVING
keyword.
Aggregate functions
AGV()
gives the mean value.COUNT()
gives the number of rows. A typical usage isCOUNT(*)
.MAX()
/MIN()
gives the maximum/minimum value in the column.SUM()
gives the sum of the values in the column.POWER(,)
is used instead of**
.CEIL()
/FLOOR()
REPLACE(col,'a','A')
: replace substring with a new one
Some examples:
SELECT *, MAX(score) AS max FROM T GROUP BY class;
finds the top score by class.SELECT class, COUNT(*) AS count FROM T GROUP BY class;
counts the number of rows in each class.
Comparison between RA, SQL and R (with dplyr)
RA | SQL | R/dplyr |
union | UNION | rbind() |
intersection | INTERSECT | intersect() |
difference | EXCEPT | anti_join() |
selection | WHERE | filter() |
projection | SELECT | select() |
join | JOIN | merge() |
duplicate elimination | DISTINCT | distinct() |
sorting | ORDER BY | arrange() |
grouping | GROUP BY | group_by() |
Notes for R
- The
sqldf
package allows us to work with SQL commands on R. Manual, Tutorial 1, Tutorial 2. - The
plyr
package providesjoin()
function. Manual. merge(df,dg,by=NULL)
gives a cross join.
Modifying a Table
Add a new row
INSERT INTO Table VALUES (val1,val2,val3,...); INSERT INTO Table (col1,col2,col3,...) VALUES (val1,val2,val3,...);
Modify an exsisting row
UPDATE Table SET col1=val1,col2=val2,... WHERE coln=val;
Delete a row
DELETE FROM Table WHERE col=val;
Managing Tables
Create a table
CREATE TABLE Table (A TEXT, B TEXT);
Creates a virtual table.
CREATE VIEW NewTable AS SELECT * FROM Table WHERE score>60;
A virtual table always gives up-to-date data.
Deletes a table
DROP TABLE Table;
Alter TABLE statement
- Tutorial1, Tutorial2.
- According to SQLite FAQ:
(11) How do I add or delete columns from an existing table in SQLite.
SQLite has limited ALTER TABLE support that you can use to add a column to the end of a table or to change the name of a table. If you want to make more complex changes in the structure of a table, you will have to recreate the table. You can save existing data to a temporary table, drop the old table, create the new table, then copy the data back in from the temporary table.
Datatype (schema)
- SQLite: NULL, INTEGER, REAL, TEXT, BLOB.
- MariaDB, MySQL, Overview.
- Integer:
TINYINT
,SMALLINT
,MEDIUMINT
,INT
,BIGINT
- Text:
TINYTEXT
,TEXT
,MEDIUMTEXT
,LONGTEXT
- Blob:
TINYBLOB
,BLOB
,MEDIUMBLOB
,LONGBLOB
- Floating-Point:
FLOAT
,DOUBLE
- Fixed-Point Types:
DECIMAL( , )
- Bit Value Types:
BIT(N)
- Date and Time Types:
DATETIME
,DATE
,TIMESTAMP
- CHAR and VARCHAR Types:
CHAR
,VARCHAR
- BINARY and VARBINARY Types:
BINARY
,VARBINARY
- Integer:
Show the information of a table
SQLite:
pragma table_info(Table);
sqlite> pragma table_info(S); cid name type notnull dflt_value pk ---------- ---------- ---------- ---------- ---------- ---------- 0 id int 0 0 1 name text 0 0 2 score int 0 0
.schema
shows the commands which are used to create all tables.- MySQL/MariaDB:
DESCRIBE Table
. Manual. This is equivalent toSHOW COLUMNS FROM Table;
. More details including character encoding can be obtained bySHOW FULL COLUMNS FROM Table;
. About UTF-8 on MySQL (German).
Window functions
- Window functions are not available on SQLite and MySQL.
- Tutorial 1, Tutorial 2, Tutorial 3, window functions in dplyr
A fundamental syntax is:
<window function> OVER (
[PARTITION BY <expression list>]
[ORDER BY <expression [ASC|DESC] list>]
[ROWS|RANGE <window frame>]
)
- The usage of
PARTITION BY
is the same as one ofGROUP BY
. When using
ORDER BY
, the window is going to beROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Assume we use
PARTITION BY
withORDER BY
. To make the range the whole of the corresponding group, use the following rangeROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Such a frame could be needed when we use a "order sensitive" function such as
first_value()
.OVER ()
means all rowsWe can create an alias of a window by
WINDOW AS w (ORDER BY col)
.SELECT col, LAG(col,1) OVER w FROM generate_series(1,15,2) as T(col) WINDOW w AS (ORDER BY col);
Window functions
- Every aggregate function (including a user defined aggregate function) can be used as a window function.
- The build-in aggregate functions / window functions of PostgreSQL.
row_number()
gives the row index to each row.- Some functions requires
ORDER BY <exp>
instead of an expression.rank()
/dense_rank()
: give the rank according to<exp>
percent_rank()
percenteil functioncume_dist()
gives the cumulative distribution.
In the following example, a table q
contains only one column x
with values 4, 2, 5, 1, 9, 4, 0, 6.
select x,
row_number() over (),
rank() over w,
dense_rank() over w,
percent_rank() over w,
cume_dist() over w
from q window w as (order by x);
The above statement creates the following table. Note that the order of rows are changed because of the window w
.
x | row_number | rank | dense_rank | percent_rank | cume_dist
---+------------+------+------------+-------------------+-----------
0 | 1 | 1 | 1 | 0 | 0.125
1 | 2 | 2 | 2 | 0.142857142857143 | 0.25
2 | 3 | 3 | 3 | 0.285714285714286 | 0.375
4 | 4 | 4 | 4 | 0.428571428571429 | 0.625
4 | 5 | 4 | 4 | 0.428571428571429 | 0.625
5 | 6 | 6 | 5 | 0.714285714285714 | 0.75
6 | 7 | 7 | 6 | 0.857142857142857 | 0.875
9 | 8 | 8 | 7 | 1 | 1