- Most of this page is just notes on Data Manipulation at Scale: Systems and Algorithms.
- Databases: SQLite, MariaDB, MySQL, PostgreSQL
- Tutorials: SQL tutorial, SQL Reference, Tutorial, PostgreSQL Tutorial
- This overview is obviously incomplete.
Relational Algebra and SQL
Relational Algebra is an algebra on tables and consists of the following operators
- Union $\cup$ , Intersection $\cap$ , Difference $-$ ,
- Selection $\sigma$ , Projection $\Pi$ ,
- (Natural) Join $\bowtie$ .
Note that the algebra is closed under the above operators, i.e. the output is always a table.
The relational algebra equipped with the following operators is called extended relational algebra.
- Duplicate Elimination $d$ ,
- Grouping and Aggregation $g$ ,
- Sorting $t$ .
Union, Intersection and Difference
| R1 = |
| R2 = |
| R1 ∪ R2 = |
| R1 ∩ R2 = |
| R1 − R2 = |
|
SELECT * FROM R1 UNION ALL SELECT * FROM R2;: the union R1 $\cup$ R2SELECT * FROM R1 INTERSECT SELECT * FROM R2;: the intersection R1 $\cap$ R2SELECT * FROM R1 EXCEPT SELECT * FROM R2;: the difference R1 $-$ R2
Claim: R1 $\cap$ R2 = R1 $-$ (R1 $-$ R2). The RHS is obtained by
SELECT * FROM R1 EXCEPT SELECT * FROM (SELECT * FROM R1 EXCEPT SELECT * FROM R2);
Selection and Projection
| S = |
| σscore≥60(S) = |
| ∏name,score(S) = |
|
Selection
Pick rows satisfying a certain condition $c$. The result is written as $\sigma_c(R)$. Here $c$ is a map sending a row into a boolean value.
SELECT * FROM S WHERE score >= 60;
We use WHERE c for the selection $\sigma_c$ , not SELECT.
=is used for an equation. Not==.The condition
score >= 50 AND score <= 70can be described simpler with theBETWEENoperator.SELECT * FROM S WHERE score BETWEEN 50 AND 70;For a condition $c$ we can use
LIKEoperator to find a pattern of a strings.SELECT name FROM S WHERE name LIKE '%h_';Here
%means one or more characters and_means a single character.Instead of
LIKEwe can use the regular expression withREGEXPoperator.SELECT name FROM S WHERE name REGEXP '^[aeiouAEIOU]';For a condition $c$ we can use
INoperator to specify multiple values in a column.SELECT * FROM S WHERE score IN ('50','90');To obtain the first n rows in a table, use
LIMIT.SELECT * FROM S LIMIT 50;Use
HAVINGinstead ofWHEREwhen using an aggregate function.SELECT * FROM S
Projection
Restrict a table to the specified columns, say A1, ... , An. The result is written as $\Pi$A1,...,An(R).
SELECT name,score from S;
We use SELECT name,score for the projection $\Pi$name,score.
Join and Cross Product
| S = |
| S2 = |
| S × S2 = |
|
The cross product R1$\times$R2 (a.k.a. the cross join) gives the all possible pair of a row of R1 and a row of R2. The cross proect is obtained by SELECT * FROM S,S2;.
The equi-join R1 $\bowtie$A=B R2 is defined by $\sigma$A=B(R1$\times$R2). Therefore S $\bowtie$ S.name=S2.name S2 is the following table
| id | name | score | name | hight |
| 100 | John | 70 | John | 181.0 |
| 101 | Smith | 50 | Smith | 172.0 |
| 102 | Fred | 90 | Fred | 165.0 |
and we obtain it by SELECT * FROM S,S2 WHERE S.name = S2.name; by definition, or
SELECT * FROM S JOIN S2 ON S.name = S2.name;
The theta-join R1 $\bowtie_\theta$ R2 is an easy extension of the equi-join and defined by $\sigma_\theta$(R1$\times$R2).
Outer joins: left outer join, right outer join, full outer join
Duplicate Elimination and Sort
| S = |
| d∏score(S) = |
| tscore(S) = |
|
Use DISTINCT to eliminate duplicates of rows. The second table is obtained by
SELECT DISTINCT score FROM S;
Use ORDER BY to sort the rows by one or more columns. The third table is obtained by
SELECT * FROM S ORDER BY score;
Note that the default order is the ascending order. Use the DESC keyword to sort the rows in a descending order. For example, SELECT * FROM S ORDER BY score DESC, id DESC; sorts the rows by the columns score and id in descending order .
Groupby
| T = |
| T1 = |
| T2 = |
|
The GROUP BY statement divides a table by a certain column. Applying it together with an aggregate function, we obtain a "summary" table.
Table T1 is the table of the averages of scores by class and obtained by
SELECT class, AVG(score) AS average FROM T GROUP BY class;
Table T2 is the table of the classes in which there is a person whose score > 60. It is obtained by
SELECT class FROM T GROUP BY class HAVING MAX(score) > 60;
Note the HAVING keyword.
Aggregate functions
AGV()gives the mean value.COUNT()gives the number of rows. A typical usage isCOUNT(*).MAX()/MIN()gives the maximum/minimum value in the column.SUM()gives the sum of the values in the column.POWER(,)is used instead of**.CEIL()/FLOOR()REPLACE(col,'a','A'): replace substring with a new one
Some examples:
SELECT *, MAX(score) AS max FROM T GROUP BY class;finds the top score by class.SELECT class, COUNT(*) AS count FROM T GROUP BY class;counts the number of rows in each class.
Comparison between RA, SQL and R (with dplyr)
| RA | SQL | R/dplyr |
| union | UNION | rbind() |
| intersection | INTERSECT | intersect() |
| difference | EXCEPT | anti_join() |
| selection | WHERE | filter() |
| projection | SELECT | select() |
| join | JOIN | merge() |
| duplicate elimination | DISTINCT | distinct() |
| sorting | ORDER BY | arrange() |
| grouping | GROUP BY | group_by() |
Notes for R
- The
sqldfpackage allows us to work with SQL commands on R. Manual, Tutorial 1, Tutorial 2. - The
plyrpackage providesjoin()function. Manual. merge(df,dg,by=NULL)gives a cross join.
Modifying a Table
Add a new row
INSERT INTO Table VALUES (val1,val2,val3,...); INSERT INTO Table (col1,col2,col3,...) VALUES (val1,val2,val3,...);Modify an exsisting row
UPDATE Table SET col1=val1,col2=val2,... WHERE coln=val;Delete a row
DELETE FROM Table WHERE col=val;
Managing Tables
Create a table
CREATE TABLE Table (A TEXT, B TEXT);Creates a virtual table.
CREATE VIEW NewTable AS SELECT * FROM Table WHERE score>60;A virtual table always gives up-to-date data.
Deletes a table
DROP TABLE Table;
Alter TABLE statement
- Tutorial1, Tutorial2.
- According to SQLite FAQ:
(11) How do I add or delete columns from an existing table in SQLite.
SQLite has limited ALTER TABLE support that you can use to add a column to the end of a table or to change the name of a table. If you want to make more complex changes in the structure of a table, you will have to recreate the table. You can save existing data to a temporary table, drop the old table, create the new table, then copy the data back in from the temporary table.
Datatype (schema)
- SQLite: NULL, INTEGER, REAL, TEXT, BLOB.
- MariaDB, MySQL, Overview.
- Integer:
TINYINT,SMALLINT,MEDIUMINT,INT,BIGINT - Text:
TINYTEXT,TEXT,MEDIUMTEXT,LONGTEXT - Blob:
TINYBLOB,BLOB,MEDIUMBLOB,LONGBLOB - Floating-Point:
FLOAT,DOUBLE - Fixed-Point Types:
DECIMAL( , ) - Bit Value Types:
BIT(N) - Date and Time Types:
DATETIME,DATE,TIMESTAMP - CHAR and VARCHAR Types:
CHAR,VARCHAR - BINARY and VARBINARY Types:
BINARY,VARBINARY
- Integer:
Show the information of a table
SQLite:
pragma table_info(Table);sqlite> pragma table_info(S); cid name type notnull dflt_value pk ---------- ---------- ---------- ---------- ---------- ---------- 0 id int 0 0 1 name text 0 0 2 score int 0 0.schemashows the commands which are used to create all tables.- MySQL/MariaDB:
DESCRIBE Table. Manual. This is equivalent toSHOW COLUMNS FROM Table;. More details including character encoding can be obtained bySHOW FULL COLUMNS FROM Table;. About UTF-8 on MySQL (German).
Window functions
- Window functions are not available on SQLite and MySQL.
- Tutorial 1, Tutorial 2, Tutorial 3, window functions in dplyr
A fundamental syntax is:
<window function> OVER (
[PARTITION BY <expression list>]
[ORDER BY <expression [ASC|DESC] list>]
[ROWS|RANGE <window frame>]
)
- The usage of
PARTITION BYis the same as one ofGROUP BY. When using
ORDER BY, the window is going to beROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWAssume we use
PARTITION BYwithORDER BY. To make the range the whole of the corresponding group, use the following rangeROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWINGSuch a frame could be needed when we use a "order sensitive" function such as
first_value().OVER ()means all rowsWe can create an alias of a window by
WINDOW AS w (ORDER BY col).SELECT col, LAG(col,1) OVER w FROM generate_series(1,15,2) as T(col) WINDOW w AS (ORDER BY col);
Window functions
- Every aggregate function (including a user defined aggregate function) can be used as a window function.
- The build-in aggregate functions / window functions of PostgreSQL.
row_number()gives the row index to each row.- Some functions requires
ORDER BY <exp>instead of an expression.rank()/dense_rank(): give the rank according to<exp>percent_rank()percenteil functioncume_dist()gives the cumulative distribution.
In the following example, a table q contains only one column x with values 4, 2, 5, 1, 9, 4, 0, 6.
select x,
row_number() over (),
rank() over w,
dense_rank() over w,
percent_rank() over w,
cume_dist() over w
from q window w as (order by x);
The above statement creates the following table. Note that the order of rows are changed because of the window w.
x | row_number | rank | dense_rank | percent_rank | cume_dist
---+------------+------+------------+-------------------+-----------
0 | 1 | 1 | 1 | 0 | 0.125
1 | 2 | 2 | 2 | 0.142857142857143 | 0.25
2 | 3 | 3 | 3 | 0.285714285714286 | 0.375
4 | 4 | 4 | 4 | 0.428571428571429 | 0.625
4 | 5 | 4 | 4 | 0.428571428571429 | 0.625
5 | 6 | 6 | 5 | 0.714285714285714 | 0.75
6 | 7 | 7 | 6 | 0.857142857142857 | 0.875
9 | 8 | 8 | 7 | 1 | 1