Set operations with SQL
Yao-Jen Kuo yaojenkuo@ntu.edu.tw from DATAINPOINT
%LOAD sqlite3 db=data/taiwan_election_2024.db timeout=2 shared_cache=true
In a database, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary.
SELECT
statement that feels like a table for users.CREATE VIEW
statement¶CREATE VIEW view_name
AS
SELECT ...;
DROP VIEW IF EXISTS presidents_summary;
CREATE VIEW presidents_summary
AS
SELECT number,
candidate_id,
SUM(votes) AS sum_votes
FROM presidents
GROUP BY number;
SELECT *
FROM presidents_summary;
number | candidate_id | sum_votes |
---|---|---|
1 | 330 | 3690466 |
2 | 331 | 5586019 |
3 | 329 | 4671021 |
DROP
view after CREATE
.DROP VIEW IF EXISTS presidents_summary;
SELECT *
FROM presidents_summary;
Common Table Expressions or CTEs act like temporary views that exist only for the duration of a single SQL statement.
SELECT
statement that feels like a table/view for users.WITH
statement¶WITH cte_name AS (
SELECT ...
)
SELECT ...
FROM cte_name;
WITH presidents_summary_cte AS (
SELECT number,
candidate_id,
SUM(votes) AS sum_votes
FROM presidents
GROUP BY number
)
SELECT *
FROM presidents_summary_cte;
WITH cte_name_1 AS (
SELECT ...
), cte_name_2 AS (
SELECT ...
)
SELECT ...
FROM cte_name_1
SELECT ...
FROM cte_name_2;
WITH presidents_summary_cte AS (
SELECT number,
candidate_id,
SUM(votes) AS sum_votes
FROM presidents
GROUP BY number
), party_legislator_summary_cte AS (
SELECT number,
party_id,
SUM(votes) AS sum_votes
FROM party_legislators
GROUP BY number
)
SELECT *
FROM presidents_summary_cte;
/*
SELECT *
FROM party_legislator_summary_cte;
*/
A relational database is a digital database based on the relational model of data.
Using the relational model, we can build tables that eliminate duplicate data, are easier to maintain, and provide for increased flexibility in writing queries to get just the data we want.
JOIN...ON...
statement¶The query examines both tables and then returns columns from both tables where the values match in the columns specified in the ON
statement.
SELECT left_table.columns,
right_table.columns
FROM left_table
JOIN right_table
ON table_name.join_key = table_name.primary_key
presidents
with candidates
¶SELECT presidents.number,
candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
GROUP BY presidents.number;
number | candidates_name | sum_votes |
---|---|---|
1 | 柯文哲/吳欣盈 | 3690466 |
2 | 賴清德/蕭美琴 | 5586019 |
3 | 侯友宜/趙少康 | 4671021 |
presidents
with candidates
and parties
¶SELECT presidents.number,
candidates.name AS candidates_name,
parties.name AS party_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
JOIN parties
ON candidates.party_id = parties.id
GROUP BY presidents.number;
number | candidates_name | party_name | sum_votes |
---|---|---|---|
1 | 柯文哲/吳欣盈 | 台灣民眾黨 | 3690466 |
2 | 賴清德/蕭美琴 | 民主進步黨 | 5586019 |
3 | 侯友宜/趙少康 | 中國國民黨 | 4671021 |
One or more columns whose values uniquely identify each row in a table.
id
.SELECT *
FROM PRAGMA_TABLE_INFO('presidents')
WHERE pk >= 1;
cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|
0 | id | INTEGER | 0 | NULL | 1 |
SELECT *
FROM PRAGMA_TABLE_INFO('candidates')
WHERE pk >= 1;
cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|
0 | id | INTEGER | 0 | NULL | 1 |
SELECT *
FROM PRAGMA_TABLE_INFO('parties')
WHERE pk >= 1;
cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|
0 | id | INTEGER | 0 | NULL | 1 |
_id
. e.g. candidate_id
-> candidates
JOIN
returns rows from the left and the right table where matching values are found¶SELECT presidents.number,
candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
GROUP BY presidents.number;
number | candidates_name | sum_votes |
---|---|---|
1 | 柯文哲/吳欣盈 | 3690466 |
2 | 賴清德/蕭美琴 | 5586019 |
3 | 侯友宜/趙少康 | 4671021 |
JOIN
, the LEFT JOIN
keyword returns all rows from the left table and display blank rows from the other table if no matching values are found¶SELECT left_table.columns,
right_table.columns
FROM left_table
LEFT JOIN right_table
ON table_name.join_key = table_name.primary_key
SELECT candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM candidates
LEFT JOIN presidents
ON presidents.candidate_id = candidates.id
GROUP BY candidates.name
ORDER BY sum_votes DESC
LIMIT 5;
candidates_name | sum_votes |
---|---|
賴清德/蕭美琴 | 5586019 |
侯友宜/趙少康 | 4671021 |
柯文哲/吳欣盈 | 3690466 |
黃秀芳 | NULL |
黃瑞傳 | NULL |
JOIN
types that are not directly supported by SQLite¶RIGHT JOIN
FULL JOIN
RIGHT JOIN
in SQLite by switching left table to right¶SELECT candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
LEFT JOIN candidates
ON presidents.candidate_id = candidates.id
GROUP BY candidates.name
ORDER BY sum_votes DESC
LIMIT 5;
candidates_name | sum_votes |
---|---|
賴清德/蕭美琴 | 5586019 |
侯友宜/趙少康 | 4671021 |
柯文哲/吳欣盈 | 3690466 |
Source: Pandas User Guide
UNION
keyword¶Source: Pandas User Guide
SELECT '總統副總統得票數' AS election_type,
parties.name AS party_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
JOIN parties
ON candidates.party_id = parties.id
GROUP BY parties.name
UNION
SELECT '區域立委得票數' AS election_type,
parties.name AS party_name,
SUM(party_legislators.votes) AS sum_votes
FROM party_legislators
JOIN parties
ON party_legislators.party_id = parties.id
WHERE parties.name IN ('中國國民黨', '台灣民眾黨', '民主進步黨')
GROUP BY parties.name;
election_type | party_name | sum_votes |
---|---|---|
區域立委得票數 | 中國國民黨 | 4764576 |
區域立委得票數 | 台灣民眾黨 | 3040615 |
區域立委得票數 | 民主進步黨 | 4982062 |
總統副總統得票數 | 中國國民黨 | 4671021 |
總統副總統得票數 | 台灣民眾黨 | 3690466 |
總統副總統得票數 | 民主進步黨 | 5586019 |
SQL is about the order of keywords, so follow this convention:
SELECT column_names,
CASE WHEN conditions THEN result
...
ELSE result_else END AS alias
FROM left_table
JOIN | LEFT JOIN right_table
ON table_name.join_key = table_name.primary_key
WHERE conditions
GROUP BY column_names
HAVING aggregated_conditions
UNION ...
ORDER BY column_names
LIMIT m OFFSET m;
taiwan_election_2024.db
database¶A table is a collection of related data held in a table format within a database. It consists of columns and rows.In relational databases, and flat file databases, a table is a set of data elements (values) using a model of vertical columns (identifiable by name) and horizontal rows, the cell being the unit where a row and column intersect. A table has a specified number of columns, but can have any number of rows.
presidents.candidate_id
-> candidates.id
relationship, presidents
is a child table while candidates
being a parent table.candidates.party_id
-> parties.id
relationship, candidates
is a child table while parties
being a parent table.con = sqlite3.connect("normalized_datasets/taiwan_election_2024.db")
cur = con.cursor()
folder_name = "normalized_datasets"
file_names = ["districts.csv", "parties.csv", "election_types.csv", "candidates.csv", "polling_places.csv",
"presidents.csv", "regional_legislators.csv", "aboriginal_legislators.csv", "party_legislators.csv",
"villages.csv"]
for file_name in file_names:
file_path = f"{folder_name}/{file_name}"
df = pd.read_csv(file_path)
table_name = file_name.split(".")[0]
df.to_sql(table_name, con, if_exists="replace", index=False)
districts
¶table_name = "districts"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
county CHAR(3),
town VARCHAR(200),
polling_place INTEGER,
vote_tallied_at CHAR(19),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
villages
¶table_name = "villages"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
name VARCHAR(200),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
parties
¶table_name = "parties"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
name VARCHAR(100),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
election_types
¶table_name = "election_types"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
election_type VARCHAR(10),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
candidates
¶table_name = "candidates"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
name VARCHAR(100),
party_id INTEGER,
election_type_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (party_id) REFERENCES parties (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
polling_places
¶table_name = "polling_places"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
polling_place INTEGER,
effective_votes INTEGER,
wasted_votes INTEGER,
issued_votes INTEGER,
remained_votes INTEGER,
election_type_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
presidents
¶table_name = "presidents"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
number INTEGER,
district_id INTEGER,
candidate_id INTEGER,
votes INTEGER,
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (candidate_id) REFERENCES candidates (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
presidents
¶table_name = "regional_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
candidate_id INTEGER,
number INTEGER,
votes INTEGER,
legislator_region VARCHAR(50),
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (candidate_id) REFERENCES candidates (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
aboriginal_legislators
¶table_name = "aboriginal_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
candidate_id INTEGER,
number INTEGER,
votes INTEGER,
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (candidate_id) REFERENCES candidates (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
party_legislators
¶table_name = "party_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
party_id INTEGER,
number INTEGER,
votes INTEGER,
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (party_id) REFERENCES parties (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
table_names = [fn.split(".")[0] for fn in file_names]
for table_name in table_names:
drop_table_long_str = f"""
PRAGMA foreign_keys=off;
DROP TABLE IF EXISTS {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(drop_table_long_str)
con.commit()
con.close()
An entity–relationship model (or ER model) describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types).
Source: https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model
taiwan_election_2024.db
¶