Visualization with Modern Data Science¶
Set operations with SQL
Yao-Jen Kuo yaojenkuo@ntu.edu.tw from DATAINPOINT
%LOAD sqlite3 db=data/taiwan_election_2024.db timeout=2 shared_cache=true
View¶
What are views?¶
In a database, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary.
Views are like¶
- A persistent sub-query.
- A stored
SELECT
statement that feels like a table to users. - A feature lies between sub-queries and creating a table.
CREATE VIEW
statement¶
CREATE VIEW view_name
AS
SELECT ...;
DROP VIEW IF EXISTS presidents_summary;
CREATE VIEW presidents_summary
AS
SELECT number,
candidate_id,
SUM(votes) AS sum_votes
FROM presidents
GROUP BY number;
Once a view is created, use it like a table¶
SELECT *
FROM presidents_summary;
number | candidate_id | sum_votes |
---|---|---|
1 | 330 | 3690466 |
2 | 331 | 5586019 |
3 | 329 | 4671021 |
View is stored in a persistent format¶
- View is stored persistently in the connected database.
- We have to explicitly
DROP
view afterCREATE
.
DROP VIEW IF EXISTS presidents_summary;
SELECT *
FROM presidents_summary;
Common Table Expression, CTE¶
What is a common table expression¶
Common Table Expressions or CTEs act like temporary views that exist only for the duration of a single SQL statement.
CTEs are like¶
- A non-persistent sub-query.
- A temporary
SELECT
statement that feels like a table/view for users. - A feature lies between sub-queries and creating a view.
WITH
statement¶
WITH cte_name AS (
SELECT ...
)
SELECT ...
FROM cte_name;
WITH presidents_summary_cte AS (
SELECT number,
candidate_id,
SUM(votes) AS sum_votes
FROM presidents
GROUP BY number
)
SELECT *
FROM presidents_summary_cte;
Multiple CTEs¶
WITH cte_name_1 AS (
SELECT ...
), cte_name_2 AS (
SELECT ...
)
SELECT ...
FROM cte_name_1
SELECT ...
FROM cte_name_2;
WITH presidents_summary_cte AS (
SELECT number,
candidate_id,
SUM(votes) AS sum_votes
FROM presidents
GROUP BY number
), party_legislator_summary_cte AS (
SELECT number,
party_id,
SUM(votes) AS sum_votes
FROM party_legislators
GROUP BY number
)
SELECT *
FROM presidents_summary_cte;
/*
SELECT *
FROM party_legislator_summary_cte;
*/
Joining Tables¶
(Recap) What is a relational database¶
A relational database is a digital database based on the relational model of data.
Why relational model¶
Using the relational model, we can build tables that eliminate duplicate data, are easier to maintain, and provide for increased flexibility in writing queries to get just the data we want.
Joining tables using JOIN...ON...
statement¶
The query examines both tables and then returns columns from both tables where the values match in the columns specified in the ON
statement.
SELECT parent_table.columns,
child_table.columns
FROM parent_table
JOIN child_table
ON child_table.join_key = parent_table.primary_key
Joining presidents
with candidates
¶
SELECT presidents.number,
candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
GROUP BY presidents.number;
number | candidates_name | sum_votes |
---|---|---|
1 | 柯文哲/吳欣盈 | 3690466 |
2 | 賴清德/蕭美琴 | 5586019 |
3 | 侯友宜/趙少康 | 4671021 |
Joining presidents
with candidates
and parties
¶
SELECT presidents.number,
candidates.name AS candidates_name,
parties.name AS party_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
JOIN parties
ON candidates.party_id = parties.id
GROUP BY presidents.number;
number | candidates_name | party_name | sum_votes |
---|---|---|---|
1 | 柯文哲/吳欣盈 | 台灣民眾黨 | 3690466 |
2 | 賴清德/蕭美琴 | 民主進步黨 | 5586019 |
3 | 侯友宜/趙少康 | 中國國民黨 | 4671021 |
Joining tables with key columns: Primary key¶
One or more columns whose values uniquely identify each row in a table.
- Each column in the key must have a unique value for each row.
- No column in the key can have missing values.
- A primary key constraint is often set on
id
.
Joining tables with join key¶
- One or more columns in a table that match the primary key of another table.
- A join key is not a constraint.
- Two common naming styles for a join key:
- A join key has the same name as the primary key of another table.
- A join key is named as the singular form of another table followed by
_id
. e.g.candidate_id
->candidates
JOIN
returns rows from the left and the right table where matching values are found¶
SELECT presidents.number,
candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
GROUP BY presidents.number;
number | candidates_name | sum_votes |
---|---|---|
1 | 柯文哲/吳欣盈 | 3690466 |
2 | 賴清德/蕭美琴 | 5586019 |
3 | 侯友宜/趙少康 | 4671021 |
In contrast to JOIN
, the LEFT JOIN
keyword returns all rows from the left table and display blank rows from the other table if no matching values are found¶
SELECT left_table.columns,
right_table.columns
FROM left_table
LEFT JOIN right_table
ON left_table.join_key = right_table.primary_key
SELECT candidates.name AS candidates_name,
SUM(presidents.votes) AS sum_votes
FROM candidates
LEFT JOIN presidents
ON presidents.candidate_id = candidates.id
GROUP BY candidates.name
ORDER BY sum_votes DESC
LIMIT 5;
candidates_name | sum_votes |
---|---|
賴清德/蕭美琴 | 5586019 |
侯友宜/趙少康 | 4671021 |
柯文哲/吳欣盈 | 3690466 |
黃秀芳 | NULL |
黃瑞傳 | NULL |
Other JOIN
types are now supported by SQLite¶
RIGHT JOIN
FULL JOIN
SELECT '總統副總統得票數' AS election_type,
parties.name AS party_name,
SUM(presidents.votes) AS sum_votes
FROM presidents
JOIN candidates
ON presidents.candidate_id = candidates.id
JOIN parties
ON candidates.party_id = parties.id
GROUP BY parties.name
UNION
SELECT '區域立委得票數' AS election_type,
parties.name AS party_name,
SUM(party_legislators.votes) AS sum_votes
FROM party_legislators
JOIN parties
ON party_legislators.party_id = parties.id
WHERE parties.name IN ('中國國民黨', '台灣民眾黨', '民主進步黨')
GROUP BY parties.name;
election_type | party_name | sum_votes |
---|---|---|
區域立委得票數 | 中國國民黨 | 4764576 |
區域立委得票數 | 台灣民眾黨 | 3040615 |
區域立委得票數 | 民主進步黨 | 4982062 |
總統副總統得票數 | 中國國民黨 | 4671021 |
總統副總統得票數 | 台灣民眾黨 | 3690466 |
總統副總統得票數 | 民主進步黨 | 5586019 |
Putting what we have so far all together¶
SQL is about the order of keywords, so follow this convention:
SELECT column_names,
CASE WHEN conditions THEN result
...
ELSE result_else END AS alias
FROM left_table
LEFT | RIGHT | FULL | JOIN right_table
ON table_name.join_key = table_name.primary_key
WHERE conditions
GROUP BY column_names
HAVING aggregated_conditions
UNION ...
ORDER BY column_names
LIMIT m
OFFSET n;
The taiwan_election_2024.db
database¶
Data source¶
The ETL steps¶
- Extract.
- Transform.
- Load and normalize.
Extract and Transform¶
- Import multiple excel workbooks as data frames.
- Tidy data frames.
- Transform data frames from wide formats to long formats.
Classfiy tables into parents and children¶
- Parents are tables being referenced.
- Children are tables referencing parents.
- e.g. In a
presidents.candidate_id
->candidates.id
relationship,presidents
is a child table whilecandidates
being a parent table. - e.g. In a
candidates.party_id
->parties.id
relationship,candidates
is a child table whileparties
being a parent table.
- e.g. In a
Load¶
con = sqlite3.connect("normalized_datasets/taiwan_election_2024.db")
cur = con.cursor()
folder_name = "normalized_datasets"
file_names = ["districts.csv", "parties.csv", "election_types.csv", "candidates.csv", "polling_places.csv",
"presidents.csv", "regional_legislators.csv", "aboriginal_legislators.csv", "party_legislators.csv",
"villages.csv"]
for file_name in file_names:
file_path = f"{folder_name}/{file_name}"
df = pd.read_csv(file_path)
table_name = file_name.split(".")[0]
df.to_sql(table_name, con, if_exists="replace", index=False)
Load: districts
¶
table_name = "districts"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
county CHAR(3),
town VARCHAR(200),
polling_place INTEGER,
vote_tallied_at CHAR(19),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: villages
¶
table_name = "villages"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
name VARCHAR(200),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: parties
¶
table_name = "parties"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
name VARCHAR(100),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: election_types
¶
table_name = "election_types"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
election_type VARCHAR(10),
PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: candidates
¶
table_name = "candidates"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
name VARCHAR(100),
party_id INTEGER,
election_type_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (party_id) REFERENCES parties (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: polling_places
¶
table_name = "polling_places"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
polling_place INTEGER,
effective_votes INTEGER,
wasted_votes INTEGER,
issued_votes INTEGER,
remained_votes INTEGER,
election_type_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: presidents
¶
table_name = "presidents"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
number INTEGER,
district_id INTEGER,
candidate_id INTEGER,
votes INTEGER,
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (candidate_id) REFERENCES candidates (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: presidents
¶
table_name = "regional_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
candidate_id INTEGER,
number INTEGER,
votes INTEGER,
legislator_region VARCHAR(50),
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (candidate_id) REFERENCES candidates (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: aboriginal_legislators
¶
table_name = "aboriginal_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
candidate_id INTEGER,
number INTEGER,
votes INTEGER,
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (candidate_id) REFERENCES candidates (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
Load: party_legislators
¶
table_name = "party_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
id INTEGER,
district_id INTEGER,
party_id INTEGER,
number INTEGER,
votes INTEGER,
election_type_id INTEGER,
village_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (district_id) REFERENCES districts (id),
FOREIGN KEY (party_id) REFERENCES parties (id),
FOREIGN KEY (village_id) REFERENCES villages (id),
FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()
Load: Drop duplicate tables¶
table_names = [fn.split(".")[0] for fn in file_names]
for table_name in table_names:
drop_table_long_str = f"""
PRAGMA foreign_keys=off;
DROP TABLE IF EXISTS {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(drop_table_long_str)
con.commit()
con.close()
ER diagram¶
An entity–relationship model (or ER model) describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types).
Source: https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model