Visualization with Modern Data Science¶

Set operations with SQL

Yao-Jen Kuo yaojenkuo@ntu.edu.tw from DATAINPOINT

In [1]:
%LOAD sqlite3 db=data/taiwan_election_2024.db timeout=2 shared_cache=true

View¶

What are views?¶

In a database, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary.

Source: https://en.wikipedia.org/wiki/View_(SQL)

Views are like¶

  • A persistent sub-query.
  • A stored SELECT statement that feels like a table to users.
  • A feature lies between sub-queries and creating a table.

CREATE VIEW statement¶

CREATE VIEW view_name
    AS
SELECT ...;
In [2]:
DROP VIEW IF EXISTS presidents_summary;
In [3]:
CREATE VIEW presidents_summary
    AS
SELECT number,
       candidate_id,
       SUM(votes) AS sum_votes
  FROM presidents
 GROUP BY number;

Once a view is created, use it like a table¶

In [4]:
SELECT *
  FROM presidents_summary;
Out[4]:
number candidate_id sum_votes
1 330 3690466
2 331 5586019
3 329 4671021
3 rows in set (0.07 sec)

View is stored in a persistent format¶

  • View is stored persistently in the connected database.
  • We have to explicitly DROP view after CREATE.
DROP VIEW IF EXISTS presidents_summary;

SELECT *
  FROM presidents_summary;

Common Table Expression, CTE¶

What is a common table expression¶

Common Table Expressions or CTEs act like temporary views that exist only for the duration of a single SQL statement.

Source: https://www.sqlite.org/lang_with.html

CTEs are like¶

  • A non-persistent sub-query.
  • A temporary SELECT statement that feels like a table/view for users.
  • A feature lies between sub-queries and creating a view.

WITH statement¶

WITH cte_name AS (
    SELECT ...
)
SELECT ...
  FROM cte_name;
WITH presidents_summary_cte AS (
    SELECT number,
           candidate_id,
           SUM(votes) AS sum_votes
      FROM presidents
     GROUP BY number
)
SELECT *
  FROM presidents_summary_cte;

Multiple CTEs¶

WITH cte_name_1 AS (
    SELECT ...
), cte_name_2 AS (
    SELECT ...
)
SELECT ...
  FROM cte_name_1
SELECT ...
  FROM cte_name_2;
WITH presidents_summary_cte AS (
    SELECT number,
           candidate_id,
           SUM(votes) AS sum_votes
      FROM presidents
     GROUP BY number
), party_legislator_summary_cte AS (
    SELECT number,
           party_id,
           SUM(votes) AS sum_votes
      FROM party_legislators
     GROUP BY number
)
SELECT *
  FROM presidents_summary_cte;
/*
SELECT *
  FROM party_legislator_summary_cte;
*/

Joining Tables¶

(Recap) What is a relational database¶

A relational database is a digital database based on the relational model of data.

Source: https://en.wikipedia.org/wiki/Relational_database

Why relational model¶

Using the relational model, we can build tables that eliminate duplicate data, are easier to maintain, and provide for increased flexibility in writing queries to get just the data we want.

Joining tables using JOIN...ON... statement¶

The query examines both tables and then returns columns from both tables where the values match in the columns specified in the ON statement.

SELECT parent_table.columns,
       child_table.columns
  FROM parent_table 
  JOIN child_table
    ON child_table.join_key = parent_table.primary_key

Joining presidents with candidates¶

In [5]:
SELECT presidents.number,
       candidates.name AS candidates_name,
       SUM(presidents.votes) AS sum_votes
  FROM presidents
  JOIN candidates
    ON presidents.candidate_id = candidates.id
 GROUP BY presidents.number;
Out[5]:
number candidates_name sum_votes
1 柯文哲/吳欣盈 3690466
2 賴清德/蕭美琴 5586019
3 侯友宜/趙少康 4671021
3 rows in set (0.03 sec)

Joining presidents with candidates and parties¶

In [6]:
SELECT presidents.number,
       candidates.name AS candidates_name,
       parties.name AS party_name,
       SUM(presidents.votes) AS sum_votes
  FROM presidents
  JOIN candidates
    ON presidents.candidate_id = candidates.id
  JOIN parties
    ON candidates.party_id = parties.id
 GROUP BY presidents.number;
Out[6]:
number candidates_name party_name sum_votes
1 柯文哲/吳欣盈 台灣民眾黨 3690466
2 賴清德/蕭美琴 民主進步黨 5586019
3 侯友宜/趙少康 中國國民黨 4671021
3 rows in set (0.04 sec)

Joining tables with key columns: Primary key¶

One or more columns whose values uniquely identify each row in a table.

  • Each column in the key must have a unique value for each row.
  • No column in the key can have missing values.
  • A primary key constraint is often set on id.

Joining tables with join key¶

  • One or more columns in a table that match the primary key of another table.
  • A join key is not a constraint.
  • Two common naming styles for a join key:
    • A join key has the same name as the primary key of another table.
    • A join key is named as the singular form of another table followed by _id. e.g. candidate_id -> candidates

JOIN returns rows from the left and the right table where matching values are found¶

In [10]:
SELECT presidents.number,
       candidates.name AS candidates_name,
       SUM(presidents.votes) AS sum_votes
  FROM presidents
  JOIN candidates
    ON presidents.candidate_id = candidates.id
 GROUP BY presidents.number;
Out[10]:
number candidates_name sum_votes
1 柯文哲/吳欣盈 3690466
2 賴清德/蕭美琴 5586019
3 侯友宜/趙少康 4671021
3 rows in set (0.04 sec)

In contrast to JOIN, the LEFT JOIN keyword returns all rows from the left table and display blank rows from the other table if no matching values are found¶

SELECT left_table.columns,
       right_table.columns
  FROM left_table 
  LEFT JOIN right_table
    ON left_table.join_key = right_table.primary_key
In [11]:
SELECT candidates.name AS candidates_name,
       SUM(presidents.votes) AS sum_votes
  FROM candidates
  LEFT JOIN presidents
    ON presidents.candidate_id = candidates.id
 GROUP BY candidates.name
 ORDER BY sum_votes DESC
 LIMIT 5;
Out[11]:
candidates_name sum_votes
賴清德/蕭美琴 5586019
侯友宜/趙少康 4671021
柯文哲/吳欣盈 3690466
黃秀芳 NULL
黃瑞傳 NULL
5 rows in set (0.11 sec)

Other JOIN types are now supported by SQLite¶

  • RIGHT JOIN
  • FULL JOIN

Joining tables is like concatenating tables horizontally¶

Imgur

Source: Pandas User Guide

We can also concatenating tables vertically via UNION keyword¶

Imgur

Source: Pandas User Guide

In [13]:
SELECT '總統副總統得票數' AS election_type,  
       parties.name AS party_name,
       SUM(presidents.votes) AS sum_votes
  FROM presidents
  JOIN candidates
    ON presidents.candidate_id = candidates.id
  JOIN parties
    ON candidates.party_id = parties.id
 GROUP BY parties.name
 UNION
SELECT '區域立委得票數' AS election_type,
       parties.name AS party_name,
       SUM(party_legislators.votes) AS sum_votes
  FROM party_legislators
  JOIN parties
    ON party_legislators.party_id = parties.id
 WHERE parties.name IN ('中國國民黨', '台灣民眾黨', '民主進步黨')
 GROUP BY parties.name;
Out[13]:
election_type party_name sum_votes
區域立委得票數 中國國民黨 4764576
區域立委得票數 台灣民眾黨 3040615
區域立委得票數 民主進步黨 4982062
總統副總統得票數 中國國民黨 4671021
總統副總統得票數 台灣民眾黨 3690466
總統副總統得票數 民主進步黨 5586019
6 rows in set (0.20 sec)

Putting what we have so far all together¶

SQL is about the order of keywords, so follow this convention:

SELECT column_names,
       CASE WHEN conditions THEN result
            ...
            ELSE result_else END AS alias
  FROM left_table
  LEFT | RIGHT | FULL | JOIN right_table
    ON table_name.join_key = table_name.primary_key
 WHERE conditions
 GROUP BY column_names
HAVING aggregated_conditions
 UNION ...
 ORDER BY column_names
 LIMIT m
OFFSET n;

The taiwan_election_2024.db database¶

Data source¶

https://db.cec.gov.tw/ElecTable/Election

The ETL steps¶

  1. Extract.
  2. Transform.
  3. Load and normalize.

Extract and Transform¶

  • Import multiple excel workbooks as data frames.
  • Tidy data frames.
  • Transform data frames from wide formats to long formats.

Classfiy tables into parents and children¶

  • Parents are tables being referenced.
  • Children are tables referencing parents.
    • e.g. In a presidents.candidate_id -> candidates.id relationship, presidents is a child table while candidates being a parent table.
    • e.g. In a candidates.party_id -> parties.id relationship, candidates is a child table while parties being a parent table.

Load¶

con = sqlite3.connect("normalized_datasets/taiwan_election_2024.db")
cur = con.cursor()
folder_name = "normalized_datasets"
file_names = ["districts.csv", "parties.csv", "election_types.csv", "candidates.csv", "polling_places.csv",
              "presidents.csv", "regional_legislators.csv", "aboriginal_legislators.csv", "party_legislators.csv",
              "villages.csv"]
for file_name in file_names:
    file_path = f"{folder_name}/{file_name}"
    df = pd.read_csv(file_path)
    table_name = file_name.split(".")[0]
    df.to_sql(table_name, con, if_exists="replace", index=False)

Load: districts¶

table_name = "districts"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    county CHAR(3),
    town VARCHAR(200),
    polling_place INTEGER,
    vote_tallied_at CHAR(19),
    PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: villages¶

table_name = "villages"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    name VARCHAR(200),
    PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: parties¶

table_name = "parties"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    name VARCHAR(100),
    PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: election_types¶

table_name = "election_types"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    election_type VARCHAR(10),
    PRIMARY KEY (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: candidates¶

table_name = "candidates"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    name VARCHAR(100),
    party_id INTEGER,
    election_type_id INTEGER,
    PRIMARY KEY (id),
    FOREIGN KEY (party_id) REFERENCES parties (id),
    FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: polling_places¶

table_name = "polling_places"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    district_id INTEGER,
    polling_place INTEGER,
    effective_votes INTEGER,
    wasted_votes INTEGER,
    issued_votes INTEGER,
    remained_votes INTEGER,
    election_type_id INTEGER,
    PRIMARY KEY (id),
    FOREIGN KEY (district_id) REFERENCES districts (id),
    FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: presidents¶

table_name = "presidents"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    number INTEGER,
    district_id INTEGER,
    candidate_id INTEGER,
    votes INTEGER,
    election_type_id INTEGER,
    village_id INTEGER,
    PRIMARY KEY (id),
    FOREIGN KEY (district_id) REFERENCES districts (id),
    FOREIGN KEY (candidate_id) REFERENCES candidates (id),
    FOREIGN KEY (village_id) REFERENCES villages (id),
    FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: presidents¶

table_name = "regional_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    district_id INTEGER,
    candidate_id INTEGER,
    number INTEGER,
    votes INTEGER,
    legislator_region VARCHAR(50),
    election_type_id INTEGER,
    village_id INTEGER,
    PRIMARY KEY (id),
    FOREIGN KEY (district_id) REFERENCES districts (id),
    FOREIGN KEY (candidate_id) REFERENCES candidates (id),
    FOREIGN KEY (village_id) REFERENCES villages (id),
    FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: aboriginal_legislators¶

table_name = "aboriginal_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    district_id INTEGER,
    candidate_id INTEGER,
    number INTEGER,
    votes INTEGER,
    election_type_id INTEGER,
    village_id INTEGER,
    PRIMARY KEY (id),
    FOREIGN KEY (district_id) REFERENCES districts (id),
    FOREIGN KEY (candidate_id) REFERENCES candidates (id),
    FOREIGN KEY (village_id) REFERENCES villages (id),
    FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)

Load: party_legislators¶

table_name = "party_legislators"
create_table_long_str = f"""
PRAGMA foreign_keys=off;
ALTER TABLE {table_name} RENAME TO {table_name}_no_keys;
CREATE TABLE {table_name} (
    id INTEGER,
    district_id INTEGER,
    party_id INTEGER,
    number INTEGER,
    votes INTEGER,
    election_type_id INTEGER,
    village_id INTEGER,
    PRIMARY KEY (id),
    FOREIGN KEY (district_id) REFERENCES districts (id),
    FOREIGN KEY (party_id) REFERENCES parties (id),
    FOREIGN KEY (village_id) REFERENCES villages (id),
    FOREIGN KEY (election_type_id) REFERENCES election_types (id)
);
INSERT INTO {table_name} SELECT * FROM {table_name}_no_keys;
PRAGMA foreign_keys=on;
"""
cur.executescript(create_table_long_str)
con.commit()

Load: Drop duplicate tables¶

table_names = [fn.split(".")[0] for fn in file_names]
for table_name in table_names:
    drop_table_long_str = f"""
    PRAGMA foreign_keys=off;
    DROP TABLE IF EXISTS {table_name}_no_keys;
    PRAGMA foreign_keys=on;
    """
    cur.executescript(drop_table_long_str)
    con.commit()
con.close()

ER diagram¶

An entity–relationship model (or ER model) describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types).

Source: https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model

ER diagram of taiwan_election_2024.db¶