Working with Databases - Part I

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 13, 2023

1 Data scientist

2 A typical data science project:

2.1 Acknowledgement

Much material in this lecture is adapted from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html.

3 Why databases?

3.1 Size of data

  • Small data: those can fit into computer memory.

  • Bigish data: those can fit into disk(s) of a single machine.

  • Big data: those cannot fit into disk(s) of a single machine.

3.2 Computer architecture

Key to high performance is effective use of memory hierarchy. True on all architectures.

3.3 Numbers everyone should know

Operation Time
L1 cache reference 0.5 ns
L2 cache reference 7 ns
Main memory reference 100 ns
Read 1 MB sequentially from memory 250,000 ns
Read 1 MB sequentially from SSD 1,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns

Source: https://gist.github.com/jboner/2841832

3.4 Implications for bigish data

Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds

If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds

This is just for reading data, if we make any modifications (writing) things are much worse.

3.5 Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and what if we can’t fit everything into memory?

Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

4 Databases

4.1 SQL

Structured Query Language (SQL) is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.

  • ANSI Standard but with some dialect divergence.

  • SQL functionality maps very closely (but not exactly) with the data manipulation verbs in dplyr.

  • We will see this mapping in more detail in a bit.

4.2 Access databases from R

  • dplyr package supports a variety of databases.

    • Open source databases: SQLite, MySQL, PostgreSQL, BigQuery.
    • Commercial databases: Oracle, Microsoft SQL Server.
    • See link for a complete list.
  • DBI package provides a common interface for connecting to databases.

  • dbplyr package is the backend that translates dplyr verbs to database SQL queries.

  • To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.

5 A sample session using SQLite

5.1 Create an SQLite database

Create an SQLite database employee.sqlite on disk (or in memory) for learning purpose.

library("DBI")
library("RSQLite")

con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
# con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  ..@ ptr                :<externalptr> 
  ..@ dbname             : chr "/Users/huazhou/Documents/github.com/ucla-biostat-203b/2023winter/slides/12-dbplyr/employee.sqlite"
  ..@ loadable.extensions: logi TRUE
  ..@ flags              : int 70
  ..@ vfs                : chr ""
  ..@ ref                :<environment: 0x7f8e8a0364b8> 
  ..@ bigint             : chr "integer64"
  ..@ extended_types     : logi FALSE

5.2 Add a table into database

First table:

library("tidyverse")

(employees <- tibble(name   = c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank"),
                     email  = c("alice@company.com", "bob@company.com",
                                "carol@company.com", "dave@company.com",
                                "eve@company.com",   "frank@comany.com"),
                     salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                     dept   = c("Accounting", "Accounting", "Sales",
                                "Accounting", "Sales", "Sales")))
# A tibble: 6 × 4
  name  email             salary dept      
  <chr> <chr>              <dbl> <chr>     
1 Alice alice@company.com  52000 Accounting
2 Bob   bob@company.com    40000 Accounting
3 Carol carol@company.com  30000 Sales     
4 Dave  dave@company.com   33000 Accounting
5 Eve   eve@company.com    44000 Sales     
6 Frank frank@comany.com   37000 Sales     

Second table:

(phone <- tibble(name  = c("Bob", "Carol", "Eve", "Frank"),
                 phone = c("919 555-1111", "919 555-2222", "919 555-3333", "919 555-4444")))
# A tibble: 4 × 2
  name  phone       
  <chr> <chr>       
1 Bob   919 555-1111
2 Carol 919 555-2222
3 Eve   919 555-3333
4 Frank 919 555-4444

Write tables to the database:

dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
[1] "employees" "phone"    

5.3 Add another table

dbWriteTable(con, "employs", employees)
dbListTables(con)
[1] "employees" "employs"   "phone"    

5.4 Remove a table from database

dbRemoveTable(con, "employs")
dbListTables(con)
[1] "employees" "phone"    

5.5 Querying tables

# select all columns from table employees
res <- dbSendQuery(con, "SELECT * FROM employees")
res
<SQLiteResult>
  SQL  SELECT * FROM employees
  ROWS Fetched: 0 [incomplete]
       Changed: 0
# execute the query
dbFetch(res)
   name             email salary       dept
1 Alice alice@company.com  52000 Accounting
2   Bob   bob@company.com  40000 Accounting
3 Carol carol@company.com  30000      Sales
4  Dave  dave@company.com  33000 Accounting
5   Eve   eve@company.com  44000      Sales
6 Frank  frank@comany.com  37000      Sales
dbClearResult(res)

5.6 Closing the connection

dbDisconnect(con)

6 SQL Queries

Following we demonstrate some common SQL commands, although all task can be achieved by using dplyr as well.

6.1 Connecting

con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)
[1] "employees" "phone"    
knitr::opts_chunk$set(connection = "con")

Below we demonstrate some common SQL queries.

6.2 SELECT statements

SELECT * FROM employees;
6 records
name email salary dept
Alice alice@company.com 52000 Accounting
Bob bob@company.com 40000 Accounting
Carol carol@company.com 30000 Sales
Dave dave@company.com 33000 Accounting
Eve eve@company.com 44000 Sales
Frank frank@comany.com 37000 Sales
SELECT * FROM phone;
4 records
name phone
Bob 919 555-1111
Carol 919 555-2222
Eve 919 555-3333
Frank 919 555-4444

6.3 Select using SELECT

SELECT name AS first_name, salary FROM employees;
6 records
first_name salary
Alice 52000
Bob 40000
Carol 30000
Dave 33000
Eve 44000
Frank 37000

6.4 Arrange using ORDER BY

SELECT name AS first_name, salary FROM employees ORDER BY salary;
6 records
first_name salary
Carol 30000
Dave 33000
Frank 37000
Bob 40000
Eve 44000
Alice 52000

Descending order:

SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
6 records
first_name salary
Alice 52000
Eve 44000
Bob 40000
Frank 37000
Dave 33000
Carol 30000

6.5 Filter via WHERE

SELECT * FROM employees WHERE salary < 40000;
3 records
name email salary dept
Carol carol@company.com 30000 Sales
Dave dave@company.com 33000 Accounting
Frank frank@comany.com 37000 Sales

6.6 Group_by via GROUP BY

First record in each group.

SELECT * FROM employees GROUP BY dept;
2 records
name email salary dept
Alice alice@company.com 52000 Accounting
Carol carol@company.com 30000 Sales

6.7 Head via LIMIT

First 3 records:

SELECT * FROM employees LIMIT 3;
3 records
name email salary dept
Alice alice@company.com 52000 Accounting
Bob bob@company.com 40000 Accounting
Carol carol@company.com 30000 Sales

Last 3 records:

SELECT * FROM employees ORDER BY name DESC LIMIT 3;
3 records
name email salary dept
Frank frank@comany.com 37000 Sales
Eve eve@company.com 44000 Sales
Dave dave@company.com 33000 Accounting

6.8 Join two tables (default)

By default SQLite uses a CROSS JOIN (all row combinations) which is not terribly useful. Note only first 10 rows of results are shown.

SELECT * FROM employees JOIN phone;
Displaying records 1 - 10
name email salary dept name phone
Alice alice@company.com 52000 Accounting Bob 919 555-1111
Alice alice@company.com 52000 Accounting Carol 919 555-2222
Alice alice@company.com 52000 Accounting Eve 919 555-3333
Alice alice@company.com 52000 Accounting Frank 919 555-4444
Bob bob@company.com 40000 Accounting Bob 919 555-1111
Bob bob@company.com 40000 Accounting Carol 919 555-2222
Bob bob@company.com 40000 Accounting Eve 919 555-3333
Bob bob@company.com 40000 Accounting Frank 919 555-4444
Carol carol@company.com 30000 Sales Bob 919 555-1111
Carol carol@company.com 30000 Sales Carol 919 555-2222

6.9 Inner join by NATURAL

By default, the common variable name is used as key.

SELECT * FROM employees NATURAL JOIN phone;
4 records
name email salary dept phone
Bob bob@company.com 40000 Accounting 919 555-1111
Carol carol@company.com 30000 Sales 919 555-2222
Eve eve@company.com 44000 Sales 919 555-3333
Frank frank@comany.com 37000 Sales 919 555-4444

6.10 Inner join - explicit

Explicitly specify key.

SELECT * FROM employees JOIN phone ON employees.name = phone.name;
4 records
name email salary dept name phone
Bob bob@company.com 40000 Accounting Bob 919 555-1111
Carol carol@company.com 30000 Sales Carol 919 555-2222
Eve eve@company.com 44000 Sales Eve 919 555-3333
Frank frank@comany.com 37000 Sales Frank 919 555-4444

6.11 Left join - natural

SELECT * FROM employees NATURAL LEFT JOIN phone;
6 records
name email salary dept phone
Alice alice@company.com 52000 Accounting NA
Bob bob@company.com 40000 Accounting 919 555-1111
Carol carol@company.com 30000 Sales 919 555-2222
Dave dave@company.com 33000 Accounting NA
Eve eve@company.com 44000 Sales 919 555-3333
Frank frank@comany.com 37000 Sales 919 555-4444

6.12 Left join - explicit

SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
6 records
name email salary dept name phone
Alice alice@company.com 52000 Accounting NA NA
Bob bob@company.com 40000 Accounting Bob 919 555-1111
Carol carol@company.com 30000 Sales Carol 919 555-2222
Dave dave@company.com 33000 Accounting NA NA
Eve eve@company.com 44000 Sales Eve 919 555-3333
Frank frank@comany.com 37000 Sales Frank 919 555-4444

6.13 Other joins

SQLite does not support directly an OUTER JOIN or a RIGHT JOIN.

6.14 Creating indices

CREATE INDEX index_name ON employees (name);
CREATE INDEX index_name_email ON employees (name, email);
sqlite3 employee.sqlite .indices
index_name        index_name_email

6.15 Close connection

dbDisconnect(con)