Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.
If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds
If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds
If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds
This is just for reading data, if we make any modifications (writing) things are much worse.
3.5 Blocks
Cost: Disk << Memory
Speed: Disk <<< Memory
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and what if we can’t fit everything into memory?
Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
3.6 Linear vs binary Search
Even with blocks, any kind of subsetting of rows requires a linear search, which requires \(O(N)\) accesses where \(N\) is the number of blocks.
We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.
Sorting is expensive, \(O(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks \(O(\log N)\).
These sorted columns are known as indexes.
Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.
4 Databases
4.1 SQL
Structured Query Language (SQL) is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.
ANSI Standard but with some dialect divergence.
SQL functionality maps very closely (but not exactly) with the data manipulation verbs in dplyr.
We will see this mapping in more detail in a bit.
4.2 Access databases from R
dplyr package supports a variety of databases.
Open source databases: SQLite, MySQL, PostgreSQL, BigQuery.
Commercial databases: Oracle, Microsoft SQL Server.
name email salary dept
1 Alice alice@company.com 52000 Accounting
2 Bob bob@company.com 40000 Accounting
3 Carol carol@company.com 30000 Sales
4 Dave dave@company.com 33000 Accounting
5 Eve eve@company.com 44000 Sales
6 Frank frank@comany.com 37000 Sales
dbClearResult(res)
5.6 Closing the connection
dbDisconnect(con)
6 SQL Queries
Following we demonstrate some common SQL commands, although all task can be achieved by using dplyr as well.
6.1 Connecting
con <-dbConnect(RSQLite::SQLite(), dbname ="employee.sqlite")dbListTables(con)
[1] "employees" "phone"
knitr::opts_chunk$set(connection ="con")
Below we demonstrate some common SQL queries.
6.2 SELECT statements
SELECT*FROM employees;
6 records
name
email
salary
dept
Alice
alice@company.com
52000
Accounting
Bob
bob@company.com
40000
Accounting
Carol
carol@company.com
30000
Sales
Dave
dave@company.com
33000
Accounting
Eve
eve@company.com
44000
Sales
Frank
frank@comany.com
37000
Sales
SELECT*FROM phone;
4 records
name
phone
Bob
919 555-1111
Carol
919 555-2222
Eve
919 555-3333
Frank
919 555-4444
6.3 Select using SELECT
SELECT name AS first_name, salary FROM employees;
6 records
first_name
salary
Alice
52000
Bob
40000
Carol
30000
Dave
33000
Eve
44000
Frank
37000
6.4 Arrange using ORDER BY
SELECT name AS first_name, salary FROM employees ORDERBY salary;
6 records
first_name
salary
Carol
30000
Dave
33000
Frank
37000
Bob
40000
Eve
44000
Alice
52000
Descending order:
SELECT name AS first_name, salary FROM employees ORDERBY salary DESC;
6 records
first_name
salary
Alice
52000
Eve
44000
Bob
40000
Frank
37000
Dave
33000
Carol
30000
6.5 Filter via WHERE
SELECT*FROM employees WHERE salary <40000;
3 records
name
email
salary
dept
Carol
carol@company.com
30000
Sales
Dave
dave@company.com
33000
Accounting
Frank
frank@comany.com
37000
Sales
6.6 Group_by via GROUP BY
First record in each group.
SELECT*FROM employees GROUPBY dept;
2 records
name
email
salary
dept
Alice
alice@company.com
52000
Accounting
Carol
carol@company.com
30000
Sales
6.7 Head via LIMIT
First 3 records:
SELECT*FROM employees LIMIT3;
3 records
name
email
salary
dept
Alice
alice@company.com
52000
Accounting
Bob
bob@company.com
40000
Accounting
Carol
carol@company.com
30000
Sales
Last 3 records:
SELECT*FROM employees ORDERBY name DESCLIMIT3;
3 records
name
email
salary
dept
Frank
frank@comany.com
37000
Sales
Eve
eve@company.com
44000
Sales
Dave
dave@company.com
33000
Accounting
6.8 Join two tables (default)
By default SQLite uses a CROSS JOIN (all row combinations) which is not terribly useful. Note only first 10 rows of results are shown.
SELECT*FROM employees JOIN phone;
Displaying records 1 - 10
name
email
salary
dept
name
phone
Alice
alice@company.com
52000
Accounting
Bob
919 555-1111
Alice
alice@company.com
52000
Accounting
Carol
919 555-2222
Alice
alice@company.com
52000
Accounting
Eve
919 555-3333
Alice
alice@company.com
52000
Accounting
Frank
919 555-4444
Bob
bob@company.com
40000
Accounting
Bob
919 555-1111
Bob
bob@company.com
40000
Accounting
Carol
919 555-2222
Bob
bob@company.com
40000
Accounting
Eve
919 555-3333
Bob
bob@company.com
40000
Accounting
Frank
919 555-4444
Carol
carol@company.com
30000
Sales
Bob
919 555-1111
Carol
carol@company.com
30000
Sales
Carol
919 555-2222
6.9 Inner join by NATURAL
By default, the common variable name is used as key.
SELECT*FROM employees NATURALJOIN phone;
4 records
name
email
salary
dept
phone
Bob
bob@company.com
40000
Accounting
919 555-1111
Carol
carol@company.com
30000
Sales
919 555-2222
Eve
eve@company.com
44000
Sales
919 555-3333
Frank
frank@comany.com
37000
Sales
919 555-4444
6.10 Inner join - explicit
Explicitly specify key.
SELECT*FROM employees JOIN phone ON employees.name = phone.name;
4 records
name
email
salary
dept
name
phone
Bob
bob@company.com
40000
Accounting
Bob
919 555-1111
Carol
carol@company.com
30000
Sales
Carol
919 555-2222
Eve
eve@company.com
44000
Sales
Eve
919 555-3333
Frank
frank@comany.com
37000
Sales
Frank
919 555-4444
6.11 Left join - natural
SELECT*FROM employees NATURALLEFTJOIN phone;
6 records
name
email
salary
dept
phone
Alice
alice@company.com
52000
Accounting
NA
Bob
bob@company.com
40000
Accounting
919 555-1111
Carol
carol@company.com
30000
Sales
919 555-2222
Dave
dave@company.com
33000
Accounting
NA
Eve
eve@company.com
44000
Sales
919 555-3333
Frank
frank@comany.com
37000
Sales
919 555-4444
6.12 Left join - explicit
SELECT*FROM employees LEFTJOIN phone ON employees.name = phone.name;
6 records
name
email
salary
dept
name
phone
Alice
alice@company.com
52000
Accounting
NA
NA
Bob
bob@company.com
40000
Accounting
Bob
919 555-1111
Carol
carol@company.com
30000
Sales
Carol
919 555-2222
Dave
dave@company.com
33000
Accounting
NA
NA
Eve
eve@company.com
44000
Sales
Eve
919 555-3333
Frank
frank@comany.com
37000
Sales
Frank
919 555-4444
6.13 Other joins
SQLite does not support directly an OUTER JOIN or a RIGHT JOIN.
6.14 Creating indices
CREATEINDEX index_name ON employees (name);
CREATEINDEX index_name_email ON employees (name, email);