Preamble
Display machine information for reproducibility.
R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.2 fastmap_1.1.1 cli_3.6.2
[5] tools_4.3.2 htmltools_0.5.7 rstudioapi_0.15.0 yaml_2.3.8
[9] rmarkdown_2.25 knitr_1.45 jsonlite_1.8.8 xfun_0.41
[13] digest_0.6.33 rlang_1.1.2 evaluate_0.23
import IPython
print (IPython.sys_info())
{'commit_hash': 'dc4369111',
'commit_source': 'installation',
'default_encoding': 'utf-8',
'ipython_path': '/opt/venv/lib/python3.10/site-packages/IPython',
'ipython_version': '8.19.0',
'os_name': 'posix',
'platform': 'Linux-6.6.12-linuxkit-aarch64-with-glibc2.35',
'sys_executable': '/opt/venv/bin/python',
'sys_platform': 'linux',
'sys_version': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]'}
using InteractiveUtils
versioninfo ()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (aarch64-linux-gnu)
CPU: 12 × unknown
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, generic)
Threads: 1 on 12 virtual cores
Environment:
JULIA_PKGDIR = /opt/julia
JULIA_DEPOT_PATH = /opt/julia
LD_LIBRARY_PATH = /usr/local/lib/R/lib:/usr/local/lib:/usr/lib/aarch64-linux-gnu:/usr/lib/jvm/java-11-openjdk-arm64/lib/server:/usr/local/lib/R/lib:/lib:/usr/local/lib:/usr/lib/aarch64-linux-gnu:/usr/lib/jvm/java-11-openjdk-arm64/lib/server
Load tidyverse (R), Pandas (Python), and DataFrames.jl (Julia).
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
using DataFrames , Pipe , StatsBase
A typical data science project:
nycflights13 data
library ("nycflights13" )
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The nycflights13 data is available from the nycflights13 package in Python.
from nycflights13 import flights
flights
year month day ... hour minute time_hour
0 2013 1 1 ... 5 15 2013-01-01T10:00:00Z
1 2013 1 1 ... 5 29 2013-01-01T10:00:00Z
2 2013 1 1 ... 5 40 2013-01-01T10:00:00Z
3 2013 1 1 ... 5 45 2013-01-01T10:00:00Z
4 2013 1 1 ... 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ...
336771 2013 9 30 ... 14 55 2013-09-30T18:00:00Z
336772 2013 9 30 ... 22 0 2013-10-01T02:00:00Z
336773 2013 9 30 ... 12 10 2013-09-30T16:00:00Z
336774 2013 9 30 ... 11 59 2013-09-30T15:00:00Z
336775 2013 9 30 ... 8 40 2013-09-30T12:00:00Z
[336776 rows x 19 columns]
Note there are some differences of this flights
data from that in tidyverse. The data types for some variables are different. There are no natural ways in Pandas to hold integer column with missing values; so dep_time
, arr_time
are float64
instead of int64
.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
To be more consistent with nycflights13
in tidyverse, we cast time_hour
to datetime
type.
flights['time_hour' ] = pd.to_datetime(flights['time_hour' ])
Let’s use RCall.jl to retrieve the nycflights13 data from R.
using RCall
R"""
library(nycflights13)
"""
flights = rcopy (R"flights" )
336776×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
6 │ 2013 1 1 554 558 -4.0 740
7 │ 2013 1 1 555 600 -5.0 913
8 │ 2013 1 1 557 600 -3.0 709
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 2349 2359 -10.0 325 ⋯
336771 │ 2013 9 30 missing 1842 missing missing
336772 │ 2013 9 30 missing 1455 missing missing
336773 │ 2013 9 30 missing 2200 missing missing
336774 │ 2013 9 30 missing 1210 missing missing ⋯
336775 │ 2013 9 30 missing 1159 missing missing
336776 │ 2013 9 30 missing 840 missing missing
12 columns and 336761 rows omitted
To display more rows or columns:
By default, tibble prints the first 10 rows and all columns that fit on screen .
To change number of rows and columns to display:
nycflights13:: flights |>
print (n = 10 , width = Inf )
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 12 UA 1696 N39463 EWR ORD 150 719 5 58
7 19 B6 507 N516JB EWR FLL 158 1065 6 0
8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
9 -8 B6 79 N593JB JFK MCO 140 944 6 0
10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows
Here we see the pipe operator |>
pipes the output from previous command to the (first) argument of the next command.
To change the default print setting globally:
options(tibble.print_max = n, tibble.print_min = m)
: if more than m
rows, print only n
rows.
options(dplyr.print_min = Inf)
: print all row.
options(tibble.width = Inf)
: print all columns.
pd.set_option("display.max_rows" , 500 )
pd.set_option("display.max_columns" , 20 )
By default DataFrames.jl limits the number of rows and columns when displaying a data frame in a Jupyter Notebook to 25 and 100, respectively. You can override this behavior by changing the values of the ENV["DATAFRAMES_COLUMNS"]
and ENV["DATAFRAMES_ROWS"]
variables to hold the maximum number of columns and rows of the output. All columns or rows will be printed if those numbers are equal or lower than 0.
dplyr basics
Pick observations (rows) by their values: filter()
.
Reorder the rows: arrange()
.
Pick variables (columns) by their names: select()
.
Create new variables with functions of existing variables: mutate()
.
Collapse many values down to a single summary: summarise()
.
verb meaning
--------------------------------------------
filter() subset observations (or rows)
arrange() re-order the observations
distinct() remove duplicate entries
slice_*() select rows by position
sample_*() sample rows
--------------------------------------------
select() select variables (or columns)
mutate() add new variables (or columns)
relocate() move variables (or columns) to new positions
rename() rename variables (or columns)
--------------------------------------------
group_by() aggregate
summarise() reduce to a single row
--------------------------------------------
left_join() merge two data objects
collect() force computation and bring data back into R
Manipulate rows (cases)
Filter rows with filter()
# same as filter(flights, month == 1 & day == 1)
filter (flights, month == 1 , day == 1 )
# A tibble: 842 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights[(flights['month' ] == 1 ) & (flights['day' ] == 1 )]
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
.. ... ... ... ... ... ... ...
837 2013 1 1 2356.0 2359 -3.0 425.0
838 2013 1 1 NaN 1630 NaN NaN
839 2013 1 1 NaN 1935 NaN NaN
840 2013 1 1 NaN 1500 NaN NaN
841 2013 1 1 NaN 600 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest air_time \
0 819 11.0 UA 1545 N14228 EWR IAH 227.0
1 830 20.0 UA 1714 N24211 LGA IAH 227.0
2 850 33.0 AA 1141 N619AA JFK MIA 160.0
3 1022 -18.0 B6 725 N804JB JFK BQN 183.0
4 837 -25.0 DL 461 N668DN LGA ATL 116.0
.. ... ... ... ... ... ... ... ...
837 437 -12.0 B6 727 N588JB JFK BQN 186.0
838 1815 NaN EV 4308 N18120 EWR RDU NaN
839 2240 NaN AA 791 N3EHAA LGA DFW NaN
840 1825 NaN AA 1925 N3EVAA LGA MIA NaN
841 901 NaN B6 125 N618JB JFK FLL NaN
distance hour minute time_hour
0 1400 5 15 2013-01-01 10:00:00+00:00
1 1416 5 29 2013-01-01 10:00:00+00:00
2 1089 5 40 2013-01-01 10:00:00+00:00
3 1576 5 45 2013-01-01 10:00:00+00:00
4 762 6 0 2013-01-01 11:00:00+00:00
.. ... ... ... ...
837 1576 23 59 2013-01-02 04:00:00+00:00
838 416 16 30 2013-01-01 21:00:00+00:00
839 1389 19 35 2013-01-02 00:00:00+00:00
840 1096 15 0 2013-01-01 20:00:00+00:00
841 1069 6 0 2013-01-01 11:00:00+00:00
[842 rows x 19 columns]
filter (row -> (row.month == 1 ) & (row.day == 1 ), flights)
842×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
6 │ 2013 1 1 554 558 -4.0 740
7 │ 2013 1 1 555 600 -5.0 913
8 │ 2013 1 1 557 600 -3.0 709
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
836 │ 2013 1 1 2353 2359 -6.0 425 ⋯
837 │ 2013 1 1 2353 2359 -6.0 418
838 │ 2013 1 1 2356 2359 -3.0 425
839 │ 2013 1 1 missing 1630 missing missing
840 │ 2013 1 1 missing 1935 missing missing ⋯
841 │ 2013 1 1 missing 1500 missing missing
842 │ 2013 1 1 missing 600 missing missing
12 columns and 827 rows omitted
filter (flights, month == 11 | month == 12 )
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights[(flights['month' ] == 11 ) | (flights['month' ] == 12 )]
year month day dep_time sched_dep_time dep_delay arr_time \
55893 2013 11 1 5.0 2359 6.0 352.0
55894 2013 11 1 35.0 2250 105.0 123.0
55895 2013 11 1 455.0 500 -5.0 641.0
55896 2013 11 1 539.0 545 -6.0 856.0
55897 2013 11 1 542.0 545 -3.0 831.0
... ... ... ... ... ... ... ...
111291 2013 12 31 NaN 705 NaN NaN
111292 2013 12 31 NaN 825 NaN NaN
111293 2013 12 31 NaN 1615 NaN NaN
111294 2013 12 31 NaN 600 NaN NaN
111295 2013 12 31 NaN 830 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
55893 345 7.0 B6 745 N568JB JFK PSE
55894 2356 87.0 B6 1816 N353JB JFK SYR
55895 651 -10.0 US 1895 N192UW EWR CLT
55896 827 29.0 UA 1714 N38727 LGA IAH
55897 855 -24.0 AA 2243 N5CLAA JFK MIA
... ... ... ... ... ... ... ...
111291 931 NaN UA 1729 NaN EWR DEN
111292 1029 NaN US 1831 NaN JFK CLT
111293 1800 NaN MQ 3301 N844MQ LGA RDU
111294 735 NaN UA 219 NaN EWR ORD
111295 1154 NaN UA 443 NaN JFK LAX
air_time distance hour minute time_hour
55893 205.0 1617 23 59 2013-11-02 03:00:00+00:00
55894 36.0 209 22 50 2013-11-02 02:00:00+00:00
55895 88.0 529 5 0 2013-11-01 09:00:00+00:00
55896 229.0 1416 5 45 2013-11-01 09:00:00+00:00
55897 147.0 1089 5 45 2013-11-01 09:00:00+00:00
... ... ... ... ... ...
111291 NaN 1605 7 5 2013-12-31 12:00:00+00:00
111292 NaN 541 8 25 2013-12-31 13:00:00+00:00
111293 NaN 431 16 15 2013-12-31 21:00:00+00:00
111294 NaN 719 6 0 2013-12-31 11:00:00+00:00
111295 NaN 2475 8 30 2013-12-31 13:00:00+00:00
[55403 rows x 19 columns]
filter (row -> (row.month == 11 ) | (row.month == 12 ), flights)
55403×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time s ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? I ⋯
───────┼────────────────────────────────────────────────────────────────────────
1 │ 2013 11 1 5 2359 6.0 352 ⋯
2 │ 2013 11 1 35 2250 105.0 123
3 │ 2013 11 1 455 500 -5.0 641
4 │ 2013 11 1 539 545 -6.0 856
5 │ 2013 11 1 542 545 -3.0 831 ⋯
6 │ 2013 11 1 549 600 -11.0 912
7 │ 2013 11 1 550 600 -10.0 705
8 │ 2013 11 1 554 600 -6.0 659
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
55397 │ 2013 12 31 missing 1430 missing missing ⋯
55398 │ 2013 12 31 missing 855 missing missing
55399 │ 2013 12 31 missing 705 missing missing
55400 │ 2013 12 31 missing 825 missing missing
55401 │ 2013 12 31 missing 1615 missing missing ⋯
55402 │ 2013 12 31 missing 600 missing missing
55403 │ 2013 12 31 missing 830 missing missing
12 columns and 55388 rows omitted
Remove rows with duplicate values
distinct (flights, month, .keep_all = TRUE )
# A tibble: 12 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 10 1 447 500 -13 614 648
3 2013 11 1 5 2359 6 352 345
4 2013 12 1 13 2359 14 446 445
5 2013 2 1 456 500 -4 652 648
6 2013 3 1 4 2159 125 318 56
7 2013 4 1 454 500 -6 636 640
8 2013 5 1 9 1655 434 308 2020
9 2013 6 1 2 2359 3 341 350
10 2013 7 1 1 2029 212 236 2359
11 2013 8 1 12 2130 162 257 14
12 2013 9 1 9 2359 10 343 340
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
With .keep_all = FALSE
, all variables/columns are kept:
# A tibble: 12 × 1
month
<int>
1 1
2 10
3 11
4 12
5 2
6 3
7 4
8 5
9 6
10 7
11 8
12 9
flights.drop_duplicates(subset = ['month' ])
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
27004 2013 10 1 447.0 500 -13.0 614.0
55893 2013 11 1 5.0 2359 6.0 352.0
83161 2013 12 1 13.0 2359 14.0 446.0
111296 2013 2 1 456.0 500 -4.0 652.0
136247 2013 3 1 4.0 2159 125.0 318.0
165081 2013 4 1 454.0 500 -6.0 636.0
193411 2013 5 1 9.0 1655 434.0 308.0
222207 2013 6 1 2.0 2359 3.0 341.0
250450 2013 7 1 1.0 2029 212.0 236.0
279875 2013 8 1 12.0 2130 162.0 257.0
309202 2013 9 1 9.0 2359 10.0 343.0
sched_arr_time arr_delay carrier flight tailnum origin dest \
0 819 11.0 UA 1545 N14228 EWR IAH
27004 648 -34.0 US 1877 N538UW EWR CLT
55893 345 7.0 B6 745 N568JB JFK PSE
83161 445 1.0 B6 745 N715JB JFK PSE
111296 648 4.0 US 1117 N197UW EWR CLT
136247 56 142.0 B6 11 N706JB JFK FLL
165081 640 -4.0 US 1843 N566UW EWR CLT
193411 2020 408.0 VX 413 N628VA JFK LAX
222207 350 -9.0 B6 739 N618JB JFK PSE
250450 2359 157.0 B6 915 N653JB JFK SFO
279875 14 163.0 B6 1371 N618JB LGA FLL
309202 340 3.0 B6 839 N663JB JFK BQN
air_time distance hour minute time_hour
0 227.0 1400 5 15 2013-01-01 10:00:00+00:00
27004 69.0 529 5 0 2013-10-01 09:00:00+00:00
55893 205.0 1617 23 59 2013-11-02 03:00:00+00:00
83161 195.0 1617 23 59 2013-12-02 04:00:00+00:00
111296 98.0 529 5 0 2013-02-01 10:00:00+00:00
136247 166.0 1069 21 59 2013-03-02 02:00:00+00:00
165081 84.0 529 5 0 2013-04-01 09:00:00+00:00
193411 341.0 2475 16 55 2013-05-01 20:00:00+00:00
222207 200.0 1617 23 59 2013-06-02 03:00:00+00:00
250450 315.0 2586 20 29 2013-07-02 00:00:00+00:00
279875 149.0 1076 21 30 2013-08-02 01:00:00+00:00
309202 196.0 1576 23 59 2013-09-02 03:00:00+00:00
12×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 10 1 447 500 -13.0 614
3 │ 2013 11 1 5 2359 6.0 352
4 │ 2013 12 1 13 2359 14.0 446
5 │ 2013 2 1 456 500 -4.0 652 ⋯
6 │ 2013 3 1 4 2159 125.0 318
7 │ 2013 4 1 454 500 -6.0 636
8 │ 2013 5 1 9 1655 434.0 308
9 │ 2013 6 1 2 2359 3.0 341 ⋯
10 │ 2013 7 1 1 2029 212.0 236
11 │ 2013 8 1 12 2130 162.0 257
12 │ 2013 9 1 9 2359 10.0 343
12 columns omitted
Sample rows
sample_n (flights, 10 , replace = TRUE )
# A tibble: 10 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 2 1 1812 1815 -3 2016 2032
2 2013 6 7 2336 2245 51 120 2357
3 2013 11 4 2027 2025 2 2240 2242
4 2013 12 21 1220 1207 13 1356 1348
5 2013 3 13 1103 1101 2 1408 1433
6 2013 10 8 1049 1059 -10 1221 1242
7 2013 11 23 1657 1700 -3 2009 2011
8 2013 7 21 924 829 55 1133 1004
9 2013 9 9 1208 1205 3 1357 1418
10 2013 12 24 2130 2130 0 2353 2359
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Randomly select fraction of rows:
sample_frac (flights, 0.1 , replace = TRUE )
# A tibble: 33,678 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 8 28 1935 1935 0 2222 2237
2 2013 12 20 1443 1445 -2 1755 1813
3 2013 6 1 1614 1540 34 1800 1733
4 2013 9 10 1704 1700 4 1835 1817
5 2013 3 21 1421 1422 -1 1703 1715
6 2013 10 27 1055 1030 25 1335 1324
7 2013 8 30 1654 1700 -6 1849 1923
8 2013 1 22 955 1000 -5 1256 1254
9 2013 11 15 1825 1830 -5 2131 2138
10 2013 3 20 1643 1646 -3 1735 1751
# ℹ 33,668 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Sample n=10
rows.
flights.sample(n = 10 , axis = 0 , replace = True )
year month day dep_time sched_dep_time dep_delay arr_time \
19713 2013 1 23 1631.0 1635 -4.0 1853.0
37769 2013 10 12 1228.0 1235 -7.0 1414.0
111938 2013 2 1 1701.0 1708 -7.0 2009.0
56981 2013 11 2 755.0 800 -5.0 1057.0
73845 2013 11 20 1212.0 1220 -8.0 1356.0
21628 2013 1 25 1801.0 1810 -9.0 2152.0
15695 2013 1 18 1855.0 1859 -4.0 2157.0
111914 2013 2 1 1645.0 1645 0.0 1847.0
232371 2013 6 11 2358.0 2359 -1.0 345.0
50745 2013 10 26 1152.0 1200 -8.0 1439.0
sched_arr_time arr_delay carrier flight tailnum origin dest \
19713 1907 -14.0 EV 4705 N13989 EWR ATL
37769 1426 -12.0 EV 4649 N16954 EWR CLT
111938 2024 -15.0 UA 250 N486UA EWR LAX
56981 1111 -14.0 B6 1511 N760JB EWR RSW
73845 1355 1.0 MQ 3361 N816MQ JFK RDU
21628 2142 10.0 B6 173 N565JB JFK SJC
15695 2225 -28.0 B6 171 N621JB JFK SMF
111914 1900 -13.0 DL 1473 N338NB LGA MEM
232371 341 4.0 B6 727 N562JB JFK BQN
50745 1455 -16.0 AA 3 N328AA JFK LAX
air_time distance hour minute time_hour
19713 124.0 746 16 35 2013-01-23 21:00:00+00:00
37769 77.0 529 12 35 2013-10-12 16:00:00+00:00
111938 339.0 2454 17 8 2013-02-01 22:00:00+00:00
56981 159.0 1068 8 0 2013-11-02 12:00:00+00:00
73845 73.0 427 12 20 2013-11-20 17:00:00+00:00
21628 363.0 2569 18 10 2013-01-25 23:00:00+00:00
15695 342.0 2521 18 59 2013-01-18 23:00:00+00:00
111914 145.0 963 16 45 2013-02-01 21:00:00+00:00
232371 200.0 1576 23 59 2013-06-12 03:00:00+00:00
50745 323.0 2475 12 0 2013-10-26 16:00:00+00:00
Sample 10% rows:
flights.sample(frac = 0.1 , replace = True )
year month day dep_time sched_dep_time dep_delay arr_time \
147407 2013 3 13 608.0 610 -2.0 921.0
241844 2013 6 21 1911.0 1900 11.0 2202.0
55796 2013 10 31 1943.0 1935 8.0 2210.0
278900 2013 7 31 559.0 600 -1.0 735.0
303174 2013 8 25 1408.0 1411 -3.0 1525.0
... ... ... ... ... ... ... ...
278008 2013 7 30 730.0 730 0.0 1027.0
303850 2013 8 26 810.0 815 -5.0 1012.0
274960 2013 7 26 1911.0 1915 -4.0 2227.0
34148 2013 10 8 1552.0 1555 -3.0 1751.0
149781 2013 3 15 1350.0 1329 21.0 1535.0
sched_arr_time arr_delay carrier flight tailnum origin dest \
147407 915 6.0 AA 1837 N3HBAA LGA MIA
241844 2210 -8.0 DL 2159 N393DA JFK MCO
55796 2147 23.0 9E 3470 N926XJ JFK CVG
278900 752 -17.0 DL 731 N334NB LGA DTW
303174 1535 -10.0 EV 4171 N19554 EWR MSN
... ... ... ... ... ... ... ...
278008 1025 2.0 AA 33 N328AA JFK LAX
303850 1030 -18.0 9E 3538 N909XJ JFK MSP
274960 2205 22.0 AA 791 N3GWAA LGA DFW
34148 1805 -14.0 MQ 3075 N672MQ JFK CVG
149781 1526 9.0 EV 4100 N11187 EWR MEM
air_time distance hour minute time_hour
147407 167.0 1096 6 10 2013-03-13 10:00:00+00:00
241844 130.0 944 19 0 2013-06-21 23:00:00+00:00
55796 103.0 589 19 35 2013-10-31 23:00:00+00:00
278900 78.0 502 6 0 2013-07-31 10:00:00+00:00
303174 112.0 799 14 11 2013-08-25 18:00:00+00:00
... ... ... ... ... ...
278008 327.0 2475 7 30 2013-07-30 11:00:00+00:00
303850 146.0 1029 8 15 2013-08-26 12:00:00+00:00
274960 195.0 1389 19 15 2013-07-26 23:00:00+00:00
34148 93.0 589 15 55 2013-10-08 19:00:00+00:00
149781 144.0 946 13 29 2013-03-15 17:00:00+00:00
[33678 rows x 19 columns]
I’m not sure whether there’s a native function in DataFrames.jl for sampling.
Sample 10 rows:
rowidx = StatsBase.sample (1 : nrow (flights), 10 , replace = true );
flights[rowidx, : ]
10×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 9 4 2030 2034 -4.0 2134 ⋯
2 │ 2013 5 6 849 900 -11.0 1021
3 │ 2013 9 22 1201 1206 -5.0 1444
4 │ 2013 7 31 1135 829 186.0 1300
5 │ 2013 7 10 1204 1209 -5.0 1401 ⋯
6 │ 2013 9 18 857 900 -3.0 1210
7 │ 2013 7 7 1827 1830 -3.0 2136
8 │ 2013 10 4 1642 1645 -3.0 1756
9 │ 2013 6 19 1522 1527 -5.0 1646 ⋯
10 │ 2013 3 21 1310 1200 70.0 1427
12 columns omitted
Sample 10% rows:
rowidx = StatsBase.sample (
1 : nrow (flights),
round (Int , nrow (flights) * 0.1 ),
replace = true );
flights[rowidx, : ]
33678×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time s ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? I ⋯
───────┼────────────────────────────────────────────────────────────────────────
1 │ 2013 11 26 858 900 -2.0 1128 ⋯
2 │ 2013 1 11 1344 1350 -6.0 1558
3 │ 2013 2 23 1127 1059 28.0 1418
4 │ 2013 12 5 553 600 -7.0 919
5 │ 2013 5 25 1339 1340 -1.0 1600 ⋯
6 │ 2013 9 15 1324 1330 -6.0 1545
7 │ 2013 3 24 1911 1810 61.0 2250
8 │ 2013 4 6 713 715 -2.0 902
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
33672 │ 2013 7 12 2038 1935 63.0 2357 ⋯
33673 │ 2013 8 26 2117 2100 17.0 2254
33674 │ 2013 5 13 731 735 -4.0 1027
33675 │ 2013 7 11 946 825 81.0 1149
33676 │ 2013 7 5 650 655 -5.0 917 ⋯
33677 │ 2013 8 21 1214 1219 -5.0 1405
33678 │ 2013 1 6 954 955 -1.0 1233
12 columns and 33663 rows omitted
Select rows by position
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
slice_head (flights, n = 5 )
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
slice_tail (flights, n = 5 )
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1455 NA NA 1634
2 2013 9 30 NA 2200 NA NA 2312
3 2013 9 30 NA 1210 NA NA 1330
4 2013 9 30 NA 1159 NA NA 1344
5 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Top n
rows with the highest values:
# deprecated: top_n(flights, 5, wt = time_hour)
# This function is quick
slice_max (flights, n = 5 , order_by = time_hour) |>
print (width = Inf )
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 12 31 13 2359 14 439 437
2 2013 12 31 18 2359 19 449 444
3 2013 12 31 2328 2330 -2 412 409
4 2013 12 31 2355 2359 -4 430 440
5 2013 12 31 2356 2359 -3 436 445
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 B6 839 N566JB JFK BQN 189 1576 23 59
2 5 DL 412 N713TW JFK SJU 192 1598 23 59
3 3 B6 1389 N651JB EWR SJU 198 1608 23 30
4 -10 B6 1503 N509JB JFK SJU 195 1598 23 59
5 -9 B6 745 N665JB JFK PSE 200 1617 23 59
time_hour
<dttm>
1 2013-12-31 23:00:00
2 2013-12-31 23:00:00
3 2013-12-31 23:00:00
4 2013-12-31 23:00:00
5 2013-12-31 23:00:00
Bottom n
rows with lowest values:
# same as slice_max(flights, n = 5, order_by = desc(time_hour))
slice_min (flights, n = 5 , order_by = time_hour) |>
print (width = Inf )
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 558 -4 740 728
6 2013 1 1 559 559 0 702 706
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 12 UA 1696 N39463 EWR ORD 150 719 5 58
6 -4 B6 1806 N708JB JFK BOS 44 187 5 59
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 05:00:00
6 2013-01-01 05:00:00
slice_*
verbs apply to groups for grouped tibbles.
flights.iloc[range (0 , 5 )]
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
sched_arr_time arr_delay carrier flight tailnum origin dest air_time \
0 819 11.0 UA 1545 N14228 EWR IAH 227.0
1 830 20.0 UA 1714 N24211 LGA IAH 227.0
2 850 33.0 AA 1141 N619AA JFK MIA 160.0
3 1022 -18.0 B6 725 N804JB JFK BQN 183.0
4 837 -25.0 DL 461 N668DN LGA ATL 116.0
distance hour minute time_hour
0 1400 5 15 2013-01-01 10:00:00+00:00
1 1416 5 29 2013-01-01 10:00:00+00:00
2 1089 5 40 2013-01-01 10:00:00+00:00
3 1576 5 45 2013-01-01 10:00:00+00:00
4 762 6 0 2013-01-01 11:00:00+00:00
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
sched_arr_time arr_delay carrier flight tailnum origin dest air_time \
0 819 11.0 UA 1545 N14228 EWR IAH 227.0
1 830 20.0 UA 1714 N24211 LGA IAH 227.0
2 850 33.0 AA 1141 N619AA JFK MIA 160.0
3 1022 -18.0 B6 725 N804JB JFK BQN 183.0
4 837 -25.0 DL 461 N668DN LGA ATL 116.0
distance hour minute time_hour
0 1400 5 15 2013-01-01 10:00:00+00:00
1 1416 5 29 2013-01-01 10:00:00+00:00
2 1089 5 40 2013-01-01 10:00:00+00:00
3 1576 5 45 2013-01-01 10:00:00+00:00
4 762 6 0 2013-01-01 11:00:00+00:00
year month day dep_time sched_dep_time dep_delay arr_time \
336771 2013 9 30 NaN 1455 NaN NaN
336772 2013 9 30 NaN 2200 NaN NaN
336773 2013 9 30 NaN 1210 NaN NaN
336774 2013 9 30 NaN 1159 NaN NaN
336775 2013 9 30 NaN 840 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
336771 1634 NaN 9E 3393 NaN JFK DCA
336772 2312 NaN 9E 3525 NaN LGA SYR
336773 1330 NaN MQ 3461 N535MQ LGA BNA
336774 1344 NaN MQ 3572 N511MQ LGA CLE
336775 1020 NaN MQ 3531 N839MQ LGA RDU
air_time distance hour minute time_hour
336771 NaN 213 14 55 2013-09-30 18:00:00+00:00
336772 NaN 198 22 0 2013-10-01 02:00:00+00:00
336773 NaN 764 12 10 2013-09-30 16:00:00+00:00
336774 NaN 419 11 59 2013-09-30 15:00:00+00:00
336775 NaN 431 8 40 2013-09-30 12:00:00+00:00
Top n
rows with the highest values:
flights.nlargest(n = 5 , columns = 'time_hour' )
year month day dep_time sched_dep_time dep_delay arr_time \
110520 2013 12 31 13.0 2359 14.0 439.0
110521 2013 12 31 18.0 2359 19.0 449.0
111276 2013 12 31 2328.0 2330 -2.0 412.0
111278 2013 12 31 2355.0 2359 -4.0 430.0
111279 2013 12 31 2356.0 2359 -3.0 436.0
sched_arr_time arr_delay carrier flight tailnum origin dest \
110520 437 2.0 B6 839 N566JB JFK BQN
110521 444 5.0 DL 412 N713TW JFK SJU
111276 409 3.0 B6 1389 N651JB EWR SJU
111278 440 -10.0 B6 1503 N509JB JFK SJU
111279 445 -9.0 B6 745 N665JB JFK PSE
air_time distance hour minute time_hour
110520 189.0 1576 23 59 2014-01-01 04:00:00+00:00
110521 192.0 1598 23 59 2014-01-01 04:00:00+00:00
111276 198.0 1608 23 30 2014-01-01 04:00:00+00:00
111278 195.0 1598 23 59 2014-01-01 04:00:00+00:00
111279 200.0 1617 23 59 2014-01-01 04:00:00+00:00
Bottom n
rows with lowest values:
flights.nsmallest(n = 5 , columns = 'time_hour' )
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
5 2013 1 1 554.0 558 -4.0 740.0
sched_arr_time arr_delay carrier flight tailnum origin dest air_time \
0 819 11.0 UA 1545 N14228 EWR IAH 227.0
1 830 20.0 UA 1714 N24211 LGA IAH 227.0
2 850 33.0 AA 1141 N619AA JFK MIA 160.0
3 1022 -18.0 B6 725 N804JB JFK BQN 183.0
5 728 12.0 UA 1696 N39463 EWR ORD 150.0
distance hour minute time_hour
0 1400 5 15 2013-01-01 10:00:00+00:00
1 1416 5 29 2013-01-01 10:00:00+00:00
2 1089 5 40 2013-01-01 10:00:00+00:00
3 1576 5 45 2013-01-01 10:00:00+00:00
5 719 5 58 2013-01-01 10:00:00+00:00
I don’t think nlargest
and nsmallest
apply to grouped DataFrame. But I may be wrong.
5×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
12 columns omitted
5×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
12 columns omitted
5×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 9 30 missing 1455 missing missing ⋯
2 │ 2013 9 30 missing 2200 missing missing
3 │ 2013 9 30 missing 1210 missing missing
4 │ 2013 9 30 missing 1159 missing missing
5 │ 2013 9 30 missing 840 missing missing ⋯
12 columns omitted
Top n
rows with the highest values:
last (sort (flights, [: time_hour]), 5 )
5×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 12 31 13 2359 14.0 439 ⋯
2 │ 2013 12 31 18 2359 19.0 449
3 │ 2013 12 31 2328 2330 -2.0 412
4 │ 2013 12 31 2355 2359 -4.0 430
5 │ 2013 12 31 2356 2359 -3.0 436 ⋯
12 columns omitted
Bottom n
rows with lowest values:
first (sort (flights, [: time_hour]), 5 )
5×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 558 -4.0 740 ⋯
12 columns omitted
Arrange rows with arrange()
arrange (flights, year, month, day)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Note input order matters!
arrange (flights, day, month, year)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Sort in descending order:
arrange (flights, desc (arr_delay)) |>
print (width = Inf )
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 7 22 2257 759 898 121 1026
9 2013 12 5 756 1700 896 1058 2020
10 2013 5 3 1133 2055 878 1250 2215
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1272 HA 51 N384HA JFK HNL 640 4983 9 0
2 1127 MQ 3535 N504MQ JFK CMH 74 483 19 35
3 1109 MQ 3695 N517MQ EWR ORD 111 719 16 35
4 1007 AA 177 N338AA JFK SFO 354 2586 18 45
5 989 MQ 3075 N665MQ JFK CVG 96 589 16 0
6 931 DL 2391 N959DL JFK TPA 139 1005 19 0
7 915 DL 2119 N927DA LGA MSP 167 1020 8 10
8 895 DL 2047 N6716C LGA ATL 109 762 7 59
9 878 AA 172 N5DMAA EWR MIA 149 1085 17 0
10 875 MQ 3744 N523MQ EWR ORD 112 719 20 55
time_hour
<dttm>
1 2013-01-09 09:00:00
2 2013-06-15 19:00:00
3 2013-01-10 16:00:00
4 2013-09-20 18:00:00
5 2013-07-22 16:00:00
6 2013-04-10 19:00:00
7 2013-03-17 08:00:00
8 2013-07-22 07:00:00
9 2013-12-05 17:00:00
10 2013-05-03 20:00:00
# ℹ 336,766 more rows
By default, arrange
ignores grouping in grouped tibbles. Set .by_group = TRUE
to arrange within each group.
# What are the worst delays in each month?
flights |>
group_by (month) |>
arrange (desc (arr_delay), .by_group = TRUE ) |>
print (width = Inf )
# A tibble: 336,776 × 19
# Groups: month [12]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 1 10 1121 1635 1126 1239 1810
3 2013 1 1 848 1835 853 1001 1950
4 2013 1 13 1809 810 599 2054 1042
5 2013 1 16 1622 800 502 1911 1054
6 2013 1 23 1551 753 478 1812 1006
7 2013 1 1 2343 1724 379 314 1938
8 2013 1 10 1525 900 385 1713 1039
9 2013 1 25 15 1815 360 208 1958
10 2013 1 2 1607 1030 337 2003 1355
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1272 HA 51 N384HA JFK HNL 640 4983 9 0
2 1109 MQ 3695 N517MQ EWR ORD 111 719 16 35
3 851 MQ 3944 N942MQ JFK BWI 41 184 18 35
4 612 DL 269 N322NB JFK ATL 116 760 8 10
5 497 B6 517 N661JB EWR MCO 144 937 8 0
6 486 DL 2119 N326NB LGA MSP 166 1020 7 53
7 456 EV 4321 N21197 EWR MCI 222 1092 17 24
8 394 UA 544 N419UA LGA ORD 123 733 9 0
9 370 9E 4019 N8646A JFK RIC 56 288 18 15
10 368 AA 179 N324AA JFK SFO 346 2586 10 30
time_hour
<dttm>
1 2013-01-09 09:00:00
2 2013-01-10 16:00:00
3 2013-01-01 18:00:00
4 2013-01-13 08:00:00
5 2013-01-16 08:00:00
6 2013-01-23 07:00:00
7 2013-01-01 17:00:00
8 2013-01-10 09:00:00
9 2013-01-25 18:00:00
10 2013-01-02 10:00:00
# ℹ 336,766 more rows
flights.sort_values(by = 'arr_delay' )
year month day dep_time sched_dep_time dep_delay arr_time \
199668 2013 5 7 1715.0 1729 -14.0 1944.0
211124 2013 5 20 719.0 735 -16.0 951.0
198763 2013 5 6 1826.0 1830 -4.0 2045.0
195236 2013 5 2 1947.0 1949 -2.0 2209.0
196935 2013 5 4 1816.0 1820 -4.0 2017.0
... ... ... ... ... ... ... ...
336771 2013 9 30 NaN 1455 NaN NaN
336772 2013 9 30 NaN 2200 NaN NaN
336773 2013 9 30 NaN 1210 NaN NaN
336774 2013 9 30 NaN 1159 NaN NaN
336775 2013 9 30 NaN 840 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
199668 2110 -86.0 VX 193 N843VA EWR SFO
211124 1110 -79.0 VX 11 N840VA JFK SFO
198763 2200 -75.0 AA 269 N3KCAA JFK SEA
195236 2324 -75.0 UA 612 N851UA EWR LAX
196935 2131 -74.0 AS 7 N551AS EWR SEA
... ... ... ... ... ... ... ...
336771 1634 NaN 9E 3393 NaN JFK DCA
336772 2312 NaN 9E 3525 NaN LGA SYR
336773 1330 NaN MQ 3461 N535MQ LGA BNA
336774 1344 NaN MQ 3572 N511MQ LGA CLE
336775 1020 NaN MQ 3531 N839MQ LGA RDU
air_time distance hour minute time_hour
199668 315.0 2565 17 29 2013-05-07 21:00:00+00:00
211124 316.0 2586 7 35 2013-05-20 11:00:00+00:00
198763 289.0 2422 18 30 2013-05-06 22:00:00+00:00
195236 300.0 2454 19 49 2013-05-02 23:00:00+00:00
196935 281.0 2402 18 20 2013-05-04 22:00:00+00:00
... ... ... ... ... ...
336771 NaN 213 14 55 2013-09-30 18:00:00+00:00
336772 NaN 198 22 0 2013-10-01 02:00:00+00:00
336773 NaN 764 12 10 2013-09-30 16:00:00+00:00
336774 NaN 419 11 59 2013-09-30 15:00:00+00:00
336775 NaN 431 8 40 2013-09-30 12:00:00+00:00
[336776 rows x 19 columns]
Sort in descending order:
flights.sort_values(
by = 'arr_delay' ,
ascending = False
)
year month day dep_time sched_dep_time dep_delay arr_time \
7072 2013 1 9 641.0 900 1301.0 1242.0
235778 2013 6 15 1432.0 1935 1137.0 1607.0
8239 2013 1 10 1121.0 1635 1126.0 1239.0
327043 2013 9 20 1139.0 1845 1014.0 1457.0
270376 2013 7 22 845.0 1600 1005.0 1044.0
... ... ... ... ... ... ... ...
336771 2013 9 30 NaN 1455 NaN NaN
336772 2013 9 30 NaN 2200 NaN NaN
336773 2013 9 30 NaN 1210 NaN NaN
336774 2013 9 30 NaN 1159 NaN NaN
336775 2013 9 30 NaN 840 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
7072 1530 1272.0 HA 51 N384HA JFK HNL
235778 2120 1127.0 MQ 3535 N504MQ JFK CMH
8239 1810 1109.0 MQ 3695 N517MQ EWR ORD
327043 2210 1007.0 AA 177 N338AA JFK SFO
270376 1815 989.0 MQ 3075 N665MQ JFK CVG
... ... ... ... ... ... ... ...
336771 1634 NaN 9E 3393 NaN JFK DCA
336772 2312 NaN 9E 3525 NaN LGA SYR
336773 1330 NaN MQ 3461 N535MQ LGA BNA
336774 1344 NaN MQ 3572 N511MQ LGA CLE
336775 1020 NaN MQ 3531 N839MQ LGA RDU
air_time distance hour minute time_hour
7072 640.0 4983 9 0 2013-01-09 14:00:00+00:00
235778 74.0 483 19 35 2013-06-15 23:00:00+00:00
8239 111.0 719 16 35 2013-01-10 21:00:00+00:00
327043 354.0 2586 18 45 2013-09-20 22:00:00+00:00
270376 96.0 589 16 0 2013-07-22 20:00:00+00:00
... ... ... ... ... ...
336771 NaN 213 14 55 2013-09-30 18:00:00+00:00
336772 NaN 198 22 0 2013-10-01 02:00:00+00:00
336773 NaN 764 12 10 2013-09-30 16:00:00+00:00
336774 NaN 419 11 59 2013-09-30 15:00:00+00:00
336775 NaN 431 8 40 2013-09-30 12:00:00+00:00
[336776 rows x 19 columns]
To sort within groups (month
)
flights.sort_values(
by = ['month' , 'arr_delay' ],
ascending = [True , False ]
)
year month day dep_time sched_dep_time dep_delay arr_time \
7072 2013 1 9 641.0 900 1301.0 1242.0
8239 2013 1 10 1121.0 1635 1126.0 1239.0
151 2013 1 1 848.0 1835 853.0 1001.0
11063 2013 1 13 1809.0 810 599.0 2054.0
13654 2013 1 16 1622.0 800 502.0 1911.0
... ... ... ... ... ... ... ...
111291 2013 12 31 NaN 705 NaN NaN
111292 2013 12 31 NaN 825 NaN NaN
111293 2013 12 31 NaN 1615 NaN NaN
111294 2013 12 31 NaN 600 NaN NaN
111295 2013 12 31 NaN 830 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
7072 1530 1272.0 HA 51 N384HA JFK HNL
8239 1810 1109.0 MQ 3695 N517MQ EWR ORD
151 1950 851.0 MQ 3944 N942MQ JFK BWI
11063 1042 612.0 DL 269 N322NB JFK ATL
13654 1054 497.0 B6 517 N661JB EWR MCO
... ... ... ... ... ... ... ...
111291 931 NaN UA 1729 NaN EWR DEN
111292 1029 NaN US 1831 NaN JFK CLT
111293 1800 NaN MQ 3301 N844MQ LGA RDU
111294 735 NaN UA 219 NaN EWR ORD
111295 1154 NaN UA 443 NaN JFK LAX
air_time distance hour minute time_hour
7072 640.0 4983 9 0 2013-01-09 14:00:00+00:00
8239 111.0 719 16 35 2013-01-10 21:00:00+00:00
151 41.0 184 18 35 2013-01-01 23:00:00+00:00
11063 116.0 760 8 10 2013-01-13 13:00:00+00:00
13654 144.0 937 8 0 2013-01-16 13:00:00+00:00
... ... ... ... ... ...
111291 NaN 1605 7 5 2013-12-31 12:00:00+00:00
111292 NaN 541 8 25 2013-12-31 13:00:00+00:00
111293 NaN 431 16 15 2013-12-31 21:00:00+00:00
111294 NaN 719 6 0 2013-12-31 11:00:00+00:00
111295 NaN 2475 8 30 2013-12-31 13:00:00+00:00
[336776 rows x 19 columns]
Sort in ascending order:
sort (flights, [: arr_delay])
336776×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 5 7 1715 1729 -14.0 1944 ⋯
2 │ 2013 5 20 719 735 -16.0 951
3 │ 2013 5 2 1947 1949 -2.0 2209
4 │ 2013 5 6 1826 1830 -4.0 2045
5 │ 2013 5 4 1816 1820 -4.0 2017 ⋯
6 │ 2013 5 2 1926 1929 -3.0 2157
7 │ 2013 5 6 1753 1755 -2.0 2004
8 │ 2013 5 7 2054 2055 -1.0 2317
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 559 600 -1.0 missing ⋯
336771 │ 2013 9 30 missing 1842 missing missing
336772 │ 2013 9 30 missing 1455 missing missing
336773 │ 2013 9 30 missing 2200 missing missing
336774 │ 2013 9 30 missing 1210 missing missing ⋯
336775 │ 2013 9 30 missing 1159 missing missing
336776 │ 2013 9 30 missing 840 missing missing
12 columns and 336761 rows omitted
Sort in descending order:
sort (flights, [: arr_delay], rev = true )
336776×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 1525 1530 -5.0 1934 ⋯
2 │ 2013 1 1 1528 1459 29.0 2002
3 │ 2013 1 1 1740 1745 -5.0 2158
4 │ 2013 1 1 1807 1738 29.0 2251
5 │ 2013 1 1 1939 1840 59.0 29 ⋯
6 │ 2013 1 1 1952 1930 22.0 2358
7 │ 2013 1 1 2016 1930 46.0 missing
8 │ 2013 1 1 missing 1630 missing missing
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 5 13 657 700 -3.0 908 ⋯
336771 │ 2013 5 2 1926 1929 -3.0 2157
336772 │ 2013 5 4 1816 1820 -4.0 2017
336773 │ 2013 5 2 1947 1949 -2.0 2209
336774 │ 2013 5 6 1826 1830 -4.0 2045 ⋯
336775 │ 2013 5 20 719 735 -16.0 951
336776 │ 2013 5 7 1715 1729 -14.0 1944
12 columns and 336761 rows omitted
To sort within groups (month
):
sort (flights, [: month, order (: arr_delay, rev= true )])
336776×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 1525 1530 -5.0 1934 ⋯
2 │ 2013 1 1 1528 1459 29.0 2002
3 │ 2013 1 1 1740 1745 -5.0 2158
4 │ 2013 1 1 1807 1738 29.0 2251
5 │ 2013 1 1 1939 1840 59.0 29 ⋯
6 │ 2013 1 1 1952 1930 22.0 2358
7 │ 2013 1 1 2016 1930 46.0 missing
8 │ 2013 1 1 missing 1630 missing missing
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 12 28 625 630 -5.0 916 ⋯
336771 │ 2013 12 29 818 830 -12.0 1056
336772 │ 2013 12 29 1017 1025 -8.0 1304
336773 │ 2013 12 27 1622 1630 -8.0 1908
336774 │ 2013 12 27 1728 1729 -1.0 2011 ⋯
336775 │ 2013 12 27 1833 1836 -3.0 2050
336776 │ 2013 12 27 853 856 -3.0 1052
12 columns and 336761 rows omitted
Manipulate columns (variables)
Select columns with select()
Select columns by variable names:
select (flights, year, month, day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ℹ 336,766 more rows
flights[['year' , 'month' , 'day' ]]
year month day
0 2013 1 1
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
... ... ... ...
336771 2013 9 30
336772 2013 9 30
336773 2013 9 30
336774 2013 9 30
336775 2013 9 30
[336776 rows x 3 columns]
select (flights, [: year, : month, : day])
336776×3 DataFrame
Row │ year month day
│ Int64 Int64 Int64
────────┼─────────────────────
1 │ 2013 1 1
2 │ 2013 1 1
3 │ 2013 1 1
4 │ 2013 1 1
5 │ 2013 1 1
6 │ 2013 1 1
7 │ 2013 1 1
8 │ 2013 1 1
⋮ │ ⋮ ⋮ ⋮
336770 │ 2013 9 30
336771 │ 2013 9 30
336772 │ 2013 9 30
336773 │ 2013 9 30
336774 │ 2013 9 30
336775 │ 2013 9 30
336776 │ 2013 9 30
336761 rows omitted
Pull values of one column as a vector:
Not displayed because the vector is long.
# Following are same
flights.year
flights.loc[:, 'year' ]
# Return a vector
flights.year
# Return a vector
flights."year"
# Return a vector
flights[!, : year] # does not make a copy
# Return a vector
flights[!, "year" ] # does not make a copy
# Return a vector
flights[: , : year] # make a copy!
# Return a vector
flights[: , "year" ] # make a copy!
Select columns between two variables:
select (flights, year: day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ℹ 336,766 more rows
flights.loc[:, 'year' :'day' ]
year month day
0 2013 1 1
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
... ... ... ...
336771 2013 9 30
336772 2013 9 30
336773 2013 9 30
336774 2013 9 30
336775 2013 9 30
[336776 rows x 3 columns]
select (flights, Between (: year, : day))
336776×3 DataFrame
Row │ year month day
│ Int64 Int64 Int64
────────┼─────────────────────
1 │ 2013 1 1
2 │ 2013 1 1
3 │ 2013 1 1
4 │ 2013 1 1
5 │ 2013 1 1
6 │ 2013 1 1
7 │ 2013 1 1
8 │ 2013 1 1
⋮ │ ⋮ ⋮ ⋮
336770 │ 2013 9 30
336771 │ 2013 9 30
336772 │ 2013 9 30
336773 │ 2013 9 30
336774 │ 2013 9 30
336775 │ 2013 9 30
336776 │ 2013 9 30
336761 rows omitted
Select all columns except those between two variables:
select (flights, - (year: day))
# A tibble: 336,776 × 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <int> <int> <dbl> <chr>
1 517 515 2 830 819 11 UA
2 533 529 4 850 830 20 UA
3 542 540 2 923 850 33 AA
4 544 545 -1 1004 1022 -18 B6
5 554 600 -6 812 837 -25 DL
6 554 558 -4 740 728 12 UA
7 555 600 -5 913 854 19 B6
8 557 600 -3 709 723 -14 EV
9 557 600 -3 838 846 -8 B6
10 558 600 -2 753 745 8 AA
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights.drop(flights.loc[:, 'year' :'day' ].columns, axis = 1 )
dep_time sched_dep_time dep_delay arr_time sched_arr_time \
0 517.0 515 2.0 830.0 819
1 533.0 529 4.0 850.0 830
2 542.0 540 2.0 923.0 850
3 544.0 545 -1.0 1004.0 1022
4 554.0 600 -6.0 812.0 837
... ... ... ... ... ...
336771 NaN 1455 NaN NaN 1634
336772 NaN 2200 NaN NaN 2312
336773 NaN 1210 NaN NaN 1330
336774 NaN 1159 NaN NaN 1344
336775 NaN 840 NaN NaN 1020
arr_delay carrier flight tailnum origin dest air_time distance \
0 11.0 UA 1545 N14228 EWR IAH 227.0 1400
1 20.0 UA 1714 N24211 LGA IAH 227.0 1416
2 33.0 AA 1141 N619AA JFK MIA 160.0 1089
3 -18.0 B6 725 N804JB JFK BQN 183.0 1576
4 -25.0 DL 461 N668DN LGA ATL 116.0 762
... ... ... ... ... ... ... ... ...
336771 NaN 9E 3393 NaN JFK DCA NaN 213
336772 NaN 9E 3525 NaN LGA SYR NaN 198
336773 NaN MQ 3461 N535MQ LGA BNA NaN 764
336774 NaN MQ 3572 N511MQ LGA CLE NaN 419
336775 NaN MQ 3531 N839MQ LGA RDU NaN 431
hour minute time_hour
0 5 15 2013-01-01 10:00:00+00:00
1 5 29 2013-01-01 10:00:00+00:00
2 5 40 2013-01-01 10:00:00+00:00
3 5 45 2013-01-01 10:00:00+00:00
4 6 0 2013-01-01 11:00:00+00:00
... ... ... ...
336771 14 55 2013-09-30 18:00:00+00:00
336772 22 0 2013-10-01 02:00:00+00:00
336773 12 10 2013-09-30 16:00:00+00:00
336774 11 59 2013-09-30 15:00:00+00:00
336775 8 40 2013-09-30 12:00:00+00:00
[336776 rows x 16 columns]
select (flights, Not (Between (: year, : day)))
336776×16 DataFrame
Row │ dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_d ⋯
│ Int64? Int64 Float64? Int64? Int64 Float ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 517 515 2.0 830 819 ⋯
2 │ 533 529 4.0 850 830
3 │ 542 540 2.0 923 850
4 │ 544 545 -1.0 1004 1022 -
5 │ 554 600 -6.0 812 837 - ⋯
6 │ 554 558 -4.0 740 728
7 │ 555 600 -5.0 913 854
8 │ 557 600 -3.0 709 723 -
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2349 2359 -10.0 325 350 - ⋯
336771 │ missing 1842 missing missing 2019 missi
336772 │ missing 1455 missing missing 1634 missi
336773 │ missing 2200 missing missing 2312 missi
336774 │ missing 1210 missing missing 1330 missi ⋯
336775 │ missing 1159 missing missing 1344 missi
336776 │ missing 840 missing missing 1020 missi
11 columns and 336761 rows omitted
Select columns by positions:
select (flights, seq (1 , 10 , by = 2 ))
# A tibble: 336,776 × 5
year day sched_dep_time arr_time arr_delay
<int> <int> <int> <int> <dbl>
1 2013 1 515 830 11
2 2013 1 529 850 20
3 2013 1 540 923 33
4 2013 1 545 1004 -18
5 2013 1 600 812 -25
6 2013 1 558 740 12
7 2013 1 600 913 19
8 2013 1 600 709 -14
9 2013 1 600 838 -8
10 2013 1 600 753 8
# ℹ 336,766 more rows
flights.iloc[:, range (0 , 9 , 2 )]
year day sched_dep_time arr_time arr_delay
0 2013 1 515 830.0 11.0
1 2013 1 529 850.0 20.0
2 2013 1 540 923.0 33.0
3 2013 1 545 1004.0 -18.0
4 2013 1 600 812.0 -25.0
... ... ... ... ... ...
336771 2013 30 1455 NaN NaN
336772 2013 30 2200 NaN NaN
336773 2013 30 1210 NaN NaN
336774 2013 30 1159 NaN NaN
336775 2013 30 840 NaN NaN
[336776 rows x 5 columns]
336776×5 DataFrame
Row │ year day sched_dep_time arr_time arr_delay
│ Int64 Int64 Int64 Int64? Float64?
────────┼───────────────────────────────────────────────────
1 │ 2013 1 515 830 11.0
2 │ 2013 1 529 850 20.0
3 │ 2013 1 540 923 33.0
4 │ 2013 1 545 1004 -18.0
5 │ 2013 1 600 812 -25.0
6 │ 2013 1 558 740 12.0
7 │ 2013 1 600 913 19.0
8 │ 2013 1 600 709 -14.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
336770 │ 2013 30 2359 325 -25.0
336771 │ 2013 30 1842 missing missing
336772 │ 2013 30 1455 missing missing
336773 │ 2013 30 2200 missing missing
336774 │ 2013 30 1210 missing missing
336775 │ 2013 30 1159 missing missing
336776 │ 2013 30 840 missing missing
336761 rows omitted
Move variables to the start of data frame:
select (flights, time_hour, air_time, everything ())
# A tibble: 336,776 × 19
time_hour air_time year month day dep_time sched_dep_time
<dttm> <dbl> <int> <int> <int> <int> <int>
1 2013-01-01 05:00:00 227 2013 1 1 517 515
2 2013-01-01 05:00:00 227 2013 1 1 533 529
3 2013-01-01 05:00:00 160 2013 1 1 542 540
4 2013-01-01 05:00:00 183 2013 1 1 544 545
5 2013-01-01 06:00:00 116 2013 1 1 554 600
6 2013-01-01 05:00:00 150 2013 1 1 554 558
7 2013-01-01 06:00:00 158 2013 1 1 555 600
8 2013-01-01 06:00:00 53 2013 1 1 557 600
9 2013-01-01 06:00:00 140 2013 1 1 557 600
10 2013-01-01 06:00:00 138 2013 1 1 558 600
# ℹ 336,766 more rows
# ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>
Not sure what’s the optimal way to do this.
# Note time_hour is missing in Python dataframe
cols_to_move = ['arr_delay' , 'air_time' ]
flights[cols_to_move + [x for x in flights.columns if x not in cols_to_move]]
arr_delay air_time year month day dep_time sched_dep_time \
0 11.0 227.0 2013 1 1 517.0 515
1 20.0 227.0 2013 1 1 533.0 529
2 33.0 160.0 2013 1 1 542.0 540
3 -18.0 183.0 2013 1 1 544.0 545
4 -25.0 116.0 2013 1 1 554.0 600
... ... ... ... ... ... ... ...
336771 NaN NaN 2013 9 30 NaN 1455
336772 NaN NaN 2013 9 30 NaN 2200
336773 NaN NaN 2013 9 30 NaN 1210
336774 NaN NaN 2013 9 30 NaN 1159
336775 NaN NaN 2013 9 30 NaN 840
dep_delay arr_time sched_arr_time carrier flight tailnum origin \
0 2.0 830.0 819 UA 1545 N14228 EWR
1 4.0 850.0 830 UA 1714 N24211 LGA
2 2.0 923.0 850 AA 1141 N619AA JFK
3 -1.0 1004.0 1022 B6 725 N804JB JFK
4 -6.0 812.0 837 DL 461 N668DN LGA
... ... ... ... ... ... ... ...
336771 NaN NaN 1634 9E 3393 NaN JFK
336772 NaN NaN 2312 9E 3525 NaN LGA
336773 NaN NaN 1330 MQ 3461 N535MQ LGA
336774 NaN NaN 1344 MQ 3572 N511MQ LGA
336775 NaN NaN 1020 MQ 3531 N839MQ LGA
dest distance hour minute time_hour
0 IAH 1400 5 15 2013-01-01 10:00:00+00:00
1 IAH 1416 5 29 2013-01-01 10:00:00+00:00
2 MIA 1089 5 40 2013-01-01 10:00:00+00:00
3 BQN 1576 5 45 2013-01-01 10:00:00+00:00
4 ATL 762 6 0 2013-01-01 11:00:00+00:00
... ... ... ... ... ...
336771 DCA 213 14 55 2013-09-30 18:00:00+00:00
336772 SYR 198 22 0 2013-10-01 02:00:00+00:00
336773 BNA 764 12 10 2013-09-30 16:00:00+00:00
336774 CLE 419 11 59 2013-09-30 15:00:00+00:00
336775 RDU 431 8 40 2013-09-30 12:00:00+00:00
[336776 rows x 19 columns]
select (flights, : time_hour, : air_time, Not ([: time_hour, : air_time]))
336776×19 DataFrame
Row │ time_hour air_time year month day dep_time sched ⋯
│ DateTime Float64? Int64 Int64 Int64 Int64? Int64 ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013-01-01T10:00:00 227.0 2013 1 1 517 ⋯
2 │ 2013-01-01T10:00:00 227.0 2013 1 1 533
3 │ 2013-01-01T10:00:00 160.0 2013 1 1 542
4 │ 2013-01-01T10:00:00 183.0 2013 1 1 544
5 │ 2013-01-01T11:00:00 116.0 2013 1 1 554 ⋯
6 │ 2013-01-01T10:00:00 150.0 2013 1 1 554
7 │ 2013-01-01T11:00:00 158.0 2013 1 1 555
8 │ 2013-01-01T11:00:00 53.0 2013 1 1 557
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013-10-01T03:00:00 196.0 2013 9 30 2349 ⋯
336771 │ 2013-09-30T22:00:00 missing 2013 9 30 missing
336772 │ 2013-09-30T18:00:00 missing 2013 9 30 missing
336773 │ 2013-10-01T02:00:00 missing 2013 9 30 missing
336774 │ 2013-09-30T16:00:00 missing 2013 9 30 missing ⋯
336775 │ 2013-09-30T15:00:00 missing 2013 9 30 missing
336776 │ 2013-09-30T12:00:00 missing 2013 9 30 missing
13 columns and 336761 rows omitted
Add new variables with mutate()
A tibble with fewer columns.
flights_sml <-
select (flights, year: day, ends_with ("delay" ), distance, air_time)
flights_sml
# A tibble: 336,776 × 7
year month day dep_delay arr_delay distance air_time
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227
2 2013 1 1 4 20 1416 227
3 2013 1 1 2 33 1089 160
4 2013 1 1 -1 -18 1576 183
5 2013 1 1 -6 -25 762 116
6 2013 1 1 -4 12 719 150
7 2013 1 1 -5 19 1065 158
8 2013 1 1 -3 -14 229 53
9 2013 1 1 -3 -8 944 140
10 2013 1 1 -2 8 733 138
# ℹ 336,766 more rows
Is there better way?
import re
cols = ['year' , 'month' , 'day' ] + list (filter (re.compile (".*delay" ).match, flights.columns)) + ['distance' , 'air_time' ]
flights_sml = flights.loc[:, cols]
flights_sml
year month day dep_delay arr_delay distance air_time
0 2013 1 1 2.0 11.0 1400 227.0
1 2013 1 1 4.0 20.0 1416 227.0
2 2013 1 1 2.0 33.0 1089 160.0
3 2013 1 1 -1.0 -18.0 1576 183.0
4 2013 1 1 -6.0 -25.0 762 116.0
... ... ... ... ... ... ... ...
336771 2013 9 30 NaN NaN 213 NaN
336772 2013 9 30 NaN NaN 198 NaN
336773 2013 9 30 NaN NaN 764 NaN
336774 2013 9 30 NaN NaN 419 NaN
336775 2013 9 30 NaN NaN 431 NaN
[336776 rows x 7 columns]
flights_sml = select (flights, Between (: year, : day), r" .* delay $ ", : distance, :air_time )
336776×7 DataFrame
Row │ year month day dep_delay arr_delay distance air_time
│ Int64 Int64 Int64 Float64? Float64? Float64 Float64?
────────┼────────────────────────────────────────────────────────────────
1 │ 2013 1 1 2.0 11.0 1400.0 227.0
2 │ 2013 1 1 4.0 20.0 1416.0 227.0
3 │ 2013 1 1 2.0 33.0 1089.0 160.0
4 │ 2013 1 1 -1.0 -18.0 1576.0 183.0
5 │ 2013 1 1 -6.0 -25.0 762.0 116.0
6 │ 2013 1 1 -4.0 12.0 719.0 150.0
7 │ 2013 1 1 -5.0 19.0 1065.0 158.0
8 │ 2013 1 1 -3.0 -14.0 229.0 53.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
336770 │ 2013 9 30 -10.0 -25.0 1617.0 196.0
336771 │ 2013 9 30 missing missing 764.0 missing
336772 │ 2013 9 30 missing missing 213.0 missing
336773 │ 2013 9 30 missing missing 198.0 missing
336774 │ 2013 9 30 missing missing 764.0 missing
336775 │ 2013 9 30 missing missing 419.0 missing
336776 │ 2013 9 30 missing missing 431.0 missing
336761 rows omitted
336776×7 DataFrame
Row │ year month day dep_delay arr_delay distance air_time
│ Int64 Int64 Int64 Float64? Float64? Float64 Float64?
────────┼────────────────────────────────────────────────────────────────
1 │ 2013 1 1 2.0 11.0 1400.0 227.0
2 │ 2013 1 1 4.0 20.0 1416.0 227.0
3 │ 2013 1 1 2.0 33.0 1089.0 160.0
4 │ 2013 1 1 -1.0 -18.0 1576.0 183.0
5 │ 2013 1 1 -6.0 -25.0 762.0 116.0
6 │ 2013 1 1 -4.0 12.0 719.0 150.0
7 │ 2013 1 1 -5.0 19.0 1065.0 158.0
8 │ 2013 1 1 -3.0 -14.0 229.0 53.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
336770 │ 2013 9 30 -10.0 -25.0 1617.0 196.0
336771 │ 2013 9 30 missing missing 764.0 missing
336772 │ 2013 9 30 missing missing 213.0 missing
336773 │ 2013 9 30 missing missing 198.0 missing
336774 │ 2013 9 30 missing missing 764.0 missing
336775 │ 2013 9 30 missing missing 419.0 missing
336776 │ 2013 9 30 missing missing 431.0 missing
336761 rows omitted
Add variables gain
and speed
:
mutate (
flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
)
# A tibble: 336,776 × 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 370.
2 2013 1 1 4 20 1416 227 16 374.
3 2013 1 1 2 33 1089 160 31 408.
4 2013 1 1 -1 -18 1576 183 -17 517.
5 2013 1 1 -6 -25 762 116 -19 394.
6 2013 1 1 -4 12 719 150 16 288.
7 2013 1 1 -5 19 1065 158 24 404.
8 2013 1 1 -3 -14 229 53 -11 259.
9 2013 1 1 -3 -8 944 140 -5 405.
10 2013 1 1 -2 8 733 138 10 319.
# ℹ 336,766 more rows
flights_sml['gain' ] = flights_sml['arr_delay' ] - flights_sml['dep_delay' ]
flights_sml['speed' ] = flights_sml['distance' ] / flights_sml['air_time' ] * 60
flights_sml
year month day dep_delay arr_delay distance air_time gain \
0 2013 1 1 2.0 11.0 1400 227.0 9.0
1 2013 1 1 4.0 20.0 1416 227.0 16.0
2 2013 1 1 2.0 33.0 1089 160.0 31.0
3 2013 1 1 -1.0 -18.0 1576 183.0 -17.0
4 2013 1 1 -6.0 -25.0 762 116.0 -19.0
... ... ... ... ... ... ... ... ...
336771 2013 9 30 NaN NaN 213 NaN NaN
336772 2013 9 30 NaN NaN 198 NaN NaN
336773 2013 9 30 NaN NaN 764 NaN NaN
336774 2013 9 30 NaN NaN 419 NaN NaN
336775 2013 9 30 NaN NaN 431 NaN NaN
speed
0 370.044053
1 374.273128
2 408.375000
3 516.721311
4 394.137931
... ...
336771 NaN
336772 NaN
336773 NaN
336774 NaN
336775 NaN
[336776 rows x 9 columns]
Julia analog is transform
:
# Following are equivalent
transform (flights_sml, [: arr_delay, : dep_delay] => (- ) => : gain)
336776×8 DataFrame
Row │ year month day dep_delay arr_delay distance air_time gain ⋯
│ Int64 Int64 Int64 Float64? Float64? Float64 Float64? Floa ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 2.0 11.0 1400.0 227.0 ⋯
2 │ 2013 1 1 4.0 20.0 1416.0 227.0
3 │ 2013 1 1 2.0 33.0 1089.0 160.0
4 │ 2013 1 1 -1.0 -18.0 1576.0 183.0
5 │ 2013 1 1 -6.0 -25.0 762.0 116.0 ⋯
6 │ 2013 1 1 -4.0 12.0 719.0 150.0
7 │ 2013 1 1 -5.0 19.0 1065.0 158.0
8 │ 2013 1 1 -3.0 -14.0 229.0 53.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 -10.0 -25.0 1617.0 196.0 ⋯
336771 │ 2013 9 30 missing missing 764.0 missing miss
336772 │ 2013 9 30 missing missing 213.0 missing miss
336773 │ 2013 9 30 missing missing 198.0 missing miss
336774 │ 2013 9 30 missing missing 764.0 missing miss ⋯
336775 │ 2013 9 30 missing missing 419.0 missing miss
336776 │ 2013 9 30 missing missing 431.0 missing miss
1 column and 336761 rows omitted
insertcols! (flights_sml, : gain => flights.arr_delay - flights.dep_delay)
336776×8 DataFrame
Row │ year month day dep_delay arr_delay distance air_time gain ⋯
│ Int64 Int64 Int64 Float64? Float64? Float64 Float64? Floa ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 2.0 11.0 1400.0 227.0 ⋯
2 │ 2013 1 1 4.0 20.0 1416.0 227.0
3 │ 2013 1 1 2.0 33.0 1089.0 160.0
4 │ 2013 1 1 -1.0 -18.0 1576.0 183.0
5 │ 2013 1 1 -6.0 -25.0 762.0 116.0 ⋯
6 │ 2013 1 1 -4.0 12.0 719.0 150.0
7 │ 2013 1 1 -5.0 19.0 1065.0 158.0
8 │ 2013 1 1 -3.0 -14.0 229.0 53.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 -10.0 -25.0 1617.0 196.0 ⋯
336771 │ 2013 9 30 missing missing 764.0 missing miss
336772 │ 2013 9 30 missing missing 213.0 missing miss
336773 │ 2013 9 30 missing missing 198.0 missing miss
336774 │ 2013 9 30 missing missing 764.0 missing miss ⋯
336775 │ 2013 9 30 missing missing 419.0 missing miss
336776 │ 2013 9 30 missing missing 431.0 missing miss
1 column and 336761 rows omitted
Refer to columns that you’ve just created:
mutate (flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60 ,
gain_per_hour = gain / hours
)
# A tibble: 336,776 × 10
year month day dep_delay arr_delay distance air_time gain hours
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 3.78
2 2013 1 1 4 20 1416 227 16 3.78
3 2013 1 1 2 33 1089 160 31 2.67
4 2013 1 1 -1 -18 1576 183 -17 3.05
5 2013 1 1 -6 -25 762 116 -19 1.93
6 2013 1 1 -4 12 719 150 16 2.5
7 2013 1 1 -5 19 1065 158 24 2.63
8 2013 1 1 -3 -14 229 53 -11 0.883
9 2013 1 1 -3 -8 944 140 -5 2.33
10 2013 1 1 -2 8 733 138 10 2.3
# ℹ 336,766 more rows
# ℹ 1 more variable: gain_per_hour <dbl>
Not sure how to refer to columns in the same command.
Not sure how to do this, except using two lines.
# Following are equivalent
@pipe flights |>
transform (
_ ,
[: arr_delay, : dep_delay] => (- ) => : gain,
[: air_time] => (x -> x / 60 ) => : hours,
) |>
transform (
_,
[: gain, : hours] => ByRow (/ ) => : gain_per_hour
)
336776×22 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
6 │ 2013 1 1 554 558 -4.0 740
7 │ 2013 1 1 555 600 -5.0 913
8 │ 2013 1 1 557 600 -3.0 709
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 2349 2359 -10.0 325 ⋯
336771 │ 2013 9 30 missing 1842 missing missing
336772 │ 2013 9 30 missing 1455 missing missing
336773 │ 2013 9 30 missing 2200 missing missing
336774 │ 2013 9 30 missing 1210 missing missing ⋯
336775 │ 2013 9 30 missing 1159 missing missing
336776 │ 2013 9 30 missing 840 missing missing
15 columns and 336761 rows omitted
Only keep the new variables by transmute()
:
transmute (
flights,
gain = arr_delay - dep_delay,
hours = air_time / 60 ,
gain_per_hour = gain / hours
)
# A tibble: 336,776 × 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 9 3.78 2.38
2 16 3.78 4.23
3 31 2.67 11.6
4 -17 3.05 -5.57
5 -19 1.93 -9.83
6 16 2.5 6.4
7 24 2.63 9.11
8 -11 0.883 -12.5
9 -5 2.33 -2.14
10 10 2.3 4.35
# ℹ 336,766 more rows
mutate_all()
: apply funs to all columns.
mutate_all (data, funs (log (.), log2 (.)))
mutate_at()
: apply funs to specific columns.
mutate_at (data, vars (- Species), funs (log (.)))
mutate_if()
: apply funs of one type
mutate_if (data, is.numeric, funs (log (.)))
Summaries
Summaries with summarise()
summarise (flights, delay = mean (dep_delay, na.rm = TRUE ))
# A tibble: 1 × 1
delay
<dbl>
1 12.6
flights.agg({'dep_delay' : np.mean})
dep_delay 12.63907
dtype: float64
<string>:1: FutureWarning: The provided callable <function mean at 0xffff63fe85e0> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
combine (flights, : dep_delay => (x -> mean (skipmissing (x))) => : delay)
1×1 DataFrame
Row │ delay
│ Float64
─────┼─────────
1 │ 12.6391
Convert a tibble into a grouped tibble:
by_day <- group_by (flights, year, month, day) |>
print (width = Inf )
# A tibble: 336,776 × 19
# Groups: year, month, day [365]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 12 UA 1696 N39463 EWR ORD 150 719 5 58
7 19 B6 507 N516JB EWR FLL 158 1065 6 0
8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
9 -8 B6 79 N593JB JFK MCO 140 944 6 0
10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows
by_day = flights.groupby(['year' , 'month' , 'day' ])
by_day
<pandas.core.groupby.generic.DataFrameGroupBy object at 0xffff318000a0>
by_day = groupby (flights, [: year, : month, : day])
GroupedDataFrame with 365 groups based on keys: year, month, day
First Group (842 rows): year = 2013, month = 1, day = 1
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
841 │ 2013 1 1 missing 1500 missing missing
842 │ 2013 1 1 missing 600 missing missing
12 columns and 838 rows omitted
⋮
Last Group (776 rows): year = 2013, month = 12, day = 31
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 12 31 13 2359 14.0 439 ⋯
2 │ 2013 12 31 18 2359 19.0 449
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
776 │ 2013 12 31 missing 830 missing missing
12 columns and 773 rows omitted
GroupedDataFrame with 365 groups based on keys: year, month, day
First Group (842 rows): year = 2013, month = 1, day = 1
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
841 │ 2013 1 1 missing 1500 missing missing
842 │ 2013 1 1 missing 600 missing missing
12 columns and 838 rows omitted
⋮
Last Group (776 rows): year = 2013, month = 12, day = 31
Row │ year month day dep_time sched_dep_time dep_delay arr_time sch ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? Int ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 2013 12 31 13 2359 14.0 439 ⋯
2 │ 2013 12 31 18 2359 19.0 449
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
776 │ 2013 12 31 missing 830 missing missing
12 columns and 773 rows omitted
summarise (by_day, delay = mean (dep_delay, na.rm = TRUE ))
# A tibble: 365 × 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# ℹ 355 more rows
Pipe
Consider following analysis (find destinations excluding HNL
that have >20 flights, and calculate the average distances and arrival delay):
by_dest <- group_by (flights, dest)
delay <- summarise (by_dest, count = n (),
dist = mean (distance, na.rm = TRUE ),
delay = mean (arr_delay, na.rm = TRUE )
)
delay <- filter (delay, count > 20 , dest != "HNL" )
delay
# A tibble: 96 × 4
dest count dist delay
<chr> <int> <dbl> <dbl>
1 ABQ 254 1826 4.38
2 ACK 265 199 4.85
3 ALB 439 143 14.4
4 ATL 17215 757. 11.3
5 AUS 2439 1514. 6.02
6 AVL 275 584. 8.00
7 BDL 443 116 7.05
8 BGR 375 378 8.03
9 BHM 297 866. 16.9
10 BNA 6333 758. 11.8
# ℹ 86 more rows
Cleaner code using pipe |>
:
delays <- flights |>
group_by (dest) |>
summarise (
count = n (),
dist = mean (distance, na.rm = TRUE ),
delay = mean (arr_delay, na.rm = TRUE )
) |>
filter (count > 20 , dest != "HNL" )
delays
# A tibble: 96 × 4
dest count dist delay
<chr> <int> <dbl> <dbl>
1 ABQ 254 1826 4.38
2 ACK 265 199 4.85
3 ALB 439 143 14.4
4 ATL 17215 757. 11.3
5 AUS 2439 1514. 6.02
6 AVL 275 584. 8.00
7 BDL 443 116 7.05
8 BGR 375 378 8.03
9 BHM 297 866. 16.9
10 BNA 6333 758. 11.8
# ℹ 86 more rows
ggplot2 accepts pipe too.
delays |>
ggplot (mapping = aes (x = dist, y = delay)) +
geom_point (aes (size = count), alpha = 1 / 3 ) +
geom_smooth (se = FALSE ) +
labs (x = "Distance from NYC (miles)" ,
y = "Arrival delay (mins)" )
Other summary functions
Location: mean(x)
, median(x)
.
# Equivalent code using filter
# not_cancelled <- flights |>
# filter(!is.na(dep_delay), !is.na(arr_delay)) |>
# print(width = Inf)
not_cancelled <- flights |>
drop_na (dep_delay, arr_delay) |>
print (width = Inf )
# A tibble: 327,346 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 12 UA 1696 N39463 EWR ORD 150 719 5 58
7 19 B6 507 N516JB EWR FLL 158 1065 6 0
8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
9 -8 B6 79 N593JB JFK MCO 140 944 6 0
10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 327,336 more rows
not_cancelled |>
group_by (year, month, day) |>
summarise (
avg_delay1 = mean (arr_delay),
avg_delay2 = mean (arr_delay[arr_delay > 0 ]), # the average positive delay
)
# A tibble: 365 × 5
# Groups: year, month [12]
year month day avg_delay1 avg_delay2
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 12.7 32.5
2 2013 1 2 12.7 32.0
3 2013 1 3 5.73 27.7
4 2013 1 4 -1.93 28.3
5 2013 1 5 -1.53 22.6
6 2013 1 6 4.24 24.4
7 2013 1 7 -4.95 27.8
8 2013 1 8 -3.23 20.8
9 2013 1 9 -0.264 25.6
10 2013 1 10 -5.90 27.3
# ℹ 355 more rows
Question: why is the day
group dropped?
not_cancelled = flights.dropna(subset = ['dep_delay' , 'arr_delay' ])
not_cancelled
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
... ... ... ... ... ... ... ...
336765 2013 9 30 2240.0 2245 -5.0 2334.0
336766 2013 9 30 2240.0 2250 -10.0 2347.0
336767 2013 9 30 2241.0 2246 -5.0 2345.0
336768 2013 9 30 2307.0 2255 12.0 2359.0
336769 2013 9 30 2349.0 2359 -10.0 325.0
sched_arr_time arr_delay carrier flight tailnum origin dest \
0 819 11.0 UA 1545 N14228 EWR IAH
1 830 20.0 UA 1714 N24211 LGA IAH
2 850 33.0 AA 1141 N619AA JFK MIA
3 1022 -18.0 B6 725 N804JB JFK BQN
4 837 -25.0 DL 461 N668DN LGA ATL
... ... ... ... ... ... ... ...
336765 2351 -17.0 B6 1816 N354JB JFK SYR
336766 7 -20.0 B6 2002 N281JB JFK BUF
336767 1 -16.0 B6 486 N346JB JFK ROC
336768 2358 1.0 B6 718 N565JB JFK BOS
336769 350 -25.0 B6 745 N516JB JFK PSE
air_time distance hour minute time_hour
0 227.0 1400 5 15 2013-01-01 10:00:00+00:00
1 227.0 1416 5 29 2013-01-01 10:00:00+00:00
2 160.0 1089 5 40 2013-01-01 10:00:00+00:00
3 183.0 1576 5 45 2013-01-01 10:00:00+00:00
4 116.0 762 6 0 2013-01-01 11:00:00+00:00
... ... ... ... ... ...
336765 41.0 209 22 45 2013-10-01 02:00:00+00:00
336766 52.0 301 22 50 2013-10-01 02:00:00+00:00
336767 47.0 264 22 46 2013-10-01 02:00:00+00:00
336768 33.0 187 22 55 2013-10-01 02:00:00+00:00
336769 196.0 1617 23 59 2013-10-01 03:00:00+00:00
[327346 rows x 19 columns]
flights.groupby(['year' , 'month' , 'day' ]).agg(
avg_delay1 = ('arr_delay' , np.mean),
avg_delay2 = ('arr_delay' , lambda x: np.mean(x[x > 0 ]))
)
avg_delay1 avg_delay2
year month day
2013 1 1 12.651023 32.481562
2 12.692888 32.029907
3 5.733333 27.660870
4 -1.932819 28.309764
5 -1.525802 22.558824
6 4.236429 24.372703
7 -4.947312 27.761317
8 -3.227578 20.789091
9 -0.264278 25.634146
10 -5.898816 27.345455
11 -4.762268 26.159836
12 -13.016153 23.471698
13 14.931846 52.548913
14 3.640303 22.444444
15 0.425653 19.710027
16 34.247362 46.083458
17 6.492896 25.602679
18 1.841758 25.053824
19 -8.526080 24.289062
20 3.740077 29.029126
21 6.315965 32.243590
22 12.276836 33.883227
23 6.917793 33.051813
24 15.427313 42.088235
25 27.098976 55.736944
26 0.751118 29.058333
27 -1.308933 36.024793
28 9.319720 38.703608
29 -6.555811 30.133333
30 25.910804 57.428894
31 32.602854 51.590444
2 1 7.165198 30.290865
2 -4.788774 26.194595
3 0.025189 28.443299
4 7.507625 33.155172
5 6.766249 23.292735
6 -1.282511 29.103203
7 2.069189 26.704735
8 24.228571 35.454810
9 6.639175 37.007874
10 6.659176 38.445455
11 36.290094 62.482014
12 -2.144796 26.017182
13 0.026637 22.653740
14 4.777311 26.404651
15 0.769556 24.400524
16 0.203804 29.846405
17 -1.108434 35.627820
18 -2.730074 31.022989
19 12.819849 38.548807
20 3.049303 33.492537
21 7.915598 28.701493
22 10.978655 32.619522
23 14.711957 35.622871
24 1.093211 30.911184
25 1.371822 30.538682
26 1.715719 29.851064
27 31.252492 59.044828
28 -9.168940 23.614173
3 1 -0.690678 35.070796
2 -3.257294 32.088462
3 -6.398015 28.093385
4 -2.333333 28.114286
5 -3.888532 28.861345
6 6.972187 46.834437
7 16.230420 44.491342
8 85.862155 97.485315
9 -7.023936 34.135593
10 -2.080931 35.546125
11 -0.844969 27.080357
12 22.556049 50.529297
13 -3.333333 26.852201
14 3.667347 33.248082
15 0.590722 36.420245
16 11.131406 30.976019
17 6.551225 25.390909
18 41.291892 64.746457
19 26.489201 44.734921
20 2.374086 28.238845
21 3.508230 30.382653
22 3.284979 33.024457
23 13.767105 36.159036
24 11.166479 39.157407
25 10.370370 44.251208
26 -7.690476 21.831169
27 -6.323409 25.738318
28 -6.454918 32.087156
29 -9.594845 29.128834
30 -13.737598 28.849057
31 -0.762332 30.641577
4 1 10.847758 35.018828
2 4.581633 31.492308
3 0.103239 21.008000
4 5.278119 28.138085
5 5.644535 28.078261
6 -6.324641 27.707182
7 -4.072767 23.301653
8 -6.421053 24.801802
9 0.383090 32.686957
10 38.412311 94.130641
11 20.499475 47.626168
12 36.048140 56.411765
13 4.370757 25.724928
14 6.862637 31.294118
15 0.112348 26.695652
16 2.969664 46.575092
17 9.867089 40.985258
18 36.028481 64.895175
19 47.911697 80.048780
20 6.482759 31.100575
21 4.371460 28.240409
22 37.812166 56.491971
23 22.952329 40.163462
24 23.461373 50.996219
25 33.681250 45.829333
26 21.372188 42.631579
27 -12.046358 27.902985
28 -5.185919 26.915709
29 -3.382805 35.463768
30 -10.514166 23.260465
5 1 -8.381497 23.833333
2 -7.252317 31.076596
3 4.230530 42.539945
4 -8.537940 35.557377
5 -9.772026 30.853933
6 -7.578029 30.901141
7 -14.367089 24.167630
8 39.609183 69.049153
9 11.840415 36.572549
10 9.017562 37.834821
11 22.773826 61.501425
12 -6.227222 23.460076
13 -6.364198 29.882979
14 -11.105485 30.681818
15 -2.029598 34.597865
16 4.196939 34.097884
17 1.093878 29.060274
18 -10.738606 30.571429
19 19.513333 55.668817
20 5.846473 39.297030
21 4.079114 30.686420
22 27.593853 56.271881
23 61.970899 88.501792
24 24.257415 48.242583
25 0.053867 34.650655
26 -11.027473 19.185714
27 -9.673514 28.415301
28 1.101344 33.385075
29 0.267635 25.632432
30 -0.161389 38.012780
31 -4.627593 34.295082
6 1 -11.861148 26.272727
2 26.075518 72.118687
3 23.010460 49.567857
4 -0.390167 25.123494
5 -3.104384 30.104089
6 0.926272 25.843575
7 27.973481 52.932862
8 -3.540155 34.458937
9 -2.321903 27.858696
10 28.022293 57.427549
11 2.135135 36.368116
12 -0.101784 25.011494
13 63.753689 77.495957
14 18.910603 48.978346
15 -4.268939 37.066667
16 -0.941048 27.618893
17 29.360287 59.349040
18 37.648026 63.436242
19 7.245632 25.152475
20 -7.135765 20.145833
21 -3.197769 33.905797
22 -5.814907 31.156250
23 7.666667 43.331565
24 51.176808 92.093985
25 41.513684 73.197347
26 27.317410 59.994253
27 44.783296 88.440546
28 44.976852 78.172414
29 11.073325 39.922460
30 43.510278 75.887218
7 1 58.280502 76.901146
2 14.966486 42.913223
3 17.257467 48.211155
4 -12.099591 26.457364
5 -8.844743 33.886076
6 -7.516046 36.117647
7 40.306378 86.342213
8 29.648847 62.889908
9 31.334365 71.164773
10 59.626478 110.093439
11 16.524678 44.407484
12 20.116989 47.355993
13 13.539241 42.950980
14 -8.483766 42.163462
15 -9.598385 23.768061
16 3.467413 32.411330
17 4.868154 38.806701
18 12.832132 51.722477
19 11.723644 46.004545
20 8.832908 38.854342
21 15.459579 45.089087
22 62.763403 102.343470
23 44.959821 69.958132
24 14.073171 34.338156
25 17.404082 42.024390
26 10.337398 31.212355
27 0.248756 34.088462
28 49.831776 86.632163
29 11.453988 36.544699
30 -1.347870 28.543478
31 -0.892354 27.344937
8 1 35.989259 56.319508
2 8.565884 33.083512
3 9.414880 33.488000
4 4.085963 29.500000
5 2.009027 27.671916
6 -2.051724 27.094463
7 8.192153 31.287785
8 55.481163 78.624638
9 43.313641 62.330914
10 4.804483 29.857558
11 6.603896 29.613426
12 7.590313 26.316103
13 29.825107 44.047757
14 13.804656 29.067769
15 1.643863 27.269129
16 -1.842265 27.611285
17 -7.209781 25.130000
18 -1.126932 28.290429
19 -6.434211 22.300000
20 -8.207317 24.676596
21 -9.823649 22.323810
22 29.976744 53.612903
23 -0.666328 26.029674
24 -13.894942 32.000000
25 -15.030000 30.555556
26 -13.246660 27.606936
27 -11.759916 29.491713
28 35.203074 75.326848
29 -0.692387 28.875706
30 -12.035639 37.154762
31 -9.907738 31.934959
9 1 -9.390449 34.632353
2 45.518430 102.811364
3 -3.319194 27.251701
4 -15.092850 25.413793
5 -15.540373 26.259259
6 -17.895010 27.282609
7 -20.349854 23.984615
8 -11.545455 22.047120
9 -11.701417 20.015075
10 -9.033438 29.628141
11 7.982475 49.561308
12 58.912418 100.024540
13 13.457819 38.092885
14 -14.703377 30.863636
15 -9.751964 23.060773
16 -3.620549 20.321918
17 -10.257829 17.824859
18 -12.139319 36.620968
19 2.814508 46.939103
20 -4.093878 40.828571
21 -4.934783 29.631868
22 -1.705357 26.658703
23 -4.345491 24.832168
24 -9.909378 16.851648
25 -5.546680 19.596154
26 -4.643145 20.578231
27 -7.593939 21.831967
28 -16.371852 22.186916
29 -8.478022 27.164319
30 -11.641988 25.601227
10 1 -18.959375 24.242424
2 -13.080164 19.128571
3 4.260606 33.699739
4 -4.868421 26.890977
5 -9.113703 31.014815
6 -5.759300 23.773279
7 39.017260 72.483304
8 -6.169792 21.547893
9 -5.874871 24.808765
10 10.574359 36.680672
11 18.922995 51.251037
12 -9.684993 30.144000
13 -2.464365 25.470000
14 -4.509165 21.857651
15 -3.469045 25.148649
16 -3.857585 19.110368
17 6.635996 30.467849
18 3.928355 25.672289
19 -1.076358 17.848249
20 -1.563457 18.818462
21 -1.755826 16.853107
22 0.003122 21.531073
23 4.535052 21.614256
24 0.208925 20.438830
25 7.619919 32.589286
26 0.735988 27.076596
27 -5.429360 30.230088
28 -2.207354 22.406832
29 -3.220481 21.645062
30 -4.075335 22.381757
31 0.911572 24.745042
11 1 11.268162 34.353191
2 -2.871492 16.882609
3 -3.916667 24.677419
4 3.461934 20.141079
5 -5.169615 19.446970
6 -4.185685 21.406130
7 13.168050 31.615101
8 0.319714 21.318052
9 -8.877980 15.695035
10 1.540965 28.296296
11 -4.524004 24.914062
12 10.111111 29.271914
13 -10.976362 22.168889
14 -8.560529 22.403141
15 -4.681263 17.946996
16 -7.066104 23.722581
17 20.591014 53.404545
18 0.636458 28.200000
19 -2.766839 19.267101
20 -3.144764 24.118033
21 0.212487 21.065104
22 9.184662 30.304950
23 3.813514 23.990881
24 5.837472 35.022099
25 -5.129274 23.016260
26 6.232487 26.987041
27 17.534289 38.413445
28 -3.097792 31.231214
29 -12.180303 20.621359
30 -11.900468 26.166667
12 1 -0.976507 32.912752
2 0.450902 31.554878
3 2.222798 24.459259
4 -0.222222 28.406433
5 51.666255 80.835714
6 21.119469 37.510888
7 5.057057 27.187075
8 36.911801 56.160000
9 42.575556 53.073826
10 44.508796 52.194401
11 10.902439 24.905594
12 5.339937 23.463203
13 0.163032 22.141210
14 46.397504 59.978022
15 15.870098 38.986784
16 7.370681 27.813853
17 55.871856 60.467949
18 11.079832 26.595027
19 9.006211 31.040084
20 13.820166 35.249049
21 12.456929 36.855392
22 23.899549 45.648746
23 32.226042 50.033973
24 -1.043883 21.169611
25 -1.899301 30.095000
26 7.174194 34.548694
27 -0.148803 29.046832
28 -3.259533 25.607692
29 18.763825 47.256356
30 10.057712 31.243802
31 6.212121 24.455959
<string>:1: FutureWarning: The provided callable <function mean at 0xffff63fe85e0> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
not_cancelled = dropmissing (flights, [: dep_delay, : arr_delay])
327346×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64 Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
6 │ 2013 1 1 554 558 -4.0 740
7 │ 2013 1 1 555 600 -5.0 913
8 │ 2013 1 1 557 600 -3.0 709
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
327340 │ 2013 9 30 2235 2001 154.0 59 ⋯
327341 │ 2013 9 30 2237 2245 -8.0 2345
327342 │ 2013 9 30 2240 2245 -5.0 2334
327343 │ 2013 9 30 2240 2250 -10.0 2347
327344 │ 2013 9 30 2241 2246 -5.0 2345 ⋯
327345 │ 2013 9 30 2307 2255 12.0 2359
327346 │ 2013 9 30 2349 2359 -10.0 325
12 columns and 327331 rows omitted
327346×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64 Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
6 │ 2013 1 1 554 558 -4.0 740
7 │ 2013 1 1 555 600 -5.0 913
8 │ 2013 1 1 557 600 -3.0 709
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
327340 │ 2013 9 30 2235 2001 154.0 59 ⋯
327341 │ 2013 9 30 2237 2245 -8.0 2345
327342 │ 2013 9 30 2240 2245 -5.0 2334
327343 │ 2013 9 30 2240 2250 -10.0 2347
327344 │ 2013 9 30 2241 2246 -5.0 2345 ⋯
327345 │ 2013 9 30 2307 2255 12.0 2359
327346 │ 2013 9 30 2349 2359 -10.0 325
12 columns and 327331 rows omitted
@pipe not_cancelled |>
groupby (_, [: year, : month, : day]) |>
combine (
_,
: arr_delay => (x -> [(mean (x), mean (skipmissing (x[x .>= 0 ])))]) => [: avg_delay1, : avg_delay2]
)
365×5 DataFrame
Row │ year month day avg_delay1 avg_delay2
│ Int64 Int64 Int64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 2013 1 1 12.651 31.5907
2 │ 2013 1 2 12.6929 30.9314
3 │ 2013 1 3 5.73333 26.2351
4 │ 2013 1 4 -1.93282 27.1226
5 │ 2013 1 5 -1.5258 20.9727
6 │ 2013 1 6 4.23643 22.9284
7 │ 2013 1 7 -4.94731 25.8467
8 │ 2013 1 8 -3.22758 19.5788
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
359 │ 2013 12 25 -1.8993 29.0773
360 │ 2013 12 26 7.17419 33.0568
361 │ 2013 12 27 -0.148803 27.1054
362 │ 2013 12 28 -3.25953 24.4779
363 │ 2013 12 29 18.7638 46.5658
364 │ 2013 12 30 10.0577 29.4776
365 │ 2013 12 31 6.21212 23.1941
350 rows omitted
Spread: sd(x)
, IQR(x)
, mad(x)
.
# destinations with largest variation in distance
not_cancelled |>
group_by (dest) |>
summarise (distance_sd = sd (distance)) |>
arrange (desc (distance_sd))
# A tibble: 104 × 2
dest distance_sd
<chr> <dbl>
1 EGE 10.5
2 SAN 10.4
3 SFO 10.2
4 HNL 10.0
5 SEA 9.98
6 LAS 9.91
7 PDX 9.87
8 PHX 9.86
9 LAX 9.66
10 IND 9.46
# ℹ 94 more rows
flights.groupby(['dest' ]).agg(
distance_sd = ('distance' , np.std)
).sort_values('distance_sd' , ascending = False )
distance_sd
dest
EGE 10.549066
SAN 10.346588
SFO 10.216858
HNL 10.001783
SEA 9.979116
LAS 9.913856
PDX 9.880313
PHX 9.856062
LAX 9.657222
IND 9.455668
SAT 9.000096
CVG 8.992937
MEM 8.497715
HOU 8.476723
MCI 8.387283
AUS 8.318982
DFW 8.305072
CLE 8.282101
TYS 8.004084
IAH 8.001671
DEN 7.981494
STL 7.899533
BNA 7.814711
ORD 7.731902
PIT 7.731472
BUF 7.724076
DTW 7.581811
ROC 7.452263
SDF 7.367790
SLC 7.363589
GSO 7.352531
MSP 7.296561
SBN 7.229569
XNA 7.194400
ATL 7.155615
MSY 7.088342
CMH 7.071074
MDW 7.000700
DAY 6.988557
BWI 6.920632
MKE 6.918848
BOS 6.878423
CLT 6.824532
IAD 6.774477
RIC 6.497803
MSN 6.319128
DSM 6.275483
PWM 5.633014
DCA 5.588415
JAC 5.537749
RDU 5.378799
SYR 5.305292
STT 5.291656
GSP 5.248836
PBI 5.166350
CHS 5.141765
TPA 5.104954
MCO 5.079727
MHT 4.870863
TVC 4.705253
MIA 4.607839
RSW 4.589385
CAE 4.587971
FLL 4.492772
JAX 4.452836
OMA 4.415946
BQN 4.239051
ORF 4.043022
SAV 3.897944
SJU 3.869986
GRR 3.092497
AVL 3.000553
SRQ 2.929607
MYR 2.880434
PHL 2.696727
BTV 2.538673
BHM 0.058026
SMF 0.000000
SJC 0.000000
BDL 0.000000
BGR 0.000000
SNA 0.000000
TUL 0.000000
ALB 0.000000
ANC 0.000000
EYW 0.000000
PVD 0.000000
CAK 0.000000
HDN 0.000000
CRW 0.000000
ILM 0.000000
LGB 0.000000
ACK 0.000000
CHO 0.000000
MTJ 0.000000
PSP 0.000000
MVY 0.000000
OAK 0.000000
OKC 0.000000
BZN 0.000000
BUR 0.000000
PSE 0.000000
ABQ 0.000000
LEX NaN
LGA NaN
<string>:1: FutureWarning: The provided callable <function std at 0xffff63fe8700> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
@pipe flights |>
groupby (_, : dest) |>
combine (_, : distance => std => : distance_sd) |>
sort (_, : distance_sd, rev = true )
105×2 DataFrame
Row │ dest distance_sd
│ String Float64
─────┼─────────────────────
1 │ LEX NaN
2 │ LGA NaN
3 │ EGE 10.5491
4 │ SAN 10.3466
5 │ SFO 10.2169
6 │ HNL 10.0018
7 │ SEA 9.97912
8 │ LAS 9.91386
⋮ │ ⋮ ⋮
99 │ ACK 0.0
100 │ BGR 0.0
101 │ ABQ 0.0
102 │ ILM 0.0
103 │ MVY 0.0
104 │ CHO 0.0
105 │ ANC 0.0
90 rows omitted
Rank: min(x)
, quantile(x, 0.25)
, max(x)
.
# Earliest and latest flights on each day?
not_cancelled |>
group_by (year, month, day) |>
summarise (
first = min (dep_time),
last = max (dep_time)
)
# A tibble: 365 × 5
# Groups: year, month [12]
year month day first last
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# ℹ 355 more rows
not_cancelled.groupby(['year' , 'month' , 'day' ]).agg(
first = ('dep_time' , np.min ),
last = ('dep_time' , np.max )
)
first last
year month day
2013 1 1 517.0 2356.0
2 42.0 2354.0
3 32.0 2349.0
4 25.0 2358.0
5 14.0 2357.0
6 16.0 2355.0
7 49.0 2359.0
8 454.0 2351.0
9 2.0 2252.0
10 3.0 2320.0
11 11.0 2304.0
12 30.0 2359.0
13 1.0 2359.0
14 453.0 2353.0
15 453.0 2356.0
16 2.0 2358.0
17 453.0 2352.0
18 455.0 2359.0
19 456.0 2359.0
20 525.0 2356.0
21 455.0 2352.0
22 5.0 2358.0
23 158.0 2358.0
24 37.0 2350.0
25 15.0 2359.0
26 107.0 2352.0
27 523.0 2355.0
28 10.0 2356.0
29 448.0 2354.0
30 3.0 2354.0
31 1.0 2354.0
2 1 456.0 2355.0
2 3.0 2359.0
3 31.0 2352.0
4 453.0 2351.0
5 450.0 2355.0
6 6.0 2349.0
7 27.0 2400.0
8 458.0 1728.0
9 901.0 2358.0
10 455.0 2359.0
11 1.0 2400.0
12 17.0 2358.0
13 4.0 2353.0
14 7.0 2358.0
15 3.0 2358.0
16 4.0 2357.0
17 25.0 2356.0
18 7.0 2347.0
19 3.0 2355.0
20 10.0 2353.0
21 20.0 2356.0
22 9.0 2359.0
23 18.0 2355.0
24 1.0 2353.0
25 19.0 2357.0
26 6.0 2356.0
27 25.0 2348.0
28 457.0 2359.0
3 1 4.0 2357.0
2 43.0 2356.0
3 509.0 2353.0
4 11.0 2353.0
5 505.0 2351.0
6 4.0 2356.0
7 7.0 2351.0
8 1.0 2359.0
9 53.0 2345.0
10 6.0 2354.0
11 14.0 2355.0
12 2.0 2359.0
13 103.0 2355.0
14 9.0 2355.0
15 11.0 2400.0
16 37.0 2356.0
17 17.0 2321.0
18 1.0 2357.0
19 1.0 2357.0
20 31.0 2358.0
21 4.0 2352.0
22 37.0 2400.0
23 18.0 2349.0
24 2.0 2357.0
25 13.0 2400.0
26 18.0 2355.0
27 10.0 2359.0
28 4.0 2355.0
29 453.0 2357.0
30 6.0 2355.0
31 51.0 2358.0
4 1 454.0 2358.0
2 9.0 2400.0
3 2.0 2355.0
4 14.0 2400.0
5 1.0 2354.0
6 453.0 2359.0
7 30.0 2358.0
8 454.0 2355.0
9 449.0 2356.0
10 1.0 2357.0
11 25.0 2353.0
12 2.0 2357.0
13 59.0 2355.0
14 16.0 2357.0
15 2.0 2358.0
16 454.0 2359.0
17 8.0 2357.0
18 2.0 2358.0
19 11.0 2359.0
20 7.0 2400.0
21 8.0 2357.0
22 2.0 2352.0
23 4.0 2352.0
24 30.0 2358.0
25 5.0 2356.0
26 22.0 2358.0
27 453.0 2351.0
28 452.0 2352.0
29 2.0 2346.0
30 455.0 2351.0
5 1 9.0 2356.0
2 3.0 2352.0
3 32.0 2359.0
4 456.0 2359.0
5 446.0 2355.0
6 450.0 2359.0
7 2.0 2356.0
8 2.0 2358.0
9 9.0 2356.0
10 3.0 2325.0
11 2.0 2355.0
12 3.0 2359.0
13 16.0 2349.0
14 455.0 2357.0
15 48.0 2358.0
16 8.0 2358.0
17 11.0 2357.0
18 7.0 2356.0
19 2.0 2346.0
20 21.0 2357.0
21 110.0 2400.0
22 1.0 2356.0
23 4.0 2354.0
24 10.0 2358.0
25 1.0 2355.0
26 4.0 2358.0
27 455.0 2358.0
28 22.0 2356.0
29 457.0 2352.0
30 453.0 2358.0
31 33.0 2355.0
6 1 2.0 2359.0
2 14.0 2357.0
3 3.0 2353.0
4 452.0 2355.0
5 23.0 2359.0
6 454.0 2359.0
7 16.0 2359.0
8 2.0 2358.0
9 453.0 2356.0
10 7.0 2352.0
11 455.0 2358.0
12 5.0 2355.0
13 3.0 2355.0
14 24.0 2358.0
15 456.0 2356.0
16 14.0 2355.0
17 2.0 2400.0
18 3.0 2358.0
19 7.0 2357.0
20 1.0 2356.0
21 9.0 2354.0
22 3.0 2359.0
23 7.0 2354.0
24 1.0 2359.0
25 1.0 2357.0
26 13.0 2353.0
27 10.0 2359.0
28 2.0 2359.0
29 11.0 2342.0
30 12.0 2359.0
7 1 1.0 2359.0
2 12.0 2349.0
3 28.0 2359.0
4 11.0 2357.0
5 38.0 2358.0
6 32.0 2333.0
7 4.0 2400.0
8 2.0 2353.0
9 17.0 2359.0
10 17.0 2357.0
11 5.0 2357.0
12 5.0 2358.0
13 19.0 2400.0
14 6.0 2358.0
15 9.0 2357.0
16 19.0 2356.0
17 2.0 2400.0
18 9.0 2356.0
19 24.0 2354.0
20 6.0 2324.0
21 11.0 2356.0
22 1.0 2359.0
23 4.0 2355.0
24 6.0 2357.0
25 6.0 2357.0
26 9.0 2355.0
27 1.0 2358.0
28 1.0 2400.0
29 5.0 2359.0
30 3.0 2329.0
31 10.0 2352.0
8 1 12.0 2358.0
2 6.0 2357.0
3 2.0 2357.0
4 3.0 2321.0
5 8.0 2337.0
6 24.0 2358.0
7 453.0 2358.0
8 10.0 2359.0
9 4.0 2359.0
10 13.0 2400.0
11 4.0 2352.0
12 12.0 2358.0
13 208.0 2359.0
14 7.0 2338.0
15 1.0 2359.0
16 23.0 2358.0
17 7.0 2354.0
18 13.0 2359.0
19 1.0 2358.0
20 54.0 2400.0
21 10.0 2353.0
22 9.0 2350.0
23 9.0 2328.0
24 24.0 2359.0
25 3.0 2351.0
26 13.0 2258.0
27 8.0 2350.0
28 10.0 2358.0
29 21.0 2357.0
30 450.0 2359.0
31 455.0 2359.0
9 1 9.0 2352.0
2 8.0 2400.0
3 457.0 2357.0
4 446.0 2258.0
5 450.0 2352.0
6 8.0 2341.0
7 452.0 2246.0
8 509.0 2358.0
9 451.0 2354.0
10 453.0 2347.0
11 11.0 2351.0
12 2.0 2400.0
13 3.0 2352.0
14 453.0 2349.0
15 509.0 2351.0
16 453.0 2346.0
17 450.0 2352.0
18 450.0 2349.0
19 10.0 2353.0
20 9.0 2349.0
21 8.0 2306.0
22 518.0 2350.0
23 455.0 2342.0
24 455.0 2348.0
25 451.0 2358.0
26 451.0 2349.0
27 453.0 2346.0
28 451.0 2351.0
29 521.0 2356.0
30 453.0 2349.0
10 1 447.0 2349.0
2 449.0 2341.0
3 453.0 2355.0
4 7.0 2343.0
5 453.0 2354.0
6 519.0 2346.0
7 6.0 2356.0
8 141.0 2349.0
9 452.0 2346.0
10 457.0 2358.0
11 8.0 2359.0
12 457.0 2354.0
13 15.0 2352.0
14 451.0 2347.0
15 456.0 2355.0
16 454.0 2357.0
17 453.0 2357.0
18 9.0 2317.0
19 454.0 2356.0
20 517.0 2349.0
21 453.0 2356.0
22 454.0 2343.0
23 454.0 2355.0
24 453.0 2347.0
25 13.0 2353.0
26 451.0 2356.0
27 16.0 2352.0
28 453.0 2357.0
29 456.0 2353.0
30 455.0 2400.0
31 458.0 2357.0
11 1 5.0 2353.0
2 453.0 2358.0
3 519.0 2358.0
4 11.0 2357.0
5 4.0 2359.0
6 457.0 2355.0
7 457.0 2358.0
8 457.0 2357.0
9 52.0 2355.0
10 147.0 2354.0
11 453.0 2355.0
12 455.0 2356.0
13 1.0 2355.0
14 453.0 2359.0
15 3.0 2354.0
16 455.0 2359.0
17 3.0 2359.0
18 11.0 2355.0
19 59.0 2356.0
20 455.0 2355.0
21 6.0 2351.0
22 6.0 2351.0
23 504.0 2356.0
24 22.0 2358.0
25 5.0 2355.0
26 458.0 2357.0
27 28.0 2400.0
28 514.0 2356.0
29 518.0 2353.0
30 11.0 2354.0
12 1 13.0 2354.0
2 12.0 2358.0
3 457.0 2358.0
4 457.0 2355.0
5 16.0 2400.0
6 6.0 2357.0
7 3.0 2358.0
8 16.0 2354.0
9 11.0 2400.0
10 7.0 2348.0
11 459.0 2359.0
12 2.0 2356.0
13 15.0 2400.0
14 3.0 2359.0
15 5.0 2359.0
16 1.0 2332.0
17 4.0 2358.0
18 500.0 2358.0
19 17.0 2400.0
20 1.0 2358.0
21 2.0 2357.0
22 7.0 2358.0
23 11.0 2354.0
24 9.0 2358.0
25 456.0 2357.0
26 1.0 2355.0
27 2.0 2351.0
28 7.0 2358.0
29 3.0 2400.0
30 1.0 2356.0
31 13.0 2356.0
<string>:1: FutureWarning: The provided callable <function min at 0xffff641d3d90> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
<string>:1: FutureWarning: The provided callable <function max at 0xffff641d3c70> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
@pipe not_cancelled |>
groupby (_, [: year, : month, : day]) |>
combine (_, : dep_time => (x -> [extrema (x)]) => [: first, : last])
365×5 DataFrame
Row │ year month day first last
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 2013 1 1 517 2356
2 │ 2013 1 2 42 2354
3 │ 2013 1 3 32 2349
4 │ 2013 1 4 25 2358
5 │ 2013 1 5 14 2357
6 │ 2013 1 6 16 2355
7 │ 2013 1 7 49 2359
8 │ 2013 1 8 454 2351
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
359 │ 2013 12 25 456 2357
360 │ 2013 12 26 1 2355
361 │ 2013 12 27 2 2351
362 │ 2013 12 28 7 2358
363 │ 2013 12 29 3 2400
364 │ 2013 12 30 1 2356
365 │ 2013 12 31 13 2356
350 rows omitted
Position: first(x)
, nth(x, 2)
, last(x)
. Note unless the variable is sorted, first
is different from min
and last
is different from max
.
not_cancelled |>
group_by (year, month, day) |>
summarise (
first_dep = first (dep_time),
last_dep = last (dep_time)
)
# A tibble: 365 × 5
# Groups: year, month [12]
year month day first_dep last_dep
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# ℹ 355 more rows
not_cancelled.groupby(['year' , 'month' , 'day' ]).agg(
first_dep = ('dep_time' , lambda x: x.iloc[0 ]),
last_dep = ('dep_time' , lambda x: x.iloc[- 1 ]),
)
first_dep last_dep
year month day
2013 1 1 517.0 2356.0
2 42.0 2354.0
3 32.0 2349.0
4 25.0 2358.0
5 14.0 2357.0
6 16.0 2355.0
7 49.0 2359.0
8 454.0 2351.0
9 2.0 2252.0
10 3.0 2320.0
11 11.0 2304.0
12 30.0 2359.0
13 1.0 2359.0
14 453.0 2353.0
15 453.0 2356.0
16 2.0 2358.0
17 453.0 2352.0
18 455.0 2359.0
19 456.0 2359.0
20 525.0 2356.0
21 455.0 2352.0
22 5.0 2358.0
23 158.0 2358.0
24 37.0 2350.0
25 15.0 2359.0
26 107.0 2352.0
27 523.0 2355.0
28 10.0 2356.0
29 448.0 2354.0
30 3.0 2354.0
31 1.0 2354.0
2 1 456.0 2355.0
2 3.0 2359.0
3 31.0 2352.0
4 453.0 2351.0
5 450.0 2355.0
6 6.0 2349.0
7 27.0 2400.0
8 458.0 1728.0
9 901.0 2358.0
10 455.0 2359.0
11 1.0 2400.0
12 17.0 2358.0
13 4.0 2353.0
14 7.0 2358.0
15 3.0 2358.0
16 4.0 2357.0
17 25.0 2356.0
18 7.0 2347.0
19 3.0 2355.0
20 10.0 2353.0
21 20.0 2356.0
22 9.0 2359.0
23 18.0 2355.0
24 1.0 2353.0
25 19.0 2357.0
26 6.0 2356.0
27 25.0 2348.0
28 457.0 2359.0
3 1 4.0 2357.0
2 43.0 2356.0
3 509.0 2353.0
4 11.0 2353.0
5 505.0 2351.0
6 4.0 2356.0
7 7.0 2351.0
8 1.0 2359.0
9 53.0 2345.0
10 6.0 2354.0
11 14.0 2355.0
12 2.0 2359.0
13 103.0 2355.0
14 9.0 2355.0
15 11.0 2400.0
16 37.0 2356.0
17 17.0 2321.0
18 1.0 2357.0
19 1.0 2357.0
20 31.0 2358.0
21 4.0 2352.0
22 37.0 2400.0
23 18.0 2349.0
24 2.0 2357.0
25 13.0 2400.0
26 18.0 2355.0
27 10.0 2359.0
28 4.0 2355.0
29 453.0 2357.0
30 6.0 2355.0
31 51.0 2358.0
4 1 454.0 2358.0
2 9.0 2400.0
3 2.0 2355.0
4 14.0 2400.0
5 1.0 2354.0
6 453.0 2359.0
7 30.0 2358.0
8 454.0 2355.0
9 449.0 2356.0
10 1.0 2357.0
11 25.0 2353.0
12 2.0 2357.0
13 59.0 2355.0
14 16.0 2357.0
15 2.0 2358.0
16 454.0 2359.0
17 8.0 2357.0
18 2.0 2358.0
19 11.0 2359.0
20 7.0 2400.0
21 8.0 2357.0
22 2.0 2352.0
23 4.0 2352.0
24 30.0 2358.0
25 5.0 2356.0
26 22.0 2358.0
27 453.0 2351.0
28 452.0 2352.0
29 2.0 2346.0
30 455.0 2351.0
5 1 9.0 2356.0
2 3.0 2352.0
3 32.0 2359.0
4 456.0 2359.0
5 446.0 2355.0
6 450.0 2359.0
7 2.0 2356.0
8 2.0 2358.0
9 9.0 2356.0
10 3.0 2325.0
11 2.0 2355.0
12 3.0 2359.0
13 16.0 2349.0
14 455.0 2357.0
15 48.0 2358.0
16 8.0 2358.0
17 11.0 2357.0
18 7.0 2356.0
19 2.0 2346.0
20 21.0 2357.0
21 110.0 2400.0
22 1.0 2356.0
23 4.0 2354.0
24 10.0 2358.0
25 1.0 2355.0
26 4.0 2358.0
27 455.0 2358.0
28 22.0 2356.0
29 457.0 2352.0
30 453.0 2358.0
31 33.0 2355.0
6 1 2.0 2359.0
2 14.0 2357.0
3 3.0 2353.0
4 452.0 2355.0
5 23.0 2359.0
6 454.0 2359.0
7 16.0 2359.0
8 2.0 2358.0
9 453.0 2356.0
10 7.0 2352.0
11 455.0 2358.0
12 5.0 2355.0
13 3.0 2355.0
14 24.0 2358.0
15 456.0 2356.0
16 14.0 2355.0
17 2.0 2400.0
18 3.0 2358.0
19 7.0 2357.0
20 1.0 2356.0
21 9.0 2354.0
22 3.0 2359.0
23 7.0 2354.0
24 1.0 2359.0
25 1.0 2357.0
26 13.0 2353.0
27 10.0 2359.0
28 2.0 2359.0
29 11.0 2342.0
30 12.0 2359.0
7 1 1.0 2359.0
2 12.0 2349.0
3 28.0 2359.0
4 11.0 2357.0
5 38.0 2358.0
6 32.0 2333.0
7 4.0 2400.0
8 2.0 2353.0
9 17.0 2359.0
10 17.0 2357.0
11 5.0 2357.0
12 5.0 2358.0
13 19.0 2400.0
14 6.0 2358.0
15 9.0 2357.0
16 19.0 2356.0
17 2.0 2400.0
18 9.0 2356.0
19 24.0 2354.0
20 6.0 2324.0
21 11.0 2356.0
22 1.0 2359.0
23 4.0 2355.0
24 6.0 2357.0
25 6.0 2357.0
26 9.0 2355.0
27 1.0 2358.0
28 1.0 2400.0
29 5.0 2359.0
30 3.0 2329.0
31 10.0 2352.0
8 1 12.0 2358.0
2 6.0 2357.0
3 2.0 2357.0
4 3.0 2321.0
5 8.0 2337.0
6 24.0 2358.0
7 453.0 2358.0
8 10.0 2359.0
9 4.0 2359.0
10 13.0 2400.0
11 4.0 2352.0
12 12.0 2358.0
13 208.0 2359.0
14 7.0 2338.0
15 1.0 2359.0
16 23.0 2358.0
17 7.0 2354.0
18 13.0 2359.0
19 1.0 2358.0
20 54.0 2400.0
21 10.0 2353.0
22 9.0 2350.0
23 9.0 2328.0
24 24.0 2359.0
25 3.0 2351.0
26 13.0 2258.0
27 8.0 2350.0
28 10.0 2358.0
29 21.0 2357.0
30 450.0 2359.0
31 455.0 2359.0
9 1 9.0 2352.0
2 8.0 2400.0
3 457.0 2357.0
4 446.0 2258.0
5 450.0 2352.0
6 8.0 2341.0
7 452.0 2246.0
8 509.0 2358.0
9 451.0 2354.0
10 453.0 2347.0
11 11.0 2351.0
12 2.0 2400.0
13 3.0 2352.0
14 453.0 2349.0
15 509.0 2351.0
16 453.0 2346.0
17 450.0 2352.0
18 450.0 2349.0
19 10.0 2353.0
20 9.0 2349.0
21 8.0 2306.0
22 518.0 2350.0
23 455.0 2342.0
24 455.0 2348.0
25 451.0 2358.0
26 451.0 2349.0
27 453.0 2346.0
28 451.0 2351.0
29 521.0 2356.0
30 453.0 2349.0
10 1 447.0 2349.0
2 449.0 2341.0
3 453.0 2355.0
4 7.0 2343.0
5 453.0 2354.0
6 519.0 2346.0
7 6.0 2356.0
8 141.0 2349.0
9 452.0 2346.0
10 457.0 2358.0
11 8.0 2359.0
12 457.0 2354.0
13 15.0 2352.0
14 451.0 2347.0
15 456.0 2355.0
16 454.0 2357.0
17 453.0 2357.0
18 9.0 2317.0
19 454.0 2356.0
20 517.0 2349.0
21 453.0 2356.0
22 454.0 2343.0
23 454.0 2355.0
24 453.0 2347.0
25 13.0 2353.0
26 451.0 2356.0
27 16.0 2352.0
28 453.0 2357.0
29 456.0 2353.0
30 455.0 2400.0
31 458.0 2357.0
11 1 5.0 2353.0
2 453.0 2358.0
3 519.0 2358.0
4 11.0 2357.0
5 4.0 2359.0
6 457.0 2355.0
7 457.0 2358.0
8 457.0 2357.0
9 52.0 2355.0
10 147.0 2354.0
11 453.0 2355.0
12 455.0 2356.0
13 1.0 2355.0
14 453.0 2359.0
15 3.0 2354.0
16 455.0 2359.0
17 3.0 2359.0
18 11.0 2355.0
19 59.0 2356.0
20 455.0 2355.0
21 6.0 2351.0
22 6.0 2351.0
23 504.0 2356.0
24 22.0 2358.0
25 5.0 2355.0
26 458.0 2357.0
27 28.0 2400.0
28 514.0 2356.0
29 518.0 2353.0
30 11.0 2354.0
12 1 13.0 2354.0
2 12.0 2358.0
3 457.0 2358.0
4 457.0 2355.0
5 16.0 2400.0
6 6.0 2357.0
7 3.0 2358.0
8 16.0 2354.0
9 11.0 2400.0
10 7.0 2348.0
11 459.0 2359.0
12 2.0 2356.0
13 15.0 2400.0
14 3.0 2359.0
15 5.0 2359.0
16 1.0 2332.0
17 4.0 2358.0
18 500.0 2358.0
19 17.0 2400.0
20 1.0 2358.0
21 2.0 2357.0
22 7.0 2358.0
23 11.0 2354.0
24 9.0 2358.0
25 456.0 2357.0
26 1.0 2355.0
27 2.0 2351.0
28 7.0 2358.0
29 3.0 2400.0
30 1.0 2356.0
31 13.0 2356.0
@pipe not_cancelled |>
groupby (_, [: year, : month, : day]) |>
combine (
_,
: dep_time => first => : first_dep,
: dep_time => last => : last_dep
)
365×5 DataFrame
Row │ year month day first_dep last_dep
│ Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────
1 │ 2013 1 1 517 2356
2 │ 2013 1 2 42 2354
3 │ 2013 1 3 32 2349
4 │ 2013 1 4 25 2358
5 │ 2013 1 5 14 2357
6 │ 2013 1 6 16 2355
7 │ 2013 1 7 49 2359
8 │ 2013 1 8 454 2351
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
359 │ 2013 12 25 456 2357
360 │ 2013 12 26 1 2355
361 │ 2013 12 27 2 2351
362 │ 2013 12 28 7 2358
363 │ 2013 12 29 3 2400
364 │ 2013 12 30 1 2356
365 │ 2013 12 31 13 2356
350 rows omitted
Count: n(x)
, sum(!is.na(x))
, n_distinct(x)
.
# Which destinations have the most carriers?
not_cancelled |>
group_by (dest) |>
summarise (carriers = n_distinct (carrier)) |>
arrange (desc (carriers))
# A tibble: 104 × 2
dest carriers
<chr> <int>
1 ATL 7
2 BOS 7
3 CLT 7
4 ORD 7
5 TPA 7
6 AUS 6
7 DCA 6
8 DTW 6
9 IAD 6
10 MSP 6
# ℹ 94 more rows
Similarly
# which destination has most flights from NYC?
not_cancelled |>
count (dest) |>
arrange (desc (n))
# A tibble: 104 × 2
dest n
<chr> <int>
1 ATL 16837
2 ORD 16566
3 LAX 16026
4 BOS 15022
5 MCO 13967
6 CLT 13674
7 SFO 13173
8 FLL 11897
9 MIA 11593
10 DCA 9111
# ℹ 94 more rows
not_cancelled.groupby('dest' ).agg(
carriers = ('carrier' , lambda x: x.nunique(dropna = True ))
).sort_values('carriers' , ascending = False )
carriers
dest
ORD 7
CLT 7
ATL 7
BOS 7
TPA 7
DTW 6
PIT 6
STL 6
DCA 6
AUS 6
MSY 6
MSP 6
IAD 6
SFO 5
RDU 5
CLE 5
PHX 5
SEA 5
DEN 5
PHL 5
PBI 5
LAS 5
LAX 5
BNA 5
IND 5
SAN 4
JAX 4
FLL 4
RSW 4
MCO 4
MKE 4
DFW 4
CVG 4
SRQ 4
CHS 4
SJU 4
BUF 4
BWI 4
ORF 3
MIA 3
MEM 3
SYR 3
OMA 3
STT 3
BTV 3
MCI 3
PDX 3
CMH 3
PWM 3
SDF 3
ROC 3
SAT 3
MSN 2
SJC 2
MVY 2
RIC 2
TVC 2
TYS 2
SLC 2
XNA 2
MHT 2
HOU 2
AVL 2
BDL 2
BQN 2
CAE 2
DAY 2
DSM 2
GRR 2
GSO 2
GSP 2
HNL 2
EGE 2
IAH 2
JAC 2
CAK 1
ALB 1
ANC 1
TUL 1
MTJ 1
ACK 1
BGR 1
BHM 1
MYR 1
SNA 1
SMF 1
BUR 1
BZN 1
OAK 1
CHO 1
HDN 1
CRW 1
SBN 1
SAV 1
OKC 1
LGB 1
MDW 1
EYW 1
LEX 1
ILM 1
PVD 1
PSP 1
PSE 1
ABQ 1
@pipe not_cancelled |>
groupby (_, : dest) |>
combine (_, : carrier => length ∘ unique => : carriers) |>
sort (_, : carriers, rev = true )
104×2 DataFrame
Row │ dest carriers
│ String Int64
─────┼──────────────────
1 │ ATL 7
2 │ ORD 7
3 │ TPA 7
4 │ BOS 7
5 │ CLT 7
6 │ IAD 6
7 │ MSP 6
8 │ DTW 6
⋮ │ ⋮ ⋮
98 │ BGR 1
99 │ ABQ 1
100 │ ILM 1
101 │ SBN 1
102 │ LEX 1
103 │ CHO 1
104 │ ANC 1
89 rows omitted
Example: which aircraft flew most (in distance) in 2013?
not_cancelled |>
count (tailnum, wt = distance) |>
arrange (desc (n))
# A tibble: 4,037 × 2
tailnum n
<chr> <dbl>
1 N328AA 929090
2 N338AA 921172
3 N335AA 902271
4 N327AA 900482
5 N323AA 839468
6 N319AA 837924
7 N336AA 833136
8 N329AA 825826
9 N324AA 786159
10 N339AA 783648
# ℹ 4,027 more rows
not_cancelled.groupby('tailnum' ).agg(
total_distance = ('distance' , sum )
).sort_values('total_distance' , ascending = False )
total_distance
tailnum
N328AA 929090
N338AA 921172
N335AA 902271
N327AA 900482
N323AA 839468
... ...
N762SK 419
N824AS 296
N881AS 292
N746SK 229
N505SW 185
[4037 rows x 1 columns]
<string>:1: FutureWarning: The provided callable <built-in function sum> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
@pipe not_cancelled |>
groupby (_, : tailnum) |>
combine (_, : distance => sum ∘ skipmissing => : total_distance) |>
sort (_, : total_distance, rev = true )
4037×2 DataFrame
Row │ tailnum total_distance
│ String? Float64
──────┼─────────────────────────
1 │ N328AA 929090.0
2 │ N338AA 921172.0
3 │ N335AA 902271.0
4 │ N327AA 900482.0
5 │ N323AA 839468.0
6 │ N319AA 837924.0
7 │ N336AA 833136.0
8 │ N329AA 825826.0
⋮ │ ⋮ ⋮
4031 │ N772SK 419.0
4032 │ N776SK 419.0
4033 │ N785SK 419.0
4034 │ N824AS 296.0
4035 │ N881AS 292.0
4036 │ N746SK 229.0
4037 │ N505SW 185.0
4022 rows omitted
Example: How many flights left before 5am? (these usually indicate delayed flights from the previous day)
not_cancelled |>
group_by (year, month, day) |>
summarise (n_early = sum (dep_time < 500 )) |>
arrange (desc (n_early))
# A tibble: 365 × 4
# Groups: year, month [12]
year month day n_early
<int> <int> <int> <int>
1 2013 6 28 32
2 2013 4 10 30
3 2013 7 28 30
4 2013 3 18 29
5 2013 7 7 29
6 2013 7 10 27
7 2013 6 27 25
8 2013 6 13 24
9 2013 3 8 22
10 2013 7 22 22
# ℹ 355 more rows
not_cancelled.groupby(['year' , 'month' , 'day' ]).agg(
n_early = ('dep_time' , lambda x: sum (x < 500 ))
).sort_values('n_early' , ascending = False )
n_early
year month day
2013 6 28 32
7 28 30
4 10 30
7 7 29
3 18 29
7 10 27
6 27 25
13 24
7 22 22
3 8 22
8 8 22
6 24 21
9 12 21
4 19 20
5 23 15
12 17 15
5 8 14
7 1 14
2 11 14
8 9 13
1 13
3 12 13
7 8 12
6 25 12
4 22 12
6 18 12
7 23 11
5 22 11
6 2 10
4 18 10
7 19 10
6 26 10
7 12 9
5 19 9
6 30 9
3 7 9
7 2 9
1 13 9
31 9
12 23 9
5 9
8 22 9
9 2 9
7 17 8
4 17 8
5 10 8
4 23 8
8 28 8
5 11 8
7 21 8
10 7 7
12 8 7
14 7
7 16 7
18 7
2 19 7
4 12 7
12 27 7
6 10 7
29 7
3 17 6
6 23 6
5 20 6
11 17 6
3 24 6
7 3 6
4 24 6
1 25 6
12 16 6
2 27 6
7 11 6
12 30 6
2 15 6
12 21 5
12 5
6 7 5
8 11 5
12 10 5
6 17 5
8 16 5
5 2 5
12 9 5
7 13 5
9 5
6 5
3 10 5
2 23 5
5 16 4
12 22 4
4 29 4
8 5 4
6 20 4
21 4
22 4
11 21 4
7 31 4
9 20 4
11 4
8 26 4
23 4
21 4
7 20 4
25 4
8 14 4
7 27 4
10 11 4
12 31 4
3 6 4
1 3 4
4 5 4
3 19 4
1 4
2 26 4
24 4
3 26 4
4 14 4
11 4
8 6 3
12 13 3
3 16 3
1 11 3
8 4 3
3 21 3
23 3
6 14 3
8 12 3
3 25 3
28 3
8 3 3
6 16 3
8 10 3
3 15 3
7 15 3
3 11 3
7 26 3
2 16 3
17 3
18 3
20 3
7 24 3
2 22 3
7 14 3
12 7 3
1 3
1 16 3
7 30 3
4 3
1 2 3
8 17 3
1 10 3
2 14 3
5 26 3
1 4 3
11 1 3
4 3 3
5 3 3
12 26 3
19 3
1 24 3
5 24 3
25 3
8 29 3
1 5 3
12 29 3
4 25 3
1 9 3
4 13 3
7 29 2
9 19 2
11 30 2
4 2
2 7 2
8 20 2
12 2 2
1 22 2
9 21 2
10 25 2
4 2
18 2
8 2
11 5 2
9 6 2
1 23 2
11 24 2
1 30 2
28 2
2 13 2
11 19 2
8 15 2
2 2 2
4 15 2
8 13 2
11 18 2
9 2
8 27 2
1 26 2
2 6 2
9 1 2
11 15 2
13 2
8 19 2
2 2
11 22 2
8 25 2
2 21 2
3 30 2
5 31 2
12 15 2
5 28 2
3 31 2
4 2 2
5 18 2
17 2
15 2
1 7 2
5 13 2
12 2
1 6 2
5 9 2
4 7 2
5 7 2
12 24 2
5 1 2
12 28 2
4 26 2
21 2
20 2
6 1 2
4 4 2
10 13 2
2 28 2
3 4 2
12 6 2
3 13 2
6 3 2
3 14 2
20 2
22 2
6 19 2
3 2 2
7 5 2
6 12 2
5 2
3 27 2
10 19 1
14 1
15 1
31 1
12 1
30 1
16 1
17 1
29 1
21 1
28 1
27 1
12 4 1
10 26 1
12 25 1
10 23 1
22 1
24 1
1 17 1
11 2 1
16 1
12 11 1
11 20 1
1 12 1
14 1
8 1
12 20 1
1 15 1
18 1
11 25 1
6 1
12 3 1
11 26 1
14 1
27 1
12 1
11 1
10 1
8 1
7 1
4 16 1
1 19 1
8 24 1
6 8 1
9 1
11 1
15 1
3 9 1
2 25 1
12 1
10 1
8 1
8 7 1
2 5 1
4 1
3 1
10 10 1
8 18 1
3 29 1
6 6 1
4 1
5 5 1
4 27 1
28 1
9 1
30 1
8 1
5 4 1
6 1
30 1
4 6 1
5 14 1
21 1
4 1 1
5 27 1
29 1
1 29 1
2 1 1
9 14 1
28 1
26 1
25 1
24 1
23 1
30 1
1 21 1
9 18 1
8 30 1
9 17 1
16 1
10 1 1
2 1
9 27 1
13 1
10 1
10 3 1
9 9 1
10 5 1
9 7 1
5 1
4 1
10 9 1
9 3 1
8 31 1
10 20 0
6 0
1 20 0
9 29 0
1 27 0
11 3 0
12 18 0
9 22 0
15 0
3 5 0
3 0
2 9 0
11 29 0
28 0
9 8 0
11 23 0
1 1 0
@pipe not_cancelled |>
groupby (_, [: year, : month, : day]) |>
combine (_, : dep_time => (x -> sum (skipmissing (x .< 500 ))) => : n_early) |>
sort (_, : n_early, rev = true )
365×4 DataFrame
Row │ year month day n_early
│ Int64 Int64 Int64 Int64
─────┼──────────────────────────────
1 │ 2013 6 28 32
2 │ 2013 4 10 30
3 │ 2013 7 28 30
4 │ 2013 3 18 29
5 │ 2013 7 7 29
6 │ 2013 7 10 27
7 │ 2013 6 27 25
8 │ 2013 6 13 24
⋮ │ ⋮ ⋮ ⋮ ⋮
359 │ 2013 10 6 0
360 │ 2013 10 20 0
361 │ 2013 11 3 0
362 │ 2013 11 23 0
363 │ 2013 11 28 0
364 │ 2013 11 29 0
365 │ 2013 12 18 0
350 rows omitted
Example: What proportion of flights are delayed by more than an hour?
not_cancelled |>
group_by (year, month, day) |>
summarise (hour_perc = mean (arr_delay > 60 )) |>
arrange (desc (hour_perc))
# A tibble: 365 × 4
# Groups: year, month [12]
year month day hour_perc
<int> <int> <int> <dbl>
1 2013 3 8 0.525
2 2013 6 13 0.402
3 2013 7 1 0.401
4 2013 7 22 0.366
5 2013 5 23 0.361
6 2013 8 8 0.361
7 2013 9 12 0.350
8 2013 12 17 0.339
9 2013 7 10 0.332
10 2013 6 24 0.323
# ℹ 355 more rows
not_cancelled.groupby(['year' , 'month' , 'day' ]).agg(
hour_perc = ('arr_delay' , lambda x: np.mean(x > 60 ))
).sort_values('hour_perc' , ascending = False )
hour_perc
year month day
2013 3 8 0.525063
6 13 0.401816
7 1 0.401368
22 0.365967
5 23 0.361111
8 8 0.360603
9 12 0.350327
12 17 0.338922
7 10 0.332151
6 24 0.322618
7 28 0.311916
23 0.311384
8 9 0.309345
6 28 0.298611
30 0.293833
25 0.292632
9 2 0.286564
7 7 0.285877
10 7 0.285868
6 27 0.279910
12 5 0.276885
4 19 0.272936
8 28 0.270033
5 8 0.269877
12 10 0.269283
7 9 0.263158
4 22 0.259338
12 9 0.257778
14 0.254902
6 18 0.245614
4 10 0.243902
12 8 0.242236
3 18 0.241081
2 11 0.240566
4 18 0.239451
12 0.237418
7 8 0.233753
8 1 0.233083
6 10 0.227176
2 27 0.224806
6 17 0.224156
2 0.218027
26 0.217622
1 31 0.216409
8 22 0.214588
1 30 0.203518
12 23 0.200000
4 25 0.197917
5 19 0.195556
6 7 0.193370
1 16 0.191090
5 22 0.188804
3 19 0.182505
5 24 0.180085
3 12 0.179800
5 11 0.176387
8 13 0.174893
1 25 0.174061
6 3 0.170502
12 22 0.167043
7 12 0.164802
4 11 0.163694
23 0.161430
10 11 0.157219
4 24 0.156652
7 3 0.153450
12 29 0.148618
4 26 0.148262
1 13 0.147460
7 18 0.147271
6 14 0.145530
11 17 0.145161
3 7 0.144154
7 2 0.137297
25 0.133673
13 0.131646
2 8 0.125275
3 25 0.122334
7 21 0.121816
9 11 0.121577
7 11 0.119099
12 6 0.118363
1 24 0.114537
12 15 0.112745
10 10 0.109744
6 29 0.108723
11 27 0.108495
7 19 0.107472
3 6 0.107459
2 23 0.105978
19 0.102481
11 1 0.100427
3 24 0.100112
9 13 0.099794
12 20 0.099792
7 24 0.098577
4 1 0.098019
6 23 0.097588
5 10 0.097107
7 20 0.096939
12 21 0.096130
1 22 0.094915
11 7 0.092324
1 28 0.092182
3 23 0.090789
7 29 0.088957
5 20 0.088174
9 0.088083
1 2 0.085129
9 19 0.084974
2 10 0.083645
22 0.081110
4 17 0.080169
2 4 0.077342
12 26 0.076344
6 11 0.075884
2 9 0.075601
7 17 0.075051
5 3 0.074766
7 26 0.074187
8 14 0.073887
11 12 0.072690
12 30 0.072403
1 1 0.072202
10 25 0.071138
11 24 0.071106
2 1 0.070485
5 30 0.070480
8 2 0.070480
1 23 0.069820
4 2 0.068367
3 16 0.068331
12 19 0.067288
8 3 0.066835
10 3 0.066667
11 22 0.066599
1 21 0.066519
3 14 0.066327
4 16 0.066089
9 0.065762
8 7 0.065392
5 16 0.064286
3 15 0.063918
7 16 0.063136
10 17 0.062690
3 22 0.061728
4 14 0.061538
12 11 0.060445
1 20 0.058899
2 21 0.058761
8 10 0.058531
4 20 0.058355
3 1 0.058263
1 3 0.056667
5 21 0.055907
1 27 0.055831
17 0.055738
2 20 0.055734
3 21 0.055556
12 16 0.055497
4 4 0.055215
2 25 0.054025
3 10 0.053215
6 8 0.053109
12 27 0.053070
8 11 0.053030
2 26 0.052397
8 4 0.052231
7 14 0.051948
5 15 0.051797
4 0.051491
12 18 0.051471
2 0.051102
5 28 0.050672
8 12 0.050454
3 31 0.050448
4 21 0.050109
5 0.050051
7 27 0.049751
5 25 0.049724
3 17 0.048998
6 21 0.048682
19 0.048304
2 17 0.048193
9 20 0.047959
1 6 0.047045
3 20 0.047022
4 13 0.046997
2 5 0.046750
15 0.046512
3 2 0.046419
10 18 0.046418
5 17 0.045918
11 26 0.045685
5 31 0.045643
4 15 0.044534
2 14 0.044118
3 0.044081
4 29 0.044012
3 9 0.043883
8 16 0.043478
12 31 0.043478
8 15 0.043260
11 10 0.042649
2 7 0.042162
8 5 0.042126
2 16 0.042120
12 7 0.042042
7 30 0.041582
8 29 0.041152
3 4 0.040752
12 12 0.040667
3 0.040415
2 6 0.040359
24 0.040276
3 11 0.040041
1 4 0.039648
11 28 0.039432
2 18 0.039320
6 16 0.039301
9 3 0.039236
12 4 0.039153
1 0.038815
8 18 0.038631
6 0.038540
3 13 0.038422
10 31 0.038210
12 28 0.038130
5 6 0.037988
10 15 0.037775
12 25 0.037762
9 21 0.037681
5 2 0.037075
10 26 0.036873
6 12 0.036726
3 3 0.036384
7 31 0.036217
6 22 0.036025
5 13 0.036008
1 26 0.035768
8 30 0.035639
6 4 0.035565
11 18 0.035417
6 15 0.035354
6 0.035306
1 18 0.035165
5 0.034868
8 23 0.034588
10 27 0.034216
2 12 0.033937
3 28 0.033811
9 22 0.033482
7 6 0.033376
1 7 0.033333
5 29 0.033195
7 5 0.033007
2 2 0.032496
11 23 0.032432
9 10 0.032393
1 29 0.032221
3 5 0.032154
10 4 0.031377
6 5 0.031315
8 31 0.031250
2 13 0.031077
6 9 0.030973
8 17 0.030888
11 11 0.030644
5 14 0.030591
10 22 0.030177
13 0.030067
4 28 0.029703
11 8 0.029622
9 23 0.029382
29 0.027473
1 15 0.027242
14 0.027086
12 13 0.026999
10 12 0.026746
5 5 0.026432
7 15 0.026236
2 28 0.026233
11 19 0.025907
5 12 0.025872
9 18 0.025800
10 23 0.025773
11 3 0.025556
10 28 0.025536
8 20 0.025407
10 24 0.025355
9 1 0.025281
10 20 0.025164
1 11 0.025082
5 27 0.024865
10 5 0.024781
4 3 0.024291
10 8 0.023958
11 4 0.023663
4 6 0.023468
8 24 0.023346
25 0.023333
4 7 0.023153
8 27 0.022965
11 6 0.022822
5 18 0.022788
8 26 0.022610
12 24 0.022606
11 16 0.022504
9 15 0.022447
11 25 0.022436
10 6 0.021882
5 1 0.021830
10 9 0.021717
1 8 0.021300
9 27 0.021212
4 27 0.021192
1 19 0.020864
3 26 0.020704
11 2 0.020679
10 30 0.020640
4 8 0.020640
3 29 0.020619
9 14 0.020558
11 20 0.020534
3 27 0.020534
7 4 0.020464
11 14 0.020346
1 9 0.020157
11 30 0.019906
3 30 0.019582
8 19 0.019231
9 26 0.019153
11 21 0.019134
4 30 0.018888
9 8 0.018847
10 29 0.018809
14 0.018330
1 10 0.018299
11 13 0.017472
9 16 0.017294
10 21 0.017224
9 25 0.016598
10 19 0.016153
11 5 0.015609
9 30 0.015213
6 20 0.015198
1 0.014686
9 5 0.014493
10 16 0.014448
8 21 0.014271
11 15 0.014257
9 9 0.014170
4 0.013874
5 7 0.013713
11 29 0.013636
5 26 0.010989
9 28 0.010370
1 12 0.010279
9 7 0.010204
17 0.009395
10 1 0.009375
2 0.009250
9 6 0.008316
24 0.007376
11 9 0.007013
@pipe not_cancelled |>
groupby (_, [: year, : month, : day]) |>
combine (_, : arr_delay => (x -> mean (skipmissing (x .> 60 ))) => : hour_perc) |>
sort (_, : hour_perc, rev = true )
365×4 DataFrame
Row │ year month day hour_perc
│ Int64 Int64 Int64 Float64
─────┼─────────────────────────────────
1 │ 2013 3 8 0.525063
2 │ 2013 6 13 0.401816
3 │ 2013 7 1 0.401368
4 │ 2013 7 22 0.365967
5 │ 2013 5 23 0.361111
6 │ 2013 8 8 0.360603
7 │ 2013 9 12 0.350327
8 │ 2013 12 17 0.338922
⋮ │ ⋮ ⋮ ⋮ ⋮
359 │ 2013 9 7 0.0102041
360 │ 2013 9 17 0.00939457
361 │ 2013 10 1 0.009375
362 │ 2013 10 2 0.00924974
363 │ 2013 9 6 0.00831601
364 │ 2013 9 24 0.00737619
365 │ 2013 11 9 0.00701262
350 rows omitted
Grouped mutates (and filters)
Recall the flights_sml
tibble created earlier:
# A tibble: 336,776 × 7
year month day dep_delay arr_delay distance air_time
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227
2 2013 1 1 4 20 1416 227
3 2013 1 1 2 33 1089 160
4 2013 1 1 -1 -18 1576 183
5 2013 1 1 -6 -25 762 116
6 2013 1 1 -4 12 719 150
7 2013 1 1 -5 19 1065 158
8 2013 1 1 -3 -14 229 53
9 2013 1 1 -3 -8 944 140
10 2013 1 1 -2 8 733 138
# ℹ 336,766 more rows
year month day dep_delay arr_delay distance air_time gain \
0 2013 1 1 2.0 11.0 1400 227.0 9.0
1 2013 1 1 4.0 20.0 1416 227.0 16.0
2 2013 1 1 2.0 33.0 1089 160.0 31.0
3 2013 1 1 -1.0 -18.0 1576 183.0 -17.0
4 2013 1 1 -6.0 -25.0 762 116.0 -19.0
... ... ... ... ... ... ... ... ...
336771 2013 9 30 NaN NaN 213 NaN NaN
336772 2013 9 30 NaN NaN 198 NaN NaN
336773 2013 9 30 NaN NaN 764 NaN NaN
336774 2013 9 30 NaN NaN 419 NaN NaN
336775 2013 9 30 NaN NaN 431 NaN NaN
speed
0 370.044053
1 374.273128
2 408.375000
3 516.721311
4 394.137931
... ...
336771 NaN
336772 NaN
336773 NaN
336774 NaN
336775 NaN
[336776 rows x 9 columns]
336776×8 DataFrame
Row │ year month day dep_delay arr_delay distance air_time gain ⋯
│ Int64 Int64 Int64 Float64? Float64? Float64 Float64? Floa ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 2.0 11.0 1400.0 227.0 ⋯
2 │ 2013 1 1 4.0 20.0 1416.0 227.0
3 │ 2013 1 1 2.0 33.0 1089.0 160.0
4 │ 2013 1 1 -1.0 -18.0 1576.0 183.0
5 │ 2013 1 1 -6.0 -25.0 762.0 116.0 ⋯
6 │ 2013 1 1 -4.0 12.0 719.0 150.0
7 │ 2013 1 1 -5.0 19.0 1065.0 158.0
8 │ 2013 1 1 -3.0 -14.0 229.0 53.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 -10.0 -25.0 1617.0 196.0 ⋯
336771 │ 2013 9 30 missing missing 764.0 missing miss
336772 │ 2013 9 30 missing missing 213.0 missing miss
336773 │ 2013 9 30 missing missing 198.0 missing miss
336774 │ 2013 9 30 missing missing 764.0 missing miss ⋯
336775 │ 2013 9 30 missing missing 419.0 missing miss
336776 │ 2013 9 30 missing missing 431.0 missing miss
1 column and 336761 rows omitted
Find the worst members of each group:
flights_sml |>
group_by (year, month, day) |>
filter (rank (desc (arr_delay)) < 10 )
# A tibble: 3,306 × 7
# Groups: year, month, day [365]
year month day dep_delay arr_delay distance air_time
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 853 851 184 41
2 2013 1 1 290 338 1134 213
3 2013 1 1 260 263 266 46
4 2013 1 1 157 174 213 60
5 2013 1 1 216 222 708 121
6 2013 1 1 255 250 589 115
7 2013 1 1 285 246 1085 146
8 2013 1 1 192 191 199 44
9 2013 1 1 379 456 1092 222
10 2013 1 2 224 207 550 94
# ℹ 3,296 more rows
flights_sml.groupby(
['year' , 'month' , 'day' ]
)['arr_delay' ].nlargest(
n = 10
)
year month day
2013 1 1 151 851.0
834 456.0
649 338.0
673 263.0
801 250.0
...
12 31 111017 155.0
110962 142.0
110653 140.0
111217 136.0
111144 122.0
Name: arr_delay, Length: 3650, dtype: float64
@pipe flights_sml |>
dropmissing (_, : arr_delay) |>
groupby (_, [: year, : month, : day]) |>
combine (
_,
: arr_delay => (x -> x[x .>= partialsort (x, 10 , rev = true )])
)
3697×4 DataFrame
Row │ year month day arr_delay_function
│ Int64 Int64 Int64 Float64
──────┼─────────────────────────────────────────
1 │ 2013 1 1 851.0
2 │ 2013 1 1 338.0
3 │ 2013 1 1 263.0
4 │ 2013 1 1 166.0
5 │ 2013 1 1 174.0
6 │ 2013 1 1 222.0
7 │ 2013 1 1 250.0
8 │ 2013 1 1 246.0
⋮ │ ⋮ ⋮ ⋮ ⋮
3691 │ 2013 12 31 155.0
3692 │ 2013 12 31 231.0
3693 │ 2013 12 31 122.0
3694 │ 2013 12 31 222.0
3695 │ 2013 12 31 188.0
3696 │ 2013 12 31 136.0
3697 │ 2013 12 31 195.0
3682 rows omitted
Find all groups bigger than a threshold:
popular_dests <- flights |>
group_by (dest) |>
filter (n () > 365 ) |>
print (width = Inf )
# A tibble: 332,577 × 19
# Groups: dest [77]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 12 UA 1696 N39463 EWR ORD 150 719 5 58
7 19 B6 507 N516JB EWR FLL 158 1065 6 0
8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
9 -8 B6 79 N593JB JFK MCO 140 944 6 0
10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 332,567 more rows
popular_dests = flights.groupby('dest' ).filter (lambda x: len (x) > 365 )
popular_dests
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
... ... ... ... ... ... ... ...
336771 2013 9 30 NaN 1455 NaN NaN
336772 2013 9 30 NaN 2200 NaN NaN
336773 2013 9 30 NaN 1210 NaN NaN
336774 2013 9 30 NaN 1159 NaN NaN
336775 2013 9 30 NaN 840 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
0 819 11.0 UA 1545 N14228 EWR IAH
1 830 20.0 UA 1714 N24211 LGA IAH
2 850 33.0 AA 1141 N619AA JFK MIA
3 1022 -18.0 B6 725 N804JB JFK BQN
4 837 -25.0 DL 461 N668DN LGA ATL
... ... ... ... ... ... ... ...
336771 1634 NaN 9E 3393 NaN JFK DCA
336772 2312 NaN 9E 3525 NaN LGA SYR
336773 1330 NaN MQ 3461 N535MQ LGA BNA
336774 1344 NaN MQ 3572 N511MQ LGA CLE
336775 1020 NaN MQ 3531 N839MQ LGA RDU
air_time distance hour minute time_hour
0 227.0 1400 5 15 2013-01-01 10:00:00+00:00
1 227.0 1416 5 29 2013-01-01 10:00:00+00:00
2 160.0 1089 5 40 2013-01-01 10:00:00+00:00
3 183.0 1576 5 45 2013-01-01 10:00:00+00:00
4 116.0 762 6 0 2013-01-01 11:00:00+00:00
... ... ... ... ... ...
336771 NaN 213 14 55 2013-09-30 18:00:00+00:00
336772 NaN 198 22 0 2013-10-01 02:00:00+00:00
336773 NaN 764 12 10 2013-09-30 16:00:00+00:00
336774 NaN 419 11 59 2013-09-30 15:00:00+00:00
336775 NaN 431 8 40 2013-09-30 12:00:00+00:00
[332577 rows x 19 columns]
popular_dests = @pipe flights |>
groupby (_, : dest) |>
combine (_) do sdf
nrow (sdf) > 365 ? sdf : DataFrame ()
end
332577×19 DataFrame
Row │ dest year month day dep_time sched_dep_time dep_delay ar ⋯
│ String Int64 Int64 Int64 Int64? Int64 Float64? In ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ IAH 2013 1 1 517 515 2.0 ⋯
2 │ IAH 2013 1 1 533 529 4.0
3 │ IAH 2013 1 1 623 627 -4.0
4 │ IAH 2013 1 1 728 732 -4.0
5 │ IAH 2013 1 1 739 739 0.0 ⋯
6 │ IAH 2013 1 1 908 908 0.0
7 │ IAH 2013 1 1 1028 1026 2.0
8 │ IAH 2013 1 1 1044 1045 -1.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
332571 │ BGR 2013 9 27 2319 2205 74.0 ⋯
332572 │ BGR 2013 9 28 1125 1130 -5.0
332573 │ BGR 2013 9 28 1804 1810 -6.0
332574 │ BGR 2013 9 29 1056 1100 -4.0
332575 │ BGR 2013 9 29 2203 2205 -2.0 ⋯
332576 │ BGR 2013 9 30 1057 1100 -3.0
332577 │ BGR 2013 9 30 2203 2205 -2.0
12 columns and 332562 rows omitted
332577×19 DataFrame
Row │ dest year month day dep_time sched_dep_time dep_delay ar ⋯
│ String Int64 Int64 Int64 Int64? Int64 Float64? In ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ IAH 2013 1 1 517 515 2.0 ⋯
2 │ IAH 2013 1 1 533 529 4.0
3 │ IAH 2013 1 1 623 627 -4.0
4 │ IAH 2013 1 1 728 732 -4.0
5 │ IAH 2013 1 1 739 739 0.0 ⋯
6 │ IAH 2013 1 1 908 908 0.0
7 │ IAH 2013 1 1 1028 1026 2.0
8 │ IAH 2013 1 1 1044 1045 -1.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
332571 │ BGR 2013 9 27 2319 2205 74.0 ⋯
332572 │ BGR 2013 9 28 1125 1130 -5.0
332573 │ BGR 2013 9 28 1804 1810 -6.0
332574 │ BGR 2013 9 29 1056 1100 -4.0
332575 │ BGR 2013 9 29 2203 2205 -2.0 ⋯
332576 │ BGR 2013 9 30 1057 1100 -3.0
332577 │ BGR 2013 9 30 2203 2205 -2.0
12 columns and 332562 rows omitted
Standardise to compute per group metrics:
popular_dests <- popular_dests |>
filter (arr_delay > 0 ) |>
mutate (prop_delay = arr_delay / sum (arr_delay)) |>
select (year: day, dest, arr_delay, prop_delay) |>
print (width = Inf )
# A tibble: 131,106 × 6
# Groups: dest [77]
year month day dest arr_delay prop_delay
<int> <int> <int> <chr> <dbl> <dbl>
1 2013 1 1 IAH 11 0.000111
2 2013 1 1 IAH 20 0.000201
3 2013 1 1 MIA 33 0.000235
4 2013 1 1 ORD 12 0.0000424
5 2013 1 1 FLL 19 0.0000938
6 2013 1 1 ORD 8 0.0000283
7 2013 1 1 LAX 7 0.0000344
8 2013 1 1 DFW 31 0.000282
9 2013 1 1 ATL 12 0.0000400
10 2013 1 1 DTW 16 0.000116
# ℹ 131,096 more rows
popular_dests[popular_dests['arr_delay' ] > 0 ].groupby(
'dest'
).apply (
lambda x: x['arr_delay' ] / x['arr_delay' ].sum ()
)
dest
ALB 569 0.004175
784 0.004593
1536 0.007411
1757 0.008559
2425 0.004175
...
XNA 325031 0.006818
326901 0.001737
330832 0.000386
332101 0.007204
335551 0.000257
Name: arr_delay, Length: 131106, dtype: float64
@pipe popular_dests |>
dropmissing (_, : arr_delay) |>
subset (_, : arr_delay => x -> x .> 0 ) |>
groupby (_, : dest) |>
combine (_, : arr_delay => (x -> x ./ sum (x)) => : prop_delay)
131106×2 DataFrame
Row │ dest prop_delay
│ String Float64
────────┼─────────────────────
1 │ IAH 0.000110674
2 │ IAH 0.000201225
3 │ IAH 1.00613e-5
4 │ IAH 3.01838e-5
5 │ IAH 0.000261593
6 │ IAH 9.05515e-5
7 │ IAH 0.000110674
8 │ IAH 1.00613e-5
⋮ │ ⋮ ⋮
131100 │ BGR 0.000432277
131101 │ BGR 0.0054755
131102 │ BGR 0.00273775
131103 │ BGR 0.00345821
131104 │ BGR 0.00907781
131105 │ BGR 0.0018732
131106 │ BGR 0.00115274
131091 rows omitted
Combine tables
nycflights13 package has >1 tables:
We already know a lot about flights:
flights |> print (width = Inf )
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 12 UA 1696 N39463 EWR ORD 150 719 5 58
7 19 B6 507 N516JB EWR FLL 158 1065 6 0
8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
9 -8 B6 79 N593JB JFK MCO 140 944 6 0
10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
... ... ... ... ... ... ... ...
336771 2013 9 30 NaN 1455 NaN NaN
336772 2013 9 30 NaN 2200 NaN NaN
336773 2013 9 30 NaN 1210 NaN NaN
336774 2013 9 30 NaN 1159 NaN NaN
336775 2013 9 30 NaN 840 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
0 819 11.0 UA 1545 N14228 EWR IAH
1 830 20.0 UA 1714 N24211 LGA IAH
2 850 33.0 AA 1141 N619AA JFK MIA
3 1022 -18.0 B6 725 N804JB JFK BQN
4 837 -25.0 DL 461 N668DN LGA ATL
... ... ... ... ... ... ... ...
336771 1634 NaN 9E 3393 NaN JFK DCA
336772 2312 NaN 9E 3525 NaN LGA SYR
336773 1330 NaN MQ 3461 N535MQ LGA BNA
336774 1344 NaN MQ 3572 N511MQ LGA CLE
336775 1020 NaN MQ 3531 N839MQ LGA RDU
air_time distance hour minute time_hour
0 227.0 1400 5 15 2013-01-01 10:00:00+00:00
1 227.0 1416 5 29 2013-01-01 10:00:00+00:00
2 160.0 1089 5 40 2013-01-01 10:00:00+00:00
3 183.0 1576 5 45 2013-01-01 10:00:00+00:00
4 116.0 762 6 0 2013-01-01 11:00:00+00:00
... ... ... ... ... ...
336771 NaN 213 14 55 2013-09-30 18:00:00+00:00
336772 NaN 198 22 0 2013-10-01 02:00:00+00:00
336773 NaN 764 12 10 2013-09-30 16:00:00+00:00
336774 NaN 419 11 59 2013-09-30 15:00:00+00:00
336775 NaN 431 8 40 2013-09-30 12:00:00+00:00
[336776 rows x 19 columns]
336776×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 517 515 2.0 830 ⋯
2 │ 2013 1 1 533 529 4.0 850
3 │ 2013 1 1 542 540 2.0 923
4 │ 2013 1 1 544 545 -1.0 1004
5 │ 2013 1 1 554 600 -6.0 812 ⋯
6 │ 2013 1 1 554 558 -4.0 740
7 │ 2013 1 1 555 600 -5.0 913
8 │ 2013 1 1 557 600 -3.0 709
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 2349 2359 -10.0 325 ⋯
336771 │ 2013 9 30 missing 1842 missing missing
336772 │ 2013 9 30 missing 1455 missing missing
336773 │ 2013 9 30 missing 2200 missing missing
336774 │ 2013 9 30 missing 1210 missing missing ⋯
336775 │ 2013 9 30 missing 1159 missing missing
336776 │ 2013 9 30 missing 840 missing missing
12 columns and 336761 rows omitted
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
from nycflights13 import airlines
airlines
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.
5 EV ExpressJet Airlines Inc.
6 F9 Frontier Airlines Inc.
7 FL AirTran Airways Corporation
8 HA Hawaiian Airlines Inc.
9 MQ Envoy Air
10 OO SkyWest Airlines Inc.
11 UA United Air Lines Inc.
12 US US Airways Inc.
13 VX Virgin America
14 WN Southwest Airlines Co.
15 YV Mesa Airlines Inc.
airlines = rcopy (R"airlines" )
16×2 DataFrame
Row │ carrier name
│ String String
─────┼──────────────────────────────────────
1 │ 9E Endeavor Air Inc.
2 │ AA American Airlines Inc.
3 │ AS Alaska Airlines Inc.
4 │ B6 JetBlue Airways
5 │ DL Delta Air Lines Inc.
6 │ EV ExpressJet Airlines Inc.
7 │ F9 Frontier Airlines Inc.
8 │ FL AirTran Airways Corporation
9 │ HA Hawaiian Airlines Inc.
10 │ MQ Envoy Air
11 │ OO SkyWest Airlines Inc.
12 │ UA United Air Lines Inc.
13 │ US US Airways Inc.
14 │ VX Virgin America
15 │ WN Southwest Airlines Co.
16 │ YV Mesa Airlines Inc.
# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows
from nycflights13 import airports
airports
faa name lat lon alt tz dst \
0 04G Lansdowne Airport 41.130472 -80.619583 1044 -5 A
1 06A Moton Field Municipal Airport 32.460572 -85.680028 264 -6 A
2 06C Schaumburg Regional 41.989341 -88.101243 801 -6 A
3 06N Randall Airport 41.431912 -74.391561 523 -5 A
4 09J Jekyll Island Airport 31.074472 -81.427778 11 -5 A
... ... ... ... ... ... .. ..
1453 ZUN Black Rock 35.083228 -108.791778 6454 -7 A
1454 ZVE New Haven Rail Station 41.298669 -72.925992 7 -5 A
1455 ZWI Wilmington Amtrak Station 39.736667 -75.551667 0 -5 A
1456 ZWU Washington Union Station 38.897460 -77.006430 76 -5 A
1457 ZYP Penn Station 40.750500 -73.993500 35 -5 A
tzone
0 America/New_York
1 America/Chicago
2 America/Chicago
3 America/New_York
4 America/New_York
... ...
1453 America/Denver
1454 America/New_York
1455 America/New_York
1456 America/New_York
1457 America/New_York
[1458 rows x 8 columns]
airports = rcopy (R"airports" )
1458×8 DataFrame
Row │ faa name lat lon alt t ⋯
│ String String Float64 Float64 Float64 F ⋯
──────┼─────────────────────────────────────────────────────────────────────────
1 │ 04G Lansdowne Airport 41.1305 -80.6196 1044.0 ⋯
2 │ 06A Moton Field Municipal Airport 32.4606 -85.68 264.0
3 │ 06C Schaumburg Regional 41.9893 -88.1012 801.0
4 │ 06N Randall Airport 41.4319 -74.3916 523.0
5 │ 09J Jekyll Island Airport 31.0745 -81.4278 11.0 ⋯
6 │ 0A9 Elizabethton Municipal Airport 36.3712 -82.1734 1593.0
7 │ 0G6 Williams County Airport 41.4673 -84.5068 730.0
8 │ 0G7 Finger Lakes Regional Airport 42.8836 -76.7812 492.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
1452 │ ZTF Stamford Amtrak Station 41.0469 -73.5415 0.0 ⋯
1453 │ ZTY Boston Back Bay Station 42.3478 -71.075 20.0
1454 │ ZUN Black Rock 35.0832 -108.792 6454.0
1455 │ ZVE New Haven Rail Station 41.2987 -72.926 7.0
1456 │ ZWI Wilmington Amtrak Station 39.7367 -75.5517 0.0 ⋯
1457 │ ZWU Washington Union Station 38.8975 -77.0064 76.0
1458 │ ZYP Penn Station 40.7505 -73.9935 35.0
3 columns and 1443 rows omitted
# A tibble: 3,322 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,312 more rows
from nycflights13 import planes
planes
tailnum year type manufacturer \
0 N10156 2004.0 Fixed wing multi engine EMBRAER
1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE
2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE
3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE
4 N10575 2002.0 Fixed wing multi engine EMBRAER
... ... ... ... ...
3317 N997AT 2002.0 Fixed wing multi engine BOEING
3318 N997DL 1992.0 Fixed wing multi engine MCDONNELL DOUGLAS AIRCRAFT CO
3319 N998AT 2002.0 Fixed wing multi engine BOEING
3320 N998DL 1992.0 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
3321 N999DN 1992.0 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
model engines seats speed engine
0 EMB-145XR 2 55 NaN Turbo-fan
1 A320-214 2 182 NaN Turbo-fan
2 A320-214 2 182 NaN Turbo-fan
3 A320-214 2 182 NaN Turbo-fan
4 EMB-145LR 2 55 NaN Turbo-fan
... ... ... ... ... ...
3317 717-200 2 100 NaN Turbo-fan
3318 MD-88 2 142 NaN Turbo-fan
3319 717-200 2 100 NaN Turbo-fan
3320 MD-88 2 142 NaN Turbo-jet
3321 MD-88 2 142 NaN Turbo-jet
[3322 rows x 9 columns]
planes = rcopy (R"planes" )
3322×9 DataFrame
Row │ tailnum year type manufacturer ⋯
│ String Int64? String String ⋯
──────┼─────────────────────────────────────────────────────────────────────────
1 │ N10156 2004 Fixed wing multi engine EMBRAER ⋯
2 │ N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE
3 │ N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE
4 │ N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE
5 │ N10575 2002 Fixed wing multi engine EMBRAER ⋯
6 │ N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE
7 │ N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE
8 │ N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
3316 │ N996AT 2002 Fixed wing multi engine BOEING ⋯
3317 │ N996DL 1991 Fixed wing multi engine MCDONNELL DOUGLAS AIRCRAFT C
3318 │ N997AT 2002 Fixed wing multi engine BOEING
3319 │ N997DL 1992 Fixed wing multi engine MCDONNELL DOUGLAS AIRCRAFT C
3320 │ N998AT 2002 Fixed wing multi engine BOEING ⋯
3321 │ N998DL 1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATIO
3322 │ N999DN 1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATIO
6 columns and 3307 rows omitted
weather |>
print (width = Inf )
# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
wind_gust precip pressure visib time_hour
<dbl> <dbl> <dbl> <dbl> <dttm>
1 NA 0 1012 10 2013-01-01 01:00:00
2 NA 0 1012. 10 2013-01-01 02:00:00
3 NA 0 1012. 10 2013-01-01 03:00:00
4 NA 0 1012. 10 2013-01-01 04:00:00
5 NA 0 1012. 10 2013-01-01 05:00:00
6 NA 0 1012. 10 2013-01-01 06:00:00
7 NA 0 1012. 10 2013-01-01 07:00:00
8 NA 0 1012. 10 2013-01-01 08:00:00
9 NA 0 1013. 10 2013-01-01 09:00:00
10 NA 0 1012. 10 2013-01-01 10:00:00
# ℹ 26,105 more rows
from nycflights13 import weather
weather
origin year month day hour temp dewp humid wind_dir \
0 EWR 2013 1 1 1 39.02 26.06 59.37 270.0
1 EWR 2013 1 1 2 39.02 26.96 61.63 250.0
2 EWR 2013 1 1 3 39.02 28.04 64.43 240.0
3 EWR 2013 1 1 4 39.92 28.04 62.21 250.0
4 EWR 2013 1 1 5 39.02 28.04 64.43 260.0
... ... ... ... ... ... ... ... ... ...
26110 LGA 2013 12 30 14 35.96 19.94 51.78 340.0
26111 LGA 2013 12 30 15 33.98 17.06 49.51 330.0
26112 LGA 2013 12 30 16 32.00 15.08 49.19 340.0
26113 LGA 2013 12 30 17 30.92 12.92 46.74 320.0
26114 LGA 2013 12 30 18 28.94 10.94 46.41 330.0
wind_speed wind_gust precip pressure visib time_hour
0 10.35702 NaN 0.0 1012.0 10.0 2013-01-01T06:00:00Z
1 8.05546 NaN 0.0 1012.3 10.0 2013-01-01T07:00:00Z
2 11.50780 NaN 0.0 1012.5 10.0 2013-01-01T08:00:00Z
3 12.65858 NaN 0.0 1012.2 10.0 2013-01-01T09:00:00Z
4 12.65858 NaN 0.0 1011.9 10.0 2013-01-01T10:00:00Z
... ... ... ... ... ... ...
26110 13.80936 21.86482 0.0 1017.1 10.0 2013-12-30T19:00:00Z
26111 17.26170 21.86482 0.0 1018.8 10.0 2013-12-30T20:00:00Z
26112 14.96014 23.01560 0.0 1019.5 10.0 2013-12-30T21:00:00Z
26113 17.26170 NaN 0.0 1019.9 10.0 2013-12-30T22:00:00Z
26114 18.41248 NaN 0.0 1020.9 10.0 2013-12-30T23:00:00Z
[26115 rows x 15 columns]
weather = rcopy (R"weather" )
26115×15 DataFrame
Row │ origin year month day hour temp dewp humid win ⋯
│ String Int64 Int64 Int64 Int64 Float64? Float64? Float64? Flo ⋯
───────┼────────────────────────────────────────────────────────────────────────
1 │ EWR 2013 1 1 1 39.02 26.06 59.37 ⋯
2 │ EWR 2013 1 1 2 39.02 26.96 61.63
3 │ EWR 2013 1 1 3 39.02 28.04 64.43
4 │ EWR 2013 1 1 4 39.92 28.04 62.21
5 │ EWR 2013 1 1 5 39.02 28.04 64.43 ⋯
6 │ EWR 2013 1 1 6 37.94 28.04 67.21
7 │ EWR 2013 1 1 7 39.02 28.04 64.43
8 │ EWR 2013 1 1 8 39.92 28.04 62.21
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
26109 │ LGA 2013 12 30 12 37.94 23.0 54.51 ⋯
26110 │ LGA 2013 12 30 13 37.04 21.92 53.97
26111 │ LGA 2013 12 30 14 35.96 19.94 51.78
26112 │ LGA 2013 12 30 15 33.98 17.06 49.51
26113 │ LGA 2013 12 30 16 32.0 15.08 49.19 ⋯
26114 │ LGA 2013 12 30 17 30.92 12.92 46.74
26115 │ LGA 2013 12 30 18 28.94 10.94 46.41
7 columns and 26100 rows omitted
Combine variables (columns)
Demo tables
(x <- tribble (
~ key, ~ val_x,
1 , "x1" ,
2 , "x2" ,
3 , "x3"
))
# A tibble: 3 × 2
key val_x
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
(y <- tribble (
~ key, ~ val_y,
1 , "y1" ,
2 , "y2" ,
4 , "y3"
))
# A tibble: 3 × 2
key val_y
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y3
x = pd.DataFrame({
'key' : [1 , 2 , 4 ],
'val_x' : ['x1' , 'x2' , 'x3' ]
})
x
key val_x
0 1 x1
1 2 x2
2 4 x3
y = pd.DataFrame({
'key' : [1 , 2 , 3 ],
'val_y' : ['y1' , 'y2' , 'y3' ]
})
x
key val_x
0 1 x1
1 2 x2
2 4 x3
x = DataFrame (
key = 1 : 3 ,
val_x = ["x1" , "x2" , "x3" ]
)
3×2 DataFrame
Row │ key val_x
│ Int64 String
─────┼───────────────
1 │ 1 x1
2 │ 2 x2
3 │ 3 x3
y = DataFrame (
key = [1 , 2 , 4 ],
val_y = ["y1" , "y2" , "y3" ]
)
3×2 DataFrame
Row │ key val_y
│ Int64 String
─────┼───────────────
1 │ 1 y1
2 │ 2 y2
3 │ 4 y3
Inner join
An inner join matches pairs of observations whenever their keys are equal:
inner_join (x, y, by = "key" )
# A tibble: 2 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
Same as
x |> inner_join (y, by = "key" )
x.join(y.set_index('key' ), on = 'key' , how = 'inner' )
key val_x val_y
0 1 x1 y1
1 2 x2 y2
innerjoin (x, y, on = : key)
2×3 DataFrame
Row │ key val_x val_y
│ Int64 String String
─────┼───────────────────────
1 │ 1 x1 y1
2 │ 2 x2 y2
Outer join
An outer join keeps observations that appear in at least one of the tables.
Three types of outer joins: left join, right join, and full join.
A left join keeps all observations in x
.
left_join (x, y, by = "key" )
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
x.join(y.set_index('key' ), on = 'key' , how = 'left' )
key val_x val_y
0 1 x1 y1
1 2 x2 y2
2 4 x3 NaN
leftjoin (x, y, on = : key)
3×3 DataFrame
Row │ key val_x val_y
│ Int64 String String?
─────┼────────────────────────
1 │ 1 x1 y1
2 │ 2 x2 y2
3 │ 3 x3 missing
A right join keeps all observations in y
.
right_join (x, y, by = "key" )
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y3
x.join(y.set_index('key' ), on = 'key' , how = 'right' )
key val_x val_y
0.0 1 x1 y1
1.0 2 x2 y2
NaN 3 NaN y3
rightjoin (x, y, on = : key)
3×3 DataFrame
Row │ key val_x val_y
│ Int64 String? String
─────┼────────────────────────
1 │ 1 x1 y1
2 │ 2 x2 y2
3 │ 4 missing y3
A full join keeps all observations in x
or y
.
full_join (x, y, by = "key" )
# A tibble: 4 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y3
x.join(y.set_index('key' ), on = 'key' , how = 'outer' )
key val_x val_y
0.0 1 x1 y1
1.0 2 x2 y2
2.0 4 x3 NaN
NaN 3 NaN y3
outerjoin (x, y, on = : key)
4×3 DataFrame
Row │ key val_x val_y
│ Int64 String? String?
─────┼─────────────────────────
1 │ 1 x1 y1
2 │ 2 x2 y2
3 │ 3 x3 missing
4 │ 4 missing y3
Duplicate keys
x <- tribble (
~ key, ~ val_x,
1 , "x1" ,
2 , "x2" ,
2 , "x3" ,
1 , "x4"
)
x
# A tibble: 4 × 2
key val_x
<dbl> <chr>
1 1 x1
2 2 x2
3 2 x3
4 1 x4
y <- tribble (
~ key, ~ val_y,
1 , "y1" ,
2 , "y2"
)
y
# A tibble: 2 × 2
key val_y
<dbl> <chr>
1 1 y1
2 2 y2
left_join (x, y, by = "key" )
# A tibble: 4 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 2 x3 y2
4 1 x4 y1
x = pd.DataFrame({
'key' : [1 , 2 , 2 , 1 ],
'val_x' : ["x1" , "x2" , "x3" , "x4" ]
})
x
key val_x
0 1 x1
1 2 x2
2 2 x3
3 1 x4
y = pd.DataFrame({
'key' : [1 , 2 ],
'val_y' : ["y1" , "y2" ]
})
y
x.join(y.set_index('key' ), on = 'key' , how = 'left' )
key val_x val_y
0 1 x1 y1
1 2 x2 y2
2 2 x3 y2
3 1 x4 y1
x = DataFrame (
key = [1 , 2 , 2 , 1 ],
val_x = ["x1" , "x2" , "x3" , "x4" ]
)
4×2 DataFrame
Row │ key val_x
│ Int64 String
─────┼───────────────
1 │ 1 x1
2 │ 2 x2
3 │ 2 x3
4 │ 1 x4
y = DataFrame (
key = [1 , 2 ],
val_y = ["y1" , "y2" ]
)
2×2 DataFrame
Row │ key val_y
│ Int64 String
─────┼───────────────
1 │ 1 y1
2 │ 2 y2
leftjoin (x, y, on = : key)
4×3 DataFrame
Row │ key val_x val_y
│ Int64 String String?
─────┼────────────────────────
1 │ 1 x1 y1
2 │ 2 x2 y2
3 │ 2 x3 y2
4 │ 1 x4 y1
x <- tribble (
~ key, ~ val_x,
1 , "x1" ,
2 , "x2" ,
2 , "x3" ,
3 , "x4"
)
y <- tribble (
~ key, ~ val_y,
1 , "y1" ,
2 , "y2" ,
2 , "y3" ,
3 , "y4"
)
left_join (x, y, by = "key" )
Warning in left_join(x, y, by = "key"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 6 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 2 x2 y3
4 2 x3 y2
5 2 x3 y3
6 3 x4 y4
x = pd.DataFrame({
'key' : [1 , 2 , 2 , 3 ],
'val_x' : ["x1" , "x2" , "x3" , "x4" ]
})
x
key val_x
0 1 x1
1 2 x2
2 2 x3
3 3 x4
y = pd.DataFrame({
'key' : [1 , 2 , 2 , 3 ],
'val_y' : ["y1" , "y2" , "y3" , "y4" ]
})
y
key val_y
0 1 y1
1 2 y2
2 2 y3
3 3 y4
x.join(y.set_index('key' ), on = 'key' , how = 'left' )
key val_x val_y
0 1 x1 y1
1 2 x2 y2
1 2 x2 y3
2 2 x3 y2
2 2 x3 y3
3 3 x4 y4
x = DataFrame (
key = [1 , 2 , 2 , 3 ],
val_x = ["x1" , "x2" , "x3" , "x4" ]
)
4×2 DataFrame
Row │ key val_x
│ Int64 String
─────┼───────────────
1 │ 1 x1
2 │ 2 x2
3 │ 2 x3
4 │ 3 x4
y = DataFrame (
key = [1 , 2 , 2 , 3 ],
val_y = ["y1" , "y2" , "y3" , "y4" ]
)
4×2 DataFrame
Row │ key val_y
│ Int64 String
─────┼───────────────
1 │ 1 y1
2 │ 2 y2
3 │ 2 y3
4 │ 3 y4
leftjoin (x, y, on = : key)
6×3 DataFrame
Row │ key val_x val_y
│ Int64 String String?
─────┼────────────────────────
1 │ 1 x1 y1
2 │ 2 x2 y2
3 │ 2 x3 y2
4 │ 2 x2 y3
5 │ 2 x3 y3
6 │ 3 x4 y4
Let’s create a narrower table from the flights data:
flights2 <- flights |>
select (year: day, hour, origin, dest, tailnum, carrier) |>
print (width = Inf )
# A tibble: 336,776 × 8
year month day hour origin dest tailnum carrier
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA
2 2013 1 1 5 LGA IAH N24211 UA
3 2013 1 1 5 JFK MIA N619AA AA
4 2013 1 1 5 JFK BQN N804JB B6
5 2013 1 1 6 LGA ATL N668DN DL
6 2013 1 1 5 EWR ORD N39463 UA
7 2013 1 1 6 EWR FLL N516JB B6
8 2013 1 1 6 LGA IAD N829AS EV
9 2013 1 1 6 JFK MCO N593JB B6
10 2013 1 1 6 LGA ORD N3ALAA AA
# ℹ 336,766 more rows
flights2 = flights[['year' , 'month' , 'day' , 'hour' , 'origin' , 'dest' , 'tailnum' , 'carrier' ]]
flights2
year month day hour origin dest tailnum carrier
0 2013 1 1 5 EWR IAH N14228 UA
1 2013 1 1 5 LGA IAH N24211 UA
2 2013 1 1 5 JFK MIA N619AA AA
3 2013 1 1 5 JFK BQN N804JB B6
4 2013 1 1 6 LGA ATL N668DN DL
... ... ... ... ... ... ... ... ...
336771 2013 9 30 14 JFK DCA NaN 9E
336772 2013 9 30 22 LGA SYR NaN 9E
336773 2013 9 30 12 LGA BNA N535MQ MQ
336774 2013 9 30 11 LGA CLE N511MQ MQ
336775 2013 9 30 8 LGA RDU N839MQ MQ
[336776 rows x 8 columns]
flights2 = select (
flights,
Between (: year, : day),
: hour,
: origin,
: dest,
: tailnum,
: carrier
)
336776×8 DataFrame
Row │ year month day hour origin dest tailnum carrier
│ Int64 Int64 Int64 Float64 String String String? String
────────┼────────────────────────────────────────────────────────────────
1 │ 2013 1 1 5.0 EWR IAH N14228 UA
2 │ 2013 1 1 5.0 LGA IAH N24211 UA
3 │ 2013 1 1 5.0 JFK MIA N619AA AA
4 │ 2013 1 1 5.0 JFK BQN N804JB B6
5 │ 2013 1 1 6.0 LGA ATL N668DN DL
6 │ 2013 1 1 5.0 EWR ORD N39463 UA
7 │ 2013 1 1 6.0 EWR FLL N516JB B6
8 │ 2013 1 1 6.0 LGA IAD N829AS EV
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
336770 │ 2013 9 30 23.0 JFK PSE N516JB B6
336771 │ 2013 9 30 18.0 LGA BNA N740EV EV
336772 │ 2013 9 30 14.0 JFK DCA missing 9E
336773 │ 2013 9 30 22.0 LGA SYR missing 9E
336774 │ 2013 9 30 12.0 LGA BNA N535MQ MQ
336775 │ 2013 9 30 11.0 LGA CLE N511MQ MQ
336776 │ 2013 9 30 8.0 LGA RDU N839MQ MQ
336761 rows omitted
We want to merge with the weather
table:
# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
origin year month day hour temp dewp humid wind_dir \
0 EWR 2013 1 1 1 39.02 26.06 59.37 270.0
1 EWR 2013 1 1 2 39.02 26.96 61.63 250.0
2 EWR 2013 1 1 3 39.02 28.04 64.43 240.0
3 EWR 2013 1 1 4 39.92 28.04 62.21 250.0
4 EWR 2013 1 1 5 39.02 28.04 64.43 260.0
... ... ... ... ... ... ... ... ... ...
26110 LGA 2013 12 30 14 35.96 19.94 51.78 340.0
26111 LGA 2013 12 30 15 33.98 17.06 49.51 330.0
26112 LGA 2013 12 30 16 32.00 15.08 49.19 340.0
26113 LGA 2013 12 30 17 30.92 12.92 46.74 320.0
26114 LGA 2013 12 30 18 28.94 10.94 46.41 330.0
wind_speed wind_gust precip pressure visib time_hour
0 10.35702 NaN 0.0 1012.0 10.0 2013-01-01T06:00:00Z
1 8.05546 NaN 0.0 1012.3 10.0 2013-01-01T07:00:00Z
2 11.50780 NaN 0.0 1012.5 10.0 2013-01-01T08:00:00Z
3 12.65858 NaN 0.0 1012.2 10.0 2013-01-01T09:00:00Z
4 12.65858 NaN 0.0 1011.9 10.0 2013-01-01T10:00:00Z
... ... ... ... ... ... ...
26110 13.80936 21.86482 0.0 1017.1 10.0 2013-12-30T19:00:00Z
26111 17.26170 21.86482 0.0 1018.8 10.0 2013-12-30T20:00:00Z
26112 14.96014 23.01560 0.0 1019.5 10.0 2013-12-30T21:00:00Z
26113 17.26170 NaN 0.0 1019.9 10.0 2013-12-30T22:00:00Z
26114 18.41248 NaN 0.0 1020.9 10.0 2013-12-30T23:00:00Z
[26115 rows x 15 columns]
26115×15 DataFrame
Row │ origin year month day hour temp dewp humid win ⋯
│ String Int64 Int64 Int64 Int64 Float64? Float64? Float64? Flo ⋯
───────┼────────────────────────────────────────────────────────────────────────
1 │ EWR 2013 1 1 1 39.02 26.06 59.37 ⋯
2 │ EWR 2013 1 1 2 39.02 26.96 61.63
3 │ EWR 2013 1 1 3 39.02 28.04 64.43
4 │ EWR 2013 1 1 4 39.92 28.04 62.21
5 │ EWR 2013 1 1 5 39.02 28.04 64.43 ⋯
6 │ EWR 2013 1 1 6 37.94 28.04 67.21
7 │ EWR 2013 1 1 7 39.02 28.04 64.43
8 │ EWR 2013 1 1 8 39.92 28.04 62.21
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
26109 │ LGA 2013 12 30 12 37.94 23.0 54.51 ⋯
26110 │ LGA 2013 12 30 13 37.04 21.92 53.97
26111 │ LGA 2013 12 30 14 35.96 19.94 51.78
26112 │ LGA 2013 12 30 15 33.98 17.06 49.51
26113 │ LGA 2013 12 30 16 32.0 15.08 49.19 ⋯
26114 │ LGA 2013 12 30 17 30.92 12.92 46.74
26115 │ LGA 2013 12 30 18 28.94 10.94 46.41
7 columns and 26100 rows omitted
Defining the key columns
by = NULL
(default): use all variables that appear in both tables:
# same as: flights2 |> left_join(weather)
left_join (flights2, weather)
# A tibble: 336,776 × 18
year month day hour origin dest tailnum carrier temp dewp humid
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
# ℹ 336,766 more rows
# ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
by = "x"
: use the common variable x
:
# same as: flights2 |> left_join(weather)
left_join (flights2, planes, by = "tailnum" )
# A tibble: 336,776 × 16
year.x month day hour origin dest tailnum carrier year.y type
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
# seats <int>, speed <int>, engine <chr>
by = c("a" = "b")
: match variable a
in table x
to the variable b
in table y
.
# same as: flights2 |> left_join(weather)
left_join (flights2, airports, by = c ("dest" = "faa" ))
# A tibble: 336,776 × 15
year month day hour origin dest tailnum carrier name lat lon alt
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
Match multiple keys using multi-index:
keys = ['origin' , 'year' , 'month' , 'day' , 'hour' ]
flights2.join(
weather.set_index(keys),
on = keys,
how = 'left' )
year month day hour origin dest tailnum carrier temp dewp \
0 2013 1 1 5 EWR IAH N14228 UA 39.02 28.04
1 2013 1 1 5 LGA IAH N24211 UA 39.92 24.98
2 2013 1 1 5 JFK MIA N619AA AA 39.02 26.96
3 2013 1 1 5 JFK BQN N804JB B6 39.02 26.96
4 2013 1 1 6 LGA ATL N668DN DL 39.92 24.98
... ... ... ... ... ... ... ... ... ... ...
336771 2013 9 30 14 JFK DCA NaN 9E 68.00 55.04
336772 2013 9 30 22 LGA SYR NaN 9E 64.94 53.06
336773 2013 9 30 12 LGA BNA N535MQ MQ 69.08 48.02
336774 2013 9 30 11 LGA CLE N511MQ MQ 66.92 48.92
336775 2013 9 30 8 LGA RDU N839MQ MQ 60.98 51.08
humid wind_dir wind_speed wind_gust precip pressure visib \
0 64.43 260.0 12.65858 NaN 0.0 1011.9 10.0
1 54.81 250.0 14.96014 21.86482 0.0 1011.4 10.0
2 61.63 260.0 14.96014 NaN 0.0 1012.1 10.0
3 61.63 260.0 14.96014 NaN 0.0 1012.1 10.0
4 54.81 260.0 16.11092 23.01560 0.0 1011.7 10.0
... ... ... ... ... ... ... ...
336771 63.21 190.0 11.50780 NaN 0.0 1016.6 10.0
336772 65.37 200.0 6.90468 NaN 0.0 1015.8 10.0
336773 46.99 70.0 5.75390 NaN 0.0 1016.7 10.0
336774 52.35 70.0 8.05546 NaN 0.0 1017.5 10.0
336775 69.86 NaN 5.75390 NaN 0.0 1018.6 10.0
time_hour
0 2013-01-01T10:00:00Z
1 2013-01-01T10:00:00Z
2 2013-01-01T10:00:00Z
3 2013-01-01T10:00:00Z
4 2013-01-01T11:00:00Z
... ...
336771 2013-09-30T18:00:00Z
336772 2013-10-01T02:00:00Z
336773 2013-09-30T16:00:00Z
336774 2013-09-30T15:00:00Z
336775 2013-09-30T12:00:00Z
[336776 rows x 18 columns]
Match the common variable tailnum
:
flights2.join(
planes.set_index('tailnum' ),
on = 'tailnum' ,
how = 'left' ,
lsuffix = '_x' ,
rsuffix = '_y'
)
year_x month day hour origin dest tailnum carrier year_y \
0 2013 1 1 5 EWR IAH N14228 UA 1999.0
1 2013 1 1 5 LGA IAH N24211 UA 1998.0
2 2013 1 1 5 JFK MIA N619AA AA 1990.0
3 2013 1 1 5 JFK BQN N804JB B6 2012.0
4 2013 1 1 6 LGA ATL N668DN DL 1991.0
... ... ... ... ... ... ... ... ... ...
336771 2013 9 30 14 JFK DCA NaN 9E NaN
336772 2013 9 30 22 LGA SYR NaN 9E NaN
336773 2013 9 30 12 LGA BNA N535MQ MQ NaN
336774 2013 9 30 11 LGA CLE N511MQ MQ NaN
336775 2013 9 30 8 LGA RDU N839MQ MQ NaN
type manufacturer model engines seats speed \
0 Fixed wing multi engine BOEING 737-824 2.0 149.0 NaN
1 Fixed wing multi engine BOEING 737-824 2.0 149.0 NaN
2 Fixed wing multi engine BOEING 757-223 2.0 178.0 NaN
3 Fixed wing multi engine AIRBUS A320-232 2.0 200.0 NaN
4 Fixed wing multi engine BOEING 757-232 2.0 178.0 NaN
... ... ... ... ... ... ...
336771 NaN NaN NaN NaN NaN NaN
336772 NaN NaN NaN NaN NaN NaN
336773 NaN NaN NaN NaN NaN NaN
336774 NaN NaN NaN NaN NaN NaN
336775 NaN NaN NaN NaN NaN NaN
engine
0 Turbo-fan
1 Turbo-fan
2 Turbo-fan
3 Turbo-fan
4 Turbo-fan
... ...
336771 NaN
336772 NaN
336773 NaN
336774 NaN
336775 NaN
[336776 rows x 16 columns]
Match variable a
in table x
to the variable b
in table y
.
flights2.set_index(
'dest'
).join(
airports.set_index('faa' ),
how = 'left'
)
year month day hour origin tailnum carrier \
dest
ABQ 2013 10 1 20 JFK N554JB B6
ABQ 2013 10 2 20 JFK N607JB B6
ABQ 2013 10 3 20 JFK N591JB B6
ABQ 2013 10 4 20 JFK N662JB B6
ABQ 2013 10 5 19 JFK N580JB B6
... ... ... ... ... ... ... ...
XNA 2013 9 29 17 LGA N725MQ MQ
XNA 2013 9 30 7 LGA N735MQ MQ
XNA 2013 9 30 8 EWR N14117 EV
XNA 2013 9 30 15 LGA N725MQ MQ
XNA 2013 9 30 17 LGA N720MQ MQ
name lat lon alt tz \
dest
ABQ Albuquerque International Sunport 35.040222 -106.609194 5355.0 -7.0
ABQ Albuquerque International Sunport 35.040222 -106.609194 5355.0 -7.0
ABQ Albuquerque International Sunport 35.040222 -106.609194 5355.0 -7.0
ABQ Albuquerque International Sunport 35.040222 -106.609194 5355.0 -7.0
ABQ Albuquerque International Sunport 35.040222 -106.609194 5355.0 -7.0
... ... ... ... ... ...
XNA NW Arkansas Regional 36.281869 -94.306811 1287.0 -6.0
XNA NW Arkansas Regional 36.281869 -94.306811 1287.0 -6.0
XNA NW Arkansas Regional 36.281869 -94.306811 1287.0 -6.0
XNA NW Arkansas Regional 36.281869 -94.306811 1287.0 -6.0
XNA NW Arkansas Regional 36.281869 -94.306811 1287.0 -6.0
dst tzone
dest
ABQ A America/Denver
ABQ A America/Denver
ABQ A America/Denver
ABQ A America/Denver
ABQ A America/Denver
... .. ...
XNA A America/Chicago
XNA A America/Chicago
XNA A America/Chicago
XNA A America/Chicago
XNA A America/Chicago
[336776 rows x 14 columns]
Match multiple variables:
leftjoin (
flights2,
weather,
on = [: year, : month, : day, : hour, : origin]
)
336776×18 DataFrame
Row │ year month day hour origin dest tailnum carrier temp ⋯
│ Int64 Int64 Int64 Float64 String String String? String Floa ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 5.0 EWR IAH N14228 UA ⋯
2 │ 2013 1 1 5.0 LGA IAH N24211 UA
3 │ 2013 1 1 5.0 JFK MIA N619AA AA
4 │ 2013 1 1 5.0 JFK BQN N804JB B6
5 │ 2013 1 1 6.0 LGA ATL N668DN DL ⋯
6 │ 2013 1 1 5.0 EWR ORD N39463 UA
7 │ 2013 1 1 6.0 EWR FLL N516JB B6
8 │ 2013 1 1 6.0 LGA IAD N829AS EV
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 2 16.0 EWR CVG N14573 EV miss ⋯
336771 │ 2013 9 2 16.0 EWR SAV N11547 EV miss
336772 │ 2013 9 2 16.0 EWR ROC N11548 EV miss
336773 │ 2013 9 2 16.0 EWR GSO N13978 EV miss
336774 │ 2013 9 2 16.0 EWR ATL N13202 EV miss ⋯
336775 │ 2013 9 2 16.0 EWR CVG missing 9E miss
336776 │ 2013 9 2 16.0 EWR PHX missing US miss
10 columns and 336761 rows omitted
Match the common variable tailnum
:
leftjoin (
flights2,
planes,
on = : tailnum,
makeunique = true ,
matchmissing = : notequal
)
336776×16 DataFrame
Row │ year month day hour origin dest tailnum carrier year ⋯
│ Int64 Int64 Int64 Float64 String String String? String Int6 ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 5.0 EWR IAH N14228 UA 1 ⋯
2 │ 2013 1 1 5.0 LGA IAH N24211 UA 1
3 │ 2013 1 1 5.0 JFK MIA N619AA AA 1
4 │ 2013 1 1 5.0 JFK BQN N804JB B6 2
5 │ 2013 1 1 6.0 LGA ATL N668DN DL 1 ⋯
6 │ 2013 1 1 5.0 EWR ORD N39463 UA 2
7 │ 2013 1 1 6.0 EWR FLL N516JB B6 2
8 │ 2013 1 1 6.0 LGA IAD N829AS EV 1
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 21.0 JFK DCA N807MQ MQ miss ⋯
336771 │ 2013 9 30 21.0 LGA BNA N532MQ MQ miss
336772 │ 2013 9 30 14.0 JFK DCA missing 9E miss
336773 │ 2013 9 30 22.0 LGA SYR missing 9E miss
336774 │ 2013 9 30 12.0 LGA BNA N535MQ MQ miss ⋯
336775 │ 2013 9 30 11.0 LGA CLE N511MQ MQ miss
336776 │ 2013 9 30 8.0 LGA RDU N839MQ MQ miss
8 columns and 336761 rows omitted
Match variable a
in table x
to the variable b
in table y
.
leftjoin (
flights2,
airports,
on = : dest => : faa
)
336776×15 DataFrame
Row │ year month day hour origin dest tailnum carrier name ⋯
│ Int64 Int64 Int64 Float64 String String String? String Stri ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 5.0 EWR IAH N14228 UA Geor ⋯
2 │ 2013 1 1 5.0 LGA IAH N24211 UA Geor
3 │ 2013 1 1 5.0 JFK MIA N619AA AA Miam
4 │ 2013 1 1 6.0 LGA ATL N668DN DL Hart
5 │ 2013 1 1 5.0 EWR ORD N39463 UA Chic ⋯
6 │ 2013 1 1 6.0 EWR FLL N516JB B6 Fort
7 │ 2013 1 1 6.0 LGA IAD N829AS EV Wash
8 │ 2013 1 1 6.0 JFK MCO N593JB B6 Orla
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
336770 │ 2013 9 30 8.0 JFK SJU N3772H DL miss ⋯
336771 │ 2013 9 30 14.0 JFK SJU N652JB B6 miss
336772 │ 2013 9 30 15.0 JFK SJU N5FTAA AA miss
336773 │ 2013 9 30 15.0 JFK SJU N3756 DL miss
336774 │ 2013 9 30 20.0 JFK SJU N396DA DL miss ⋯
336775 │ 2013 9 30 21.0 JFK SJU N633JB B6 miss
336776 │ 2013 9 30 23.0 JFK PSE N516JB B6 miss
7 columns and 336761 rows omitted
Combine cases (rows)
Top 10 most popular destinations:
top_dest <- flights |>
count (dest, sort = TRUE ) |>
head (10 ) |>
print ()
# A tibble: 10 × 2
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
top_dest = flights.groupby('dest' )['dest' ].count(
).to_frame(
name = 'n'
).reset_index(
).sort_values(
'n' ,
ascending = False
).head(10 )
top_dest
dest n
69 ORD 17283
4 ATL 17215
49 LAX 16174
11 BOS 15508
54 MCO 14082
23 CLT 14064
90 SFO 13331
35 FLL 12055
58 MIA 11728
28 DCA 9705
top_dest = @pipe flights |>
groupby (_, : dest) |>
combine (_, nrow) |>
sort (_, : nrow, rev = true ) |>
first (_, 10 )
10×2 DataFrame
Row │ dest nrow
│ String Int64
─────┼───────────────
1 │ ORD 17283
2 │ ATL 17215
3 │ LAX 16174
4 │ BOS 15508
5 │ MCO 14082
6 │ CLT 14064
7 │ SFO 13331
8 │ FLL 12055
9 │ MIA 11728
10 │ DCA 9705
How to filter the cases that fly to these destinations?
Semi-join
semi_join (flights, top_dest)
# A tibble: 141,145 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 542 540 2 923 850
2 2013 1 1 554 600 -6 812 837
3 2013 1 1 554 558 -4 740 728
4 2013 1 1 555 600 -5 913 854
5 2013 1 1 557 600 -3 838 846
6 2013 1 1 558 600 -2 753 745
7 2013 1 1 558 600 -2 924 917
8 2013 1 1 558 600 -2 923 937
9 2013 1 1 559 559 0 702 706
10 2013 1 1 600 600 0 851 858
# ℹ 141,135 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights.loc[flights['dest' ].isin(top_dest['dest' ])]
year month day dep_time sched_dep_time dep_delay arr_time \
2 2013 1 1 542.0 540 2.0 923.0
4 2013 1 1 554.0 600 -6.0 812.0
5 2013 1 1 554.0 558 -4.0 740.0
6 2013 1 1 555.0 600 -5.0 913.0
8 2013 1 1 557.0 600 -3.0 838.0
... ... ... ... ... ... ... ...
336755 2013 9 30 2149.0 2156 -7.0 2245.0
336762 2013 9 30 2233.0 2113 80.0 112.0
336763 2013 9 30 2235.0 2001 154.0 59.0
336768 2013 9 30 2307.0 2255 12.0 2359.0
336771 2013 9 30 NaN 1455 NaN NaN
sched_arr_time arr_delay carrier flight tailnum origin dest \
2 850 33.0 AA 1141 N619AA JFK MIA
4 837 -25.0 DL 461 N668DN LGA ATL
5 728 12.0 UA 1696 N39463 EWR ORD
6 854 19.0 B6 507 N516JB EWR FLL
8 846 -8.0 B6 79 N593JB JFK MCO
... ... ... ... ... ... ... ...
336755 2308 -23.0 UA 523 N813UA EWR BOS
336762 30 42.0 UA 471 N578UA EWR SFO
336763 2249 130.0 B6 1083 N804JB JFK MCO
336768 2358 1.0 B6 718 N565JB JFK BOS
336771 1634 NaN 9E 3393 NaN JFK DCA
air_time distance hour minute time_hour
2 160.0 1089 5 40 2013-01-01 10:00:00+00:00
4 116.0 762 6 0 2013-01-01 11:00:00+00:00
5 150.0 719 5 58 2013-01-01 10:00:00+00:00
6 158.0 1065 6 0 2013-01-01 11:00:00+00:00
8 140.0 944 6 0 2013-01-01 11:00:00+00:00
... ... ... ... ... ...
336755 37.0 200 21 56 2013-10-01 01:00:00+00:00
336762 318.0 2565 21 13 2013-10-01 01:00:00+00:00
336763 123.0 944 20 1 2013-10-01 00:00:00+00:00
336768 33.0 187 22 55 2013-10-01 02:00:00+00:00
336771 NaN 213 14 55 2013-09-30 18:00:00+00:00
[141145 rows x 19 columns]
semijoin (flights, top_dest, on = : dest)
141145×19 DataFrame
Row │ year month day dep_time sched_dep_time dep_delay arr_time ⋯
│ Int64 Int64 Int64 Int64? Int64 Float64? Int64? ⋯
────────┼───────────────────────────────────────────────────────────────────────
1 │ 2013 1 1 542 540 2.0 923 ⋯
2 │ 2013 1 1 554 600 -6.0 812
3 │ 2013 1 1 554 558 -4.0 740
4 │ 2013 1 1 555 600 -5.0 913
5 │ 2013 1 1 557 600 -3.0 838 ⋯
6 │ 2013 1 1 558 600 -2.0 753
7 │ 2013 1 1 558 600 -2.0 924
8 │ 2013 1 1 558 600 -2.0 923
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
141139 │ 2013 9 30 2140 2140 0.0 10 ⋯
141140 │ 2013 9 30 2147 2137 10.0 30
141141 │ 2013 9 30 2149 2156 -7.0 2245
141142 │ 2013 9 30 2233 2113 80.0 112
141143 │ 2013 9 30 2235 2001 154.0 59 ⋯
141144 │ 2013 9 30 2307 2255 12.0 2359
141145 │ 2013 9 30 missing 1455 missing missing
12 columns and 141130 rows omitted
Anti-join
# Planes that are not in planes table
flights |>
anti_join (planes, by = "tailnum" ) |>
count (tailnum, sort = TRUE )
# A tibble: 722 × 2
tailnum n
<chr> <int>
1 <NA> 2512
2 N725MQ 575
3 N722MQ 513
4 N723MQ 507
5 N713MQ 483
6 N735MQ 396
7 N0EGMQ 371
8 N534MQ 364
9 N542MQ 363
10 N531MQ 349
# ℹ 712 more rows
flights.loc[- flights['tailnum' ].isin(planes['tailnum' ])].groupby('tailnum' )['tailnum' ].count().sort_values(ascending = False )
tailnum
N725MQ 575
N722MQ 513
N723MQ 507
N713MQ 483
N735MQ 396
...
N7ASAA 1
N3LFAA 1
N7ALAA 1
N7AEAA 1
N5ERAA 1
Name: tailnum, Length: 721, dtype: int64
@pipe antijoin (
flights,
planes,
on = : tailnum,
matchmissing = : notequal
) |>
groupby (_, : tailnum) |>
combine (_, nrow) |>
sort (_, : nrow, rev = true )
722×2 DataFrame
Row │ tailnum nrow
│ String? Int64
─────┼────────────────
1 │ missing 2512
2 │ N725MQ 575
3 │ N722MQ 513
4 │ N723MQ 507
5 │ N713MQ 483
6 │ N735MQ 396
7 │ N0EGMQ 371
8 │ N534MQ 364
⋮ │ ⋮ ⋮
716 │ N502SW 1
717 │ N451UW 1
718 │ N7BKAA 1
719 │ N7CAAA 1
720 │ N5FCAA 1
721 │ N5ERAA 1
722 │ N647MQ 1
707 rows omitted
Set operations
(df1 <- tribble (
~ x, ~ y,
1 , 1 ,
2 , 1
))
# A tibble: 2 × 2
x y
<dbl> <dbl>
1 1 1
2 2 1
(df2 <- tribble (
~ x, ~ y,
1 , 1 ,
1 , 2
))
# A tibble: 2 × 2
x y
<dbl> <dbl>
1 1 1
2 1 2
df1 = pd.DataFrame({
'x' : [1 , 2 ],
'y' : [1 , 1 ]
})
df1
df2 = pd.DataFrame({
'x' : [1 , 1 ],
'y' : [1 , 2 ]
})
df2
df1 = DataFrame (
x = [1 , 2 ],
y = [1 , 1 ]
)
2×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 1
df2 = DataFrame (
x = [1 , 1 ],
y = [1 , 2 ]
)
2×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
bind_rows(x, y)
stacks table x
one on top of y
.
# A tibble: 4 × 2
x y
<dbl> <dbl>
1 1 1
2 2 1
3 1 1
4 1 2
pd.concat([df1, df2], axis = 0 )
x y
0 1 1
1 2 1
0 1 1
1 1 2
4×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 1
3 │ 1 1
4 │ 1 2
intersect(x, y)
returns rows that appear in both x
and y
.
# A tibble: 1 × 2
x y
<dbl> <dbl>
1 1 1
pd.merge(df1, df2, how = 'inner' , on = ['x' , 'y' ])
DataFrame (intersect (eachrow (df1), eachrow (df2)))
1×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 1
union(x, y)
returns unique observations in x
and y
.
# A tibble: 3 × 2
x y
<dbl> <dbl>
1 1 1
2 2 1
3 1 2
pd.merge(df1, df2, how = 'outer' , on = ['x' , 'y' ])
DataFrame (union (eachrow (df1), eachrow (df2)))
3×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 1
3 │ 1 2
setdiff(x, y)
returns rows that appear in x
but not in y
.
# A tibble: 1 × 2
x y
<dbl> <dbl>
1 2 1
# A tibble: 1 × 2
x y
<dbl> <dbl>
1 1 2
Not sure how to do this elegantly.
DataFrame (setdiff (eachrow (df1), eachrow (df2)))
1×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 2 1
DataFrame (setdiff (eachrow (df2), eachrow (df1)))
1×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 2