Data Visualization With ggplot2 (and Seaborn)

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

January 26, 2023

Display machine information for reproducibility.

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.30     lifecycle_1.0.3   jsonlite_1.8.4    magrittr_2.0.3   
 [5] evaluate_0.18     rlang_1.0.6       stringi_1.7.8     cli_3.4.1        
 [9] rstudioapi_0.14   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[13] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[17] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[21] knitr_1.41       
import IPython
print(IPython.sys_info())
{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Library/Frameworks/Python.framework/Versions/3.10/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.10.9 (v3.10.9:1dd9be6584, Dec  6 2022, 14:37:36) [Clang '
                '13.0.0 (clang-1300.0.29.30)]'}

We use the ggplot2 package in tidyverse for static visualization. The closest thing in Python is the plotnine library, but we mostly use the Seaborn library, which is built on matplotlib, because of its popularity in the Python data science community. For Julia users, I recommend Makie.jl.

library(tidyverse)
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.25)
# Display all columns
pd.set_option('display.max_columns', None)

A typical data science project:

1 Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.”

John Tukey

2 mpg data

  • mpg data is available from the ggplot2 package:
mpg %>% print(width = Inf)
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans      drv     cty   hwy fl   
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
 1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p    
 2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p    
 3 audi         a4           2    2008     4 manual(m6) f        20    31 p    
 4 audi         a4           2    2008     4 auto(av)   f        21    30 p    
 5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p    
 6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p    
 7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p    
 8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p    
 9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p    
10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p    
   class  
   <chr>  
 1 compact
 2 compact
 3 compact
 4 compact
 5 compact
 6 compact
 7 compact
 8 compact
 9 compact
10 compact
# … with 224 more rows
  • Tibbles are a generalized form of data frames, which are extensively used in tidyverse.
  • In Python, the same mpg data is available from the plotnine package:
from plotnine.data import mpg

mpg
    manufacturer   model  displ  year  cyl       trans drv  cty  hwy fl  \
0           audi      a4    1.8  1999    4    auto(l5)   f   18   29  p   
1           audi      a4    1.8  1999    4  manual(m5)   f   21   29  p   
2           audi      a4    2.0  2008    4  manual(m6)   f   20   31  p   
3           audi      a4    2.0  2008    4    auto(av)   f   21   30  p   
4           audi      a4    2.8  1999    6    auto(l5)   f   16   26  p   
..           ...     ...    ...   ...  ...         ...  ..  ...  ... ..   
229   volkswagen  passat    2.0  2008    4    auto(s6)   f   19   28  p   
230   volkswagen  passat    2.0  2008    4  manual(m6)   f   21   29  p   
231   volkswagen  passat    2.8  1999    6    auto(l5)   f   16   26  p   
232   volkswagen  passat    2.8  1999    6  manual(m5)   f   18   26  p   
233   volkswagen  passat    3.6  2008    6    auto(s6)   f   17   26  p   

       class  
0    compact  
1    compact  
2    compact  
3    compact  
4    compact  
..       ...  
229  midsize  
230  midsize  
231  midsize  
232  midsize  
233  midsize  

[234 rows x 11 columns]
mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   manufacturer  234 non-null    category
 1   model         234 non-null    category
 2   displ         234 non-null    float64 
 3   year          234 non-null    int64   
 4   cyl           234 non-null    int64   
 5   trans         234 non-null    category
 6   drv           234 non-null    category
 7   cty           234 non-null    int64   
 8   hwy           234 non-null    int64   
 9   fl            234 non-null    category
 10  class         234 non-null    category
dtypes: category(6), float64(1), int64(4)
memory usage: 13.7 KB

Note that the mpg data in Seaborn differs from the one in ggplot2: it has a different number of samples and different variable names.

sns.load_dataset("mpg").shape
(398, 9)
sns.load_dataset("mpg").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
  • displ: engine size, in liters.
    hwy: highway fuel efficiency, in miles per gallon (mpg).

3 Aesthetic mappings | r4ds chapter 3.3

3.1 Scatter plot

  • hwy vs displ
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy"
);
plt.show()

  • An aesthetic maps data to a specific visual feature of the plot.

  • Check available aesthetics for a geometric object by ?geom_point.

3.2 Color of points

  • Color points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy",
  hue = "class",
  height = 8
);
plt.show()

3.3 Size of points

  • Assign different sizes to points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

It is better to reverse the order of the levels, using the size_order argument.

plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy",
  size = "class",
  size_order = np.sort(np.unique(mpg['class']))[::-1],
  height = 8
);
plt.show()

3.4 Transparency of points

  • Assign different transparency levels to points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
Warning: Using alpha for a discrete variable is not advised.

The alpha argument in Seaborn only takes a single number. I don’t know an elegant way to map it to a variable, other than plotting each level of class separately with its own alpha.

plt.figure()
# alphas mapping to each level of class
cats = np.sort(mpg['class'].unique()) # levels sorted
alphas = np.linspace(0, 1, num = cats.size + 2)[1:-1]
_, ax = plt.subplots()
for cls, alpha in zip(cats, alphas):
  sns.scatterplot(
    data = mpg[mpg['class'] == cls],
    x = "displ",
    y = "hwy",
    alpha = alpha,
    ax = ax
    );
ax.legend(
  labels = cats,
  fontsize = 16
  )    
plt.show()
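The loop above can be avoided by dropping down to matplotlib and giving every point its own RGBA color. The following is only a sketch under assumptions: the small `toy` data frame is hypothetical and stands in for mpg, and the names `alpha_map` and `rgba` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import to_rgba

# Hypothetical toy data standing in for mpg (same column names)
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 5.7, 6.2],
    "hwy": [29, 31, 26, 27, 26, 25],
    "class": ["compact", "compact", "midsize", "midsize", "2seater", "2seater"],
})

# One alpha per level (sorted alphabetically), strictly between 0 and 1
levels = sorted(toy["class"].unique())
alphas = np.linspace(0, 1, num=len(levels) + 2)[1:-1]
alpha_map = dict(zip(levels, alphas))

# One RGBA color per row: same base color, alpha looked up by class
rgba = np.array([to_rgba("black", alpha=alpha_map[c]) for c in toy["class"]])

fig, ax = plt.subplots()
ax.scatter(toy["displ"], toy["hwy"], c=rgba)
ax.set(xlabel="displ", ylabel="hwy")
```

A single `scatter` call draws all points, but it does not produce a legend on its own; the stacked version above is still simpler if a legend is needed.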

3.5 Shape of points (markers)

  • Assign different shapes to points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

  • Maximum of 6 shapes at a time. By default, additional groups will go unplotted.
plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy",
  style = "class",
  # marker size
  s = 20, 
  height = 8
);
plt.show()

3.6 Manual setting of an aesthetic

  • Set the color of all points to be blue:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy",
  color = "blue", # matplotlib argument
  height = 8
);
plt.show()

4 Facets | r4ds chapter 3.5

4.1 Facets

  • Facets divide a plot into subplots based on the values of one or more discrete variables.
  • A subplot for each car type:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

  • A subplot for each car type and drive:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)

  • A subplot for each car type:
plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy",
  # Variables that define subsets to plot on different facets
  col = "class",
  col_wrap = 4
);
plt.show()

  • A subplot for each car type and drive:
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy",
  # Variables that define subsets to plot on different facets
  col = "class",
  row = "drv"
);
plt.show()

5 Geometric objects | r4ds chapter 3.6

5.1 geom_smooth(): smooth line

  • hwy vs displ line:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

We can use lmplot (a figure-level function) or regplot (an axes-level function) for regression lines.

The lowess curve looks different from the one drawn by ggplot2. How can we pass lowess parameters to the statsmodels package that Seaborn calls under the hood?

Confidence intervals cannot currently be drawn for this kind of model.

plt.figure()
sns.lmplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  scatter = False,
  lowess = True
);
plt.show()
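One way to control the smoother is to call statsmodels directly and draw the fitted curve with matplotlib. As far as I can tell, Seaborn calls `lowess` with its default settings, while geom_smooth() uses R's loess (local quadratic fits with span 0.75) for small data sets, which would explain part of the visual difference. The sketch below uses simulated data as a stand-in for displ and hwy; the values `frac = 0.3` and the variable names are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Simulated stand-in for (displ, hwy); not the real mpg data
rng = np.random.default_rng(203)
x = np.sort(rng.uniform(1.5, 7.0, size=100))
y = 40 - 3 * x + rng.normal(scale=2.0, size=100)

# frac: fraction of the data used for each local fit (default 2/3)
# it: number of robustifying iterations (default 3)
fit_default = lowess(y, x)
fit_narrow = lowess(y, x, frac=0.3, it=3)

# Each result is an (n, 2) array: column 0 is sorted x, column 1 is fitted y
fig, ax = plt.subplots()
ax.scatter(x, y, s=10, color="grey")
ax.plot(fit_default[:, 0], fit_default[:, 1], label="frac = 2/3 (default)")
ax.plot(fit_narrow[:, 0], fit_narrow[:, 1], linestyle="--", label="frac = 0.3")
ax.legend()
```

A smaller `frac` follows the data more closely; a larger one gives a smoother curve.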

5.2 Different line types

  • Different line types according to drv:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

I don’t know an elegant way to map line styles to a categorical variable, other than an explicit loop.

plt.figure()
drvs = np.sort(mpg['drv'].unique()) # levels sorted: '4', 'f', 'r'
line_styles = ["-", "--", "-."] # ["solid", "dashed", "dashdot"]
_, ax = plt.subplots()
for dr, ls in zip(drvs, line_styles):
  sns.regplot(
    data = mpg[mpg['drv'] == dr],
    x = "displ",
    y = "hwy",
    scatter = False,
    lowess = True,
    line_kws = {"ls": ls},
    ax = ax,
  )
ax.legend(
  labels = drvs,
  fontsize = 16
  )
plt.show()

5.3 Different line colors

  • Different line colors according to drv:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

plt.figure()
sns.lmplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  hue = "drv",
  scatter = False,
  lowess = True
);
plt.show()

5.4 Points and lines (together)

  • Lines overlaid over scatter plot:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

  • Same as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth()

The keyword argument scatter in the lmplot and regplot functions turns the scatter plot on or off, in addition to the fitted line.

plt.figure()
sns.lmplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  scatter = True,
  lowess = True
);
plt.show()

5.5 Aesthetics for each geometric object

  • Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  # Different color for each class
  geom_point(mapping = aes(color = class)) + 
  # Only display the line for subcompact cars
  geom_smooth(data = mpg %>% filter(class == "subcompact"), se = FALSE)

plt.figure()
ax = sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  # Different color for each class
  hue = "class"
);
sns.regplot(
  data = mpg[mpg['class'] == "subcompact"],
  x = "displ",
  y = "hwy",
  scatter = False,
  lowess = True,
  ax = ax
);
plt.show()

6 Jitter

Jitter adds random noise to X and Y position of each element to avoid over-plotting.

  • position = "jitter" applies this to geom_point():
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

  • geom_jitter() is similar:
ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy))

The closest Seaborn function I know of is stripplot. It is designed for a categorical x variable and only jitters along that axis; setting native_scale = True keeps displ on its original numeric scale.

plt.figure()
sns.stripplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  jitter = 0.5,
  size = 2.5,
  color = "black",
  native_scale = True
)
plt.show()
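To jitter both coordinates, as position = "jitter" does, one option is to add the noise ourselves before plotting. This is a minimal sketch: the data are hypothetical tied values, the helper `jitter` is not part of any library, and the jitter widths are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavily tied data, similar to rounded displ and hwy values
x = np.repeat([1.8, 2.0, 2.8], 20)
y = np.repeat([26.0, 29.0, 31.0], 20)

def jitter(values, width, rng):
    """Add uniform noise in [-width, width) to each value."""
    return values + rng.uniform(-width, width, size=len(values))

x_jit = jitter(x, width=0.05, rng=rng)
y_jit = jitter(y, width=0.4, rng=rng)

fig, ax = plt.subplots()
ax.scatter(x_jit, y_jit, s=10, color="black")
ax.set(xlabel="displ", ylabel="hwy")
```

The jittered arrays can equally be stored as new columns of a data frame and passed to sns.scatterplot.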

7 Bar plots | r4ds chapter 3.7

7.1 diamonds data

  • diamonds data:
diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows
from plotnine.data import diamonds

diamonds
diamonds.info()

7.2 Bar plot

  • geom_bar() creates bar chart:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

  • Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.

  • Check available computed variables for a geometric object via help:

?geom_bar
  • Use stat_count() directly:
ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))

  • stat_count() has a default geom geom_bar().

It is called countplot in Seaborn!

plt.figure()
sns.countplot(
  data = diamonds,
  x = "cut",
  # Single color
  color = "skyblue"
)
plt.show()

Or the newer histplot

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut"
)
plt.show()

Another high-level, figure-level function for displaying categorical variables is catplot.

plt.figure()
sns.catplot(
  data = diamonds,
  x = "cut",
  kind = "count"
);
plt.show()

  • Display frequency instead of counts:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))    

Note that the aesthetic mapping group = 1 overrides the default grouping (by cut) and treats all observations as a single group. Without it, each cut is its own group, so every bar has proportion 1:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut",
  stat = "probability",
  # shrink = .8
);
plt.show()
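The effect of group = 1 can be checked by hand with pandas. The toy series below is hypothetical and stands in for the cut column; the point is only the arithmetic behind the two versions of the proportion.

```python
import pandas as pd

# Hypothetical toy sample standing in for diamonds["cut"]
cut = pd.Series(
    ["Ideal", "Ideal", "Ideal", "Premium", "Premium", "Good"],
    name="cut",
)

counts = cut.value_counts()

# group = 1: all observations form one group, so each bar is count / total
prop_overall = counts / counts.sum()

# Default grouping (by cut): each bar is divided by its own group total
prop_within = counts / counts

print(prop_overall)
print(prop_within)
```

The first set of proportions sums to 1 across the categories, while in the second every category is trivially 100% of itself.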

  • Color bar:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))

Not sure how to do this in Python.
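One possibility is to compute the counts with pandas and draw the bars with matplotlib, whose bar function accepts one edge color per bar. This is a sketch on a hypothetical toy data frame, not a Seaborn feature; the colors come from matplotlib's default color cycle.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for diamonds
toy = pd.DataFrame({"cut": ["Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"]})
counts = toy["cut"].value_counts().sort_index()

# One outline color per bar, taken from matplotlib's default color cycle
edge_colors = [f"C{i}" for i in range(len(counts))]

fig, ax = plt.subplots()
bars = ax.bar(
    counts.index,
    counts.to_numpy(),
    color="grey",           # common fill, as in the ggplot2 version
    edgecolor=edge_colors,  # outline colored by category
    linewidth=2,
)
ax.set(xlabel="cut", ylabel="count")
```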

  • Fill color:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

By default, countplot already fills the bars with a different color for each level. For a single color, use the color argument.

plt.figure()
sns.countplot(
  data = diamonds,
  x = "cut"
)
plt.show()

  • Fill color according to another variable:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

Counts don’t look right?

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut",
  hue = "clarity",
  multiple = "stack"
)
plt.show()

7.3 geom_bar() vs geom_col()

  • geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights).
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

The height of each bar is the number of diamonds in that cut category.

  • geom_col() makes the heights of the bars represent values in the data.
ggplot(data = diamonds) + 
  geom_col(mapping = aes(x = cut, y = carat))

The height of each bar is the total carat in that cut category.

In histplot without the weights argument, the bar height is the count of each category.

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut"
)
plt.show()

With the weights argument set to carat, histplot sums carat within each category instead of counting rows.

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut",
  weights = "carat"
)
plt.show()
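The difference between counting and summing is easy to verify with a pandas groupby. The toy data frame below is hypothetical; `size()` plays the role of geom_bar() and `sum()` the role of geom_col().

```python
import pandas as pd

# Hypothetical toy data standing in for diamonds
toy = pd.DataFrame({
    "cut": ["Good", "Good", "Ideal", "Ideal", "Ideal"],
    "carat": [0.5, 0.3, 0.2, 0.4, 0.4],
})

# geom_bar() analogue: number of rows per category
n_per_cut = toy.groupby("cut").size()

# geom_col() analogue: total of the y variable per category
carat_per_cut = toy.groupby("cut")["carat"].sum()

print(n_per_cut)
print(carat_per_cut)
```

Either result can be plotted directly, for example with `carat_per_cut.plot.bar()`.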

  • position_fill() stacks elements on top of one another and normalizes the height:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

Set multiple to "fill" in histplot:

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut",
  hue = "clarity",
  multiple = "fill"
)
plt.show()

  • position_dodge() arranges elements side by side:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Set the multiple argument to "dodge" in histplot:

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut",
  hue = "clarity",
  multiple = "dodge"
)
plt.show()

  • position_stack() stacks elements on top of each other:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")

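Set multiple to "stack" in histplot. The "layer" option would instead draw the bars of every clarity level from zero, overlapping one another, so the bar heights would not add up as they do with position_stack().

plt.figure()
sns.histplot(
  data = diamonds,
  x = "cut",
  hue = "clarity",
  multiple = "stack"
)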
plt.show()

8 Box plots, violin plots

  • Recall the mpg data:
mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# … with 224 more rows
mpg

  • Boxplots (grouped by class):

Default:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

Add notches:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot(notch = TRUE)

Default:

plt.figure()
sns.boxplot(
  data = mpg,
  x = 'class',
  y = 'hwy',
)
plt.show()

Add notches:

plt.figure()
sns.boxplot(
  data = mpg,
  x = 'class',
  y = 'hwy',
  notch = True
)
plt.show()

  • Violin plots (grouped by class):
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_violin()

plt.figure()
sns.violinplot(
  data = mpg,
  x = 'class',
  y = 'hwy',
)
plt.show()

9 Coordinate systems | r4ds chapter 3.9

  • coord_cartesian() is the default Cartesian coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_cartesian(xlim = c(0, 5))

Set xlim:

plt.figure()
sns.boxplot(
  data = mpg,  
  x = "class",
  y = "hwy"
).set_xlim(-2, 7);
plt.show()

  • coord_fixed() specifies aspect ratio (y / x):
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_fixed(ratio = 1/2)

The catplot function accepts the aspect argument for the aspect ratio (width / height) of each facet.

plt.figure()
sns.catplot(
  data = mpg,  
  x = "class",
  y = "hwy",
  kind = "box",
  aspect = 0.5
)
plt.show()

  • coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_flip()

Just need to flip the x and y arguments! Looks much nicer.

plt.figure()
sns.catplot(
  data = mpg,  
  y = "class",
  x = "hwy",
  kind = "box"
)
plt.show()

  • Pie chart:
ggplot(data = mpg, mapping = aes(x = factor(1), fill = class)) + 
  geom_bar(width = 1) + 
  coord_polar("y")

Seaborn does not have a function for pie charts. Let’s use pandas groupby and matplotlib.

plt.figure()
mpg.groupby("class").size().plot.pie(autopct = "%.1f%%")
plt.show()

  • A map:
library("maps")
nz <- map_data("nz")
head(nz, 20)
       long       lat group order        region subregion
1  172.7433 -34.44215     1     1 North.Island       <NA>
2  172.7983 -34.45562     1     2 North.Island       <NA>
3  172.8528 -34.44846     1     3 North.Island       <NA>
4  172.8986 -34.41786     1     4 North.Island       <NA>
5  172.9593 -34.42503     1     5 North.Island       <NA>
6  173.0184 -34.39895     1     6 North.Island       <NA>
7  173.0229 -34.44662     1     7 North.Island       <NA>
8  173.0184 -34.49343     1     8 North.Island       <NA>
9  172.9616 -34.50426     1     9 North.Island       <NA>
10 172.9181 -34.47367     1    10 North.Island       <NA>
11 172.9353 -34.52225     1    11 North.Island       <NA>
12 172.8808 -34.51504     1    12 North.Island       <NA>
13 172.9049 -34.55646     1    13 North.Island       <NA>
14 172.9553 -34.53303     1    14 North.Island       <NA>
15 172.9376 -34.57806     1    15 North.Island       <NA>
16 172.9760 -34.61227     1    16 North.Island       <NA>
17 172.9926 -34.56723     1    17 North.Island       <NA>
18 173.0218 -34.61404     1    18 North.Island       <NA>
19 173.0396 -34.65902     1    19 North.Island       <NA>
20 173.0676 -34.70044     1    20 North.Island       <NA>
ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

  • coord_quickmap() puts maps in scale:
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

10 Maps

More extensive mapping functions are provided by the ggmap package in R.

library(ggmap)

# Path from LA to Yosemite
trek_df <- trek(
  from = "los angeles, california", 
  to = "yosemite, california", 
  structure = "route"
  )
qmap("california", zoom = 7) +
  geom_path(
    aes(x = lon, y = lat),
    colour = "blue",
    linewidth = 1.5, 
    alpha = .5,
    data = trek_df, 
    lineend = "round"
  )

Python users can check out the Cartopy package.

import cartopy.crs as ccrs

plt.figure()
ax = plt.axes(projection=ccrs.Mollweide())
ax.stock_img()
plt.show()

For interactive maps, use leaflet!

library(leaflet)

leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng = -118.44481, lat = 34.07104, popup = "Bruin")

11 Graphics for communications | r4ds chapter 28

11.1 Title

  • Figure title should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")

plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy"
).set(
  title = "Fuel efficiency generally decreases with engine size"
)
plt.show()

11.2 Subtitle and caption

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) + 
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

plt.figure()
sns.relplot(
  data = mpg,
  kind = "scatter",
  x = "displ",
  y = "hwy"
).set(
  title = "Fuel efficiency generally decreases with engine size"
)
plt.suptitle("Two seaters (sports cars) are an exception because of their light weight", fontsize = 12)
plt.show()
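The Python code above adds the subtitle but not the caption. A caption can be approximated with figure-level text in matplotlib; the sketch below uses a few hypothetical points, and the position and font size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1.8, 2.8, 5.7], [29, 26, 26])  # hypothetical points
ax.set_title("Fuel efficiency generally decreases with engine size")

# Caption: small right-aligned text at the bottom, in figure coordinates
caption = fig.text(
    0.99, 0.01,
    "Data from fueleconomy.gov",
    ha="right", va="bottom", fontsize=9,
)
# Leave room at the bottom so the caption does not overlap the x-axis
fig.subplots_adjust(bottom=0.15)
```

With a figure-level Seaborn function such as relplot, the same call should work on the `figure` attribute of the returned grid.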

11.3 Axis labels

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)"
  )

plt.figure()
sns.regplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  scatter = True,
  lowess = True
).set(
  xlabel = "Engine displacement (L)",
  ylabel = "Highway fuel economy (mpg)"
)
plt.show()

11.4 Math equations

df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )

  • ?plotmath
plt.figure()
df = pd.DataFrame({
  'x': np.random.rand(10),
  'y': np.random.rand(10)
})
sns.regplot(
  data = df,
  x = "x",
  y = "y"
).set(
  xlabel = r'$\sum_1^n x_i^2$',
  ylabel = r'$\alpha + \beta + \frac{\delta}{\theta}$'
)
plt.show()

11.5 Annotations

  • Find the most fuel efficient car in each car class:
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)
best_in_class
# A tibble: 7 × 11
# Groups:   class [7]
  manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
  <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 chevrolet    corvette      5.7  1999     8 manu… r        16    26 p     2sea…
2 dodge        caravan 2wd   2.4  1999     4 auto… f        18    24 r     mini…
3 nissan       altima        2.5  2008     4 manu… f        23    32 r     mids…
4 subaru       forester a…   2.5  2008     4 manu… 4        20    27 r     suv  
5 toyota       toyota tac…   2.7  2008     4 manu… 4        17    22 r     pick…
6 volkswagen   jetta         1.9  1999     4 manu… f        33    44 d     comp…
7 volkswagen   new beetle    1.9  1999     4 manu… f        35    44 d     subc…
  • Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)

  • ggrepel package automatically adjusts labels so that they don’t overlap:
library("ggrepel")

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)

I don’t know an easy way to annotate, other than writing a loop.

# Locate the most efficient car in each class
best_in_class = mpg.sort_values(
  by = 'hwy', 
  ascending = False
  ).groupby('class').first()
best_in_class

plt.figure()
# Scatter plot colored by class
sns.relplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  hue = "class"
)
# Loop over the rows to add text annotations
for row in best_in_class.itertuples():
  plt.text(
    x = row.displ,
    y = row.hwy,
    s = row.model
    )
plt.show()

11.6 Scales

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

ggplot2 automatically adds default scales, as if we had written:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

  • breaks
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))

plt.figure()
sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy"
).set_yticks(
  np.arange(start = 15, stop = 41, step = 5)
)
plt.show()

  • labels
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)

plt.figure()
ax = sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy"
)
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.show()

  • Plot y-axis at log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()

plt.figure()
sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy"
).set_yscale("log")
plt.show()

  • Plot x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_reverse()

plt.figure()
sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy"
).invert_xaxis()
plt.show()

11.7 Legends

  • Set legend position: "left", "right", "top", "bottom", or "none":
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) + 
  theme(legend.position = "left")

plt.figure()
ax = sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  hue = "class"
)
plt.legend(loc = "upper left")
plt.show()
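Note that loc = "upper left" places the legend inside the axes, whereas legend.position = "left" in ggplot2 places it outside the panel. A rough matplotlib equivalent, assuming a few hypothetical points and an arbitrary offset, anchors the legend just left of the axes:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Hypothetical points for two classes
ax.scatter([1.8, 2.0], [29, 31], label="compact")
ax.scatter([5.3, 5.7], [20, 18], label="suv")

# Anchor the legend's right edge slightly left of the axes (axes coordinates)
legend = ax.legend(
    title="class",
    loc="center right",
    bbox_to_anchor=(-0.15, 0.5),
)
# Widen the left margin so the legend stays inside the figure
fig.subplots_adjust(left=0.3)
```

For a legend that Seaborn has already drawn, `sns.move_legend(ax, "center right", bbox_to_anchor = (-0.15, 0.5))` should have a similar effect.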

11.8 Zooming

  • Without clipping (calculate smoothing line using all data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

plt.figure()
ax = sns.regplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  scatter = True,
  lowess = True,
)
ax.set_xlim(left = 5, right = 7)
ax.set_ylim(bottom = 10, top = 30)
plt.show()

  • With clipping (calculate smoothing line ignoring unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  xlim(5, 7) + ylim(10, 30)

ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  scale_x_continuous(limits = c(5, 7)) +
  scale_y_continuous(limits = c(10, 30))

mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()

plt.figure()
sns.regplot(
  data = mpg[(mpg["displ"] >= 5) & (mpg["displ"] <= 7) & (mpg["hwy"] >= 10) & (mpg["hwy"] <= 30)],
  x = "displ",
  y = "hwy",
  scatter = True,
  lowess = True,
)
plt.show()

11.9 Themes

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  theme_bw()

Many options exist in the theme() function for specific customization

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  theme(
    legend.position = c(0.85, 0.85), 
    legend.key = element_blank(), 
    axis.text.x = element_text(angle = 0, size = 12), 
    axis.text.y = element_text(angle=0, size = 12), 
    axis.ticks = element_blank(), 
    legend.text=element_text(size = 8),
    panel.grid.major = element_blank(), 
    panel.border = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(), 
    axis.line = element_line(color = 'black', linewidth = 0.3), 
    text = element_text(size = 13)
    )

There are five preset seaborn themes: darkgrid, whitegrid, dark, white, and ticks. They are each suited to different applications and personal preferences. The default theme is darkgrid.

sns.set_style("white")

plt.figure()
ax = sns.regplot(
  data = mpg,
  x = "displ",
  y = "hwy",
  scatter = True,
  lowess = True,
)
plt.show()

Tip

For academic papers, use the white theme in Seaborn or theme_bw in ggplot2.

11.10 Manual Colors

You may want to specify colors manually instead of relying on the defaults. A tool for picking optimally distinct colors is useful here.

Manually select colors to use

ggplot(filter(mpg, class == "suv" | class== "compact" |
                class == "pickup" | class == "minivan"), 
       aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  theme_bw() + 
  scale_color_manual(values = c("#24aad8",
  "#cb6450",
  "#80a14b",
  "#aa65ba")) 

Manually assign a color to each class by name

ggplot(filter(mpg, class == "suv" | class== "compact" |
                class == "pickup" | class == "minivan"), 
       aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  theme_bw() + 
  scale_color_manual(values = c(suv = "#24aad8",
                                pickup = "#cb6450",
                                minivan = "#80a14b",
                                compact = "#aa65ba")) 
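In Python, the same idea can be sketched with a dictionary from class to color. The toy data frame below is hypothetical; the hex codes are the ones used in the R code above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the filtered mpg
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 3.3, 3.8, 4.7, 5.2, 4.0, 5.3],
    "hwy": [29, 31, 24, 22, 17, 16, 20, 18],
    "class": ["compact", "compact", "minivan", "minivan",
              "pickup", "pickup", "suv", "suv"],
})

# Color assigned to each class by name
class_colors = {
    "suv": "#24aad8",
    "pickup": "#cb6450",
    "minivan": "#80a14b",
    "compact": "#aa65ba",
}

fig, ax = plt.subplots()
for cls, group in toy.groupby("class"):
    ax.scatter(group["displ"], group["hwy"], color=class_colors[cls], label=cls)
legend = ax.legend(title="class")
ax.set(xlabel="displ", ylabel="hwy")
```

In Seaborn, the same dictionary can be passed as the palette argument together with hue = "class".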

11.11 Saving plots

ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
sns.scatterplot(
  data = mpg,
  x = "displ",
  y = "hwy"
).get_figure().savefig("my-plot.pdf")
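ggsave() also controls the size of the saved figure. A matplotlib sketch of the same idea is shown below; the figure size, the temporary output directory, and the plotted points are assumptions made for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

# Width and height in inches
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter([1.8, 2.8, 5.7], [29, 26, 26])  # hypothetical points
ax.set(xlabel="displ", ylabel="hwy")

# Write to a temporary directory; bbox_inches="tight" trims empty margins
out_path = os.path.join(tempfile.mkdtemp(), "my-plot.pdf")
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)
```

For raster formats such as PNG, the dpi argument of savefig sets the resolution.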

12 Cheat sheet

The RStudio cheat sheet is extremely helpful.