Web Scraping

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 7, 2023

Display machine information for reproducibility.

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.30     lifecycle_1.0.3   jsonlite_1.8.4    magrittr_2.0.3   
 [5] evaluate_0.18     rlang_1.0.6       stringi_1.7.8     cli_3.4.1        
 [9] rstudioapi_0.14   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[13] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[17] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[21] knitr_1.41

Load tidyverse and other packages for this lecture.

library("tidyverse")
library("rvest")
library("quantmod")

1 Web scraping

There is a wealth of data on the internet. How do we scrape and analyze it?

2 rvest

rvest is an R package written by Hadley Wickham which makes web scraping easy.
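As a minimal sketch of the rvest workflow, the snippet below parses a small inline HTML fragment instead of a live page, so it runs offline. The fragment and its class names are made up for illustration. It assumes rvest 1.0.0 or later, where `minimal_html()` is available and `html_elements()` is the current name for the `html_nodes()` function used in the rest of these notes.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
page <- minimal_html('
  <ul>
    <li class="film"><span class="rank">1.</span> <a href="/a">Alpha</a></li>
    <li class="film"><span class="rank">2.</span> <a href="/b">Beta</a></li>
  </ul>
')

# Step 1: select nodes with a CSS selector
rank_nodes <- html_elements(page, ".film .rank")
# Step 2: extract the text inside each node
rank_text <- html_text(rank_nodes)
# Step 3: convert the text to the type we need
ranks <- as.integer(rank_text)

# The same three steps, nested, for the titles
titles <- html_text(html_elements(page, ".film a"))
```

The select, extract, convert pattern is the one repeated for every feature in the example that follows.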

3 Example: Scraping from webpage

We follow the instructions in a blog post by Saurav Kaushik to find the most popular feature films of 2020.
Install the SelectorGadget extension for Chrome.
The 100 most popular feature films released in 2020 can be accessed at page https://www.imdb.com/search/title/?title_type=feature&release_date=2020-01-01,2020-12-31&count=100.

# Specify the URL of the page to be scraped
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2020-01-01,2020-12-31&count=100"
# Reading the HTML code from the website
(webpage <- read_html(url))

{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...
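Because the live page changes over time, a later run may not reproduce the output shown in these notes. One option, sketched here under assumptions, is to save the parsed page once and re-read the local copy afterwards. The sketch uses a made-up inline fragment and a temporary file; for the real page one would call `write_html(webpage, ...)` with a file name of one's choosing.

```r
library(rvest)
library(xml2)

# Hypothetical fragment standing in for the downloaded page
page <- minimal_html('<p class="msg">hello</p>')

# Save the parsed document to disk (a temporary file here)
cache_file <- tempfile(fileext = ".html")
write_html(page, cache_file)

# Later: read the local copy instead of hitting the website again
page_cached <- read_html(cache_file)
msg <- html_text(html_elements(page_cached, ".msg"))
```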

Suppose we want to scrape the following 11 features from this page:
- Rank
- Title
- Description
- Runtime
- Genre
- Rating
- Metascore
- Votes
- Gross_Earning_in_Mil
- Director
- Actor

3.1 Rank

Use SelectorGadget to highlight the element we want to scrape.

Then use the resulting CSS selector to get the rankings:

# Use CSS selectors to scrape the rankings section
(rank_data_html <- html_nodes(webpage, '.text-primary'))

{xml_nodeset (100)}
 [1] <span class="lister-item-index unbold text-primary">1.</span>
 [2] <span class="lister-item-index unbold text-primary">2.</span>
 [3] <span class="lister-item-index unbold text-primary">3.</span>
 [4] <span class="lister-item-index unbold text-primary">4.</span>
 [5] <span class="lister-item-index unbold text-primary">5.</span>
 [6] <span class="lister-item-index unbold text-primary">6.</span>
 [7] <span class="lister-item-index unbold text-primary">7.</span>
 [8] <span class="lister-item-index unbold text-primary">8.</span>
 [9] <span class="lister-item-index unbold text-primary">9.</span>
[10] <span class="lister-item-index unbold text-primary">10.</span>
[11] <span class="lister-item-index unbold text-primary">11.</span>
[12] <span class="lister-item-index unbold text-primary">12.</span>
[13] <span class="lister-item-index unbold text-primary">13.</span>
[14] <span class="lister-item-index unbold text-primary">14.</span>
[15] <span class="lister-item-index unbold text-primary">15.</span>
[16] <span class="lister-item-index unbold text-primary">16.</span>
[17] <span class="lister-item-index unbold text-primary">17.</span>
[18] <span class="lister-item-index unbold text-primary">18.</span>
[19] <span class="lister-item-index unbold text-primary">19.</span>
[20] <span class="lister-item-index unbold text-primary">20.</span>
...

# Convert the ranking data to text
(rank_data <- html_text(rank_data_html))

  [1] "1."   "2."   "3."   "4."   "5."   "6."   "7."   "8."   "9."   "10." 
 [11] "11."  "12."  "13."  "14."  "15."  "16."  "17."  "18."  "19."  "20." 
 [21] "21."  "22."  "23."  "24."  "25."  "26."  "27."  "28."  "29."  "30." 
 [31] "31."  "32."  "33."  "34."  "35."  "36."  "37."  "38."  "39."  "40." 
 [41] "41."  "42."  "43."  "44."  "45."  "46."  "47."  "48."  "49."  "50." 
 [51] "51."  "52."  "53."  "54."  "55."  "56."  "57."  "58."  "59."  "60." 
 [61] "61."  "62."  "63."  "64."  "65."  "66."  "67."  "68."  "69."  "70." 
 [71] "71."  "72."  "73."  "74."  "75."  "76."  "77."  "78."  "79."  "80." 
 [81] "81."  "82."  "83."  "84."  "85."  "86."  "87."  "88."  "89."  "90." 
 [91] "91."  "92."  "93."  "94."  "95."  "96."  "97."  "98."  "99."  "100."

# Turn into numerical values
(rank_data <- as.integer(rank_data))

  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

3.2 Title

Use SelectorGadget to find the CSS selector .lister-item-header a.

# Using CSS selectors to scrape the title section
(title_data_html <- html_nodes(webpage, '.lister-item-header a'))

{xml_nodeset (100)}
 [1] <a href="/title/tt6723592/?ref_=adv_li_tt">Tenet</a>
 [2] <a href="/title/tt5918982/?ref_=adv_li_tt">Possessor</a>
 [3] <a href="/title/tt7923220/?ref_=adv_li_tt">Inheritance</a>
 [4] <a href="/title/tt7628504/?ref_=adv_li_tt">Megan</a>
 [5] <a href="/title/tt10886166/?ref_=adv_li_tt">365 Days</a>
 [6] <a href="/title/tt7126948/?ref_=adv_li_tt">Wonder Woman 1984</a>
 [7] <a href="/title/tt10344522/?ref_=adv_li_tt">Four Good Days</a>
 [8] <a href="/title/tt10272386/?ref_=adv_li_tt">The Father</a>
 [9] <a href="/title/tt7846844/?ref_=adv_li_tt">Enola Holmes</a>
[10] <a href="/title/tt9620292/?ref_=adv_li_tt">Promising Young Woman</a>
[11] <a href="/title/tt8503618/?ref_=adv_li_tt">Hamilton</a>
[12] <a href="/title/tt8332922/?ref_=adv_li_tt">A Quiet Place Part II</a>
[13] <a href="/title/tt7395114/?ref_=adv_li_tt">The Devil All the Time</a>
[14] <a href="/title/tt2222042/?ref_=adv_li_tt">Love and Monsters</a>
[15] <a href="/title/tt9340860/?ref_=adv_li_tt">Let Him Go</a>
[16] <a href="/title/tt7939766/?ref_=adv_li_tt">I'm Thinking of Ending Things ...
[17] <a href="/title/tt10288566/?ref_=adv_li_tt">Another Round</a>
[18] <a href="/title/tt9214832/?ref_=adv_li_tt">Emma.</a>
[19] <a href="/title/tt1502397/?ref_=adv_li_tt">Bad Boys for Life</a>
[20] <a href="/title/tt10919380/?ref_=adv_li_tt">Freaky</a>
...

# Converting the title data to text
(title_data <- html_text(title_data_html))

  [1] "Tenet"                                          
  [2] "Possessor"                                      
  [3] "Inheritance"                                    
  [4] "Megan"                                          
  [5] "365 Days"                                       
  [6] "Wonder Woman 1984"                              
  [7] "Four Good Days"                                 
  [8] "The Father"                                     
  [9] "Enola Holmes"                                   
 [10] "Promising Young Woman"                          
 [11] "Hamilton"                                       
 [12] "A Quiet Place Part II"                          
 [13] "The Devil All the Time"                         
 [14] "Love and Monsters"                              
 [15] "Let Him Go"                                     
 [16] "I'm Thinking of Ending Things"                  
 [17] "Another Round"                                  
 [18] "Emma."                                          
 [19] "Bad Boys for Life"                              
 [20] "Freaky"                                         
 [21] "Birds of Prey"                                  
 [22] "The Forgotten Battle"                           
 [23] "Nomadland"                                      
 [24] "Palm Springs"                                   
 [25] "The Dry"                                        
 [26] "Extraction"                                     
 [27] "Soul"                                           
 [28] "After We Collided"                              
 [29] "The Old Guard"                                  
 [30] "Come Play"                                      
 [31] "The Babysitter: Killer Queen"                   
 [32] "Greenland"                                      
 [33] "The Hunt"                                       
 [34] "The Midnight Sky"                               
 [35] "Ava"                                            
 [36] "Mulan"                                          
 [37] "Underwater"                                     
 [38] "Greyhound"                                      
 [39] "Black Bear"                                     
 [40] "The Empty Man"                                  
 [41] "Pieces of a Woman"                              
 [42] "Run"                                            
 [43] "The Night House"                                
 [44] "Sonic the Hedgehog"                             
 [45] "The Rental"                                     
 [46] "We Can Be Heroes"                               
 [47] "The Invisible Man"                              
 [48] "The Courier"                                    
 [49] "You Should Have Left"                           
 [50] "The New Mutants"                                
 [51] "Monster Hunter"                                 
 [52] "The Woman in the Window"                        
 [53] "Eurovision Song Contest: The Story of Fire Saga"
 [54] "Wild Mountain Thyme"                            
 [55] "The King of Staten Island"                      
 [56] "The Tax Collector"                              
 [57] "The Wrong Missy"                                
 [58] "The Trial of the Chicago 7"                     
 [59] "I Care a Lot"                                   
 [60] "The Craft: Legacy"                              
 [61] "Boss Level"                                     
 [62] "Unhinged"                                       
 [63] "Mank"                                           
 [64] "Project Power"                                  
 [65] "The Croods: A New Age"                          
 [66] "Becky"                                          
 [67] "The F**k-It List"                               
 [68] "Spenser Confidential"                           
 [69] "Shiva Baby"                                     
 [70] "Big Boys Don't Cry"                             
 [71] "The Call"                                       
 [72] "Rebecca"                                        
 [73] "The Call of the Wild"                           
 [74] "The Witches"                                    
 [75] "The Duke"                                       
 [76] "Minari"                                         
 [77] "Demon Slayer the Movie: Mugen Train"            
 [78] "Onward"                                         
 [79] "Alone"                                          
 [80] "Host"                                           
 [81] "Dolittle"                                       
 [82] "Songbird"                                       
 [83] "Hubie Halloween"                                
 [84] "Trolls World Tour"                              
 [85] "2 Hearts"                                       
 [86] "Riders of Justice"                              
 [87] "News of the World"                              
 [88] "All the Bright Places"                          
 [89] "Borat Subsequent Moviefilm"                     
 [90] "Horizon Line"                                   
 [91] "Run Sweetheart Run"                             
 [92] "Fantasy Island"                                 
 [93] "Ammonite"                                       
 [94] "Eat Wheaties!"                                  
 [95] "Holidate"                                       
 [96] "The Comeback Trail"                             
 [97] "#Alive"                                         
 [98] "Lost Girls and Love Hotels"                     
 [99] "Nowhere Special"                                
[100] "Hillbilly Elegy"
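The title links also carry each film's IMDb identifier in their `href` attribute, as the node set above shows. A possible sketch of pulling it out with `html_attr()` follows; the inline fragment mimics two of the links above, and the regular expression `tt\\d+` is an assumption about how the identifiers look.

```r
library(rvest)
library(stringr)

# Hypothetical fragment mimicking two of the title links shown above
page <- minimal_html('
  <h3 class="lister-item-header"><a href="/title/tt6723592/?ref_=adv_li_tt">Tenet</a></h3>
  <h3 class="lister-item-header"><a href="/title/tt5918982/?ref_=adv_li_tt">Possessor</a></h3>
')

links <- html_elements(page, ".lister-item-header a")
# Read the href attribute rather than the link text
hrefs <- html_attr(links, "href")
# Keep only the identifier part, e.g. "tt6723592"
title_id <- str_extract(hrefs, "tt\\d+")
```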

3.3 Description

# Using CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))

{xml_nodeset (100)}
 [1] <p class="text-muted">\nArmed with only one word, Tenet, and fighting fo ...
 [2] <p class="text-muted">\nAn agent works for a secretive organization that ...
 [3] <p class="text-muted">\nThe patriarch of a wealthy and powerful family s ...
 [4] <p class="text-muted">\nA hiker finds shelter in a mountain lodge inhabi ...
 [5] <p class="text-muted">\nMassimo is a member of the Sicilian Mafia family ...
 [6] <p class="text-muted">\nDiana must contend with a work colleague, and wi ...
 [7] <p class="text-muted">\nA mother helps her daughter work through four cr ...
 [8] <p class="text-muted">\nA man refuses all assistance from his daughter a ...
 [9] <p class="text-muted">\nWhen Enola Holmes (Sherlock's teen sister) disco ...
[10] <p class="text-muted">\nA young woman, traumatized by a tragic event in  ...
[11] <p class="text-muted">\nThe real life of one of America's foremost found ...
[12] <p class="text-muted">\nFollowing the events at home, the Abbott family  ...
[13] <p class="text-muted">\nSinister characters converge around a young man  ...
[14] <p class="text-muted">\nSeven years after he survived the monster apocal ...
[15] <p class="text-muted">\nA retired sheriff and his wife, grieving over th ...
[16] <p class="text-muted">\nFull of misgivings, a young woman travels with h ...
[17] <p class="text-muted">\nFour high-school teachers consume alcohol on a d ...
[18] <p class="text-muted">\nIn 1800s England, a well meaning but selfish you ...
[19] <p class="text-muted">\nMiami detectives Mike Lowrey and Marcus Burnett  ...
[20] <p class="text-muted">\nAfter swapping bodies with a deranged serial kil ...
...

# Converting the description data to text
description_data <- html_text(description_data_html)
# take a look at first few
head(description_data)

[1] "\nArmed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
[2] "\nAn agent works for a secretive organization that uses brain-implant technology to inhabit other people's bodies - ultimately driving them to commit assassinations for high-paying clients."                                          
[3] "\nThe patriarch of a wealthy and powerful family suddenly passes away, leaving his daughter with a shocking secret inheritance that threatens to unravel and destroy the family."                                                       
[4] "\nA hiker finds shelter in a mountain lodge inhabited by two strange women."                                                                                                                                                            
[5] "\nMassimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
[6] "\nDiana must contend with a work colleague, and with a businessman whose desire for extreme wealth sends the world down a path of destruction, after an ancient artifact that grants wishes goes missing."

# strip the leading '\n'
description_data <- str_replace(description_data, "^\\n", "")
head(description_data)

[1] "Armed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
[2] "An agent works for a secretive organization that uses brain-implant technology to inhabit other people's bodies - ultimately driving them to commit assassinations for high-paying clients."                                          
[3] "The patriarch of a wealthy and powerful family suddenly passes away, leaving his daughter with a shocking secret inheritance that threatens to unravel and destroy the family."                                                       
[4] "A hiker finds shelter in a mountain lodge inhabited by two strange women."                                                                                                                                                            
[5] "Massimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
[6] "Diana must contend with a work colleague, and with a businessman whose desire for extreme wealth sends the world down a path of destruction, after an ancient artifact that grants wishes goes missing."
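An alternative to the regular expression, shown here as a sketch on a made-up fragment, is to let `html_text()` trim leading and trailing whitespace through its `trim` argument.

```r
library(rvest)

# Hypothetical fragment with a leading newline and trailing spaces
page <- minimal_html('<p class="text-muted">
A hiker finds shelter in a mountain lodge inhabited by two strange women.    </p>')

desc_nodes <- html_elements(page, ".text-muted")
# trim = TRUE removes whitespace at both ends of each string
desc_trimmed <- html_text(desc_nodes, trim = TRUE)
```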

3.4 Runtime

Retrieve the runtime data (in minutes):

# Using CSS selectors to scrape the movie runtime section
(runtime_data <- webpage %>%
  html_nodes('.runtime') %>%
  html_text() %>%
  str_replace(" min", "") %>%
  as.integer())

  [1] 150 103 111  89 114 151 100  97 123 113 160  97 138 109 113 134 117 124
 [19] 124 102 109 124 107  90 117 116 100 105 125  96 101 119  90 118  96 115
 [37]  95  91 104 137 126  90 107  99  88 100 124 112  93  94 103 100 123 102
 [55] 136  95  90 129 118  97 100  90 131 113  95  93 103 111  77  90 112 123
 [73] 100 106  95 115 117 102  98  57 101  84 103  91 101 116 118 107  95  92
 [91] 104 109 120  88 104 104  98  97  96 116

3.5 Genre

Collect the (first) genre of each movie:

genre_data <- webpage %>%
  # Using CSS selectors to scrape the movie genre section
  html_nodes('.genre') %>%
  # Converting the genre data to text
  html_text() %>%
  # Data-Preprocessing: retrieve the first word
  str_extract("[:alpha:]+")
genre_data

  [1] "Action"    "Horror"    "Drama"     "Thriller"  "Drama"     "Action"   
  [7] "Drama"     "Drama"     "Action"    "Crime"     "Biography" "Drama"    
 [13] "Crime"     "Action"    "Crime"     "Drama"     "Comedy"    "Comedy"   
 [19] "Action"    "Comedy"    "Action"    "Drama"     "Drama"     "Comedy"   
 [25] "Crime"     "Action"    "Animation" "Drama"     "Action"    "Drama"    
 [31] "Comedy"    "Action"    "Action"    "Adventure" "Action"    "Action"   
 [37] "Action"    "Action"    "Comedy"    "Drama"     "Drama"     "Mystery"  
 [43] "Horror"    "Action"    "Drama"     "Action"    "Drama"     "Drama"    
 [49] "Horror"    "Action"    "Action"    "Crime"     "Comedy"    "Comedy"   
 [55] "Comedy"    "Action"    "Comedy"    "Drama"     "Comedy"    "Drama"    
 [61] "Action"    "Action"    "Biography" "Action"    "Animation" "Action"   
 [67] "Comedy"    "Action"    "Comedy"    "Biography" "Crime"     "Drama"    
 [73] "Adventure" "Adventure" "Biography" "Drama"     "Animation" "Animation"
 [79] "Drama"     "Horror"    "Adventure" "Drama"     "Comedy"    "Animation"
 [85] "Drama"     "Action"    "Action"    "Drama"     "Comedy"    "Action"   
 [91] "Horror"    "Fantasy"   "Biography" "Comedy"    "Comedy"    "Comedy"   
 [97] "Action"    "Drama"     "Drama"     "Drama"
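Keeping only the first word drops the remaining genres, and a hyphenated genre such as "Sci-Fi" would be cut to "Sci" by `[:alpha:]+`. A sketch of keeping every genre is given below. The raw strings are an assumption about what `html_text()` returns for the `.genre` nodes (comma-separated names padded with whitespace).

```r
library(stringr)

# Assumed shape of the raw genre text; not copied from the live page
genre_text <- c("\nAction, Sci-Fi, Thriller            ",
                "\nDrama            ")

# Trim the padding, then split on commas: one character vector per movie
genre_list <- str_split(str_trim(genre_text), ",\\s*")

# The first genre of each movie, if a single column is still wanted
first_genre <- vapply(genre_list, function(g) g[1], character(1))
```

The list `genre_list` can be stored as a list-column in a tibble when all genres are needed.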

3.6 Rating

Rating data:

rating_data <- webpage %>%
  html_nodes('.ratings-imdb-rating strong') %>%
  html_text() %>%
  as.numeric()
rating_data

  [1] 7.3 6.5 5.5 4.0 3.3 5.4 6.5 8.2 6.6 7.5 8.4 7.2 7.1 6.9 6.7 6.6 7.7 6.7
 [19] 6.5 6.3 6.0 7.1 7.3 7.4 6.8 6.7 8.0 5.0 6.6 5.7 5.8 6.4 6.5 5.6 5.4 5.7
 [37] 5.8 7.0 6.5 6.2 7.0 6.7 6.5 6.5 5.7 4.7 7.1 7.2 5.4 5.3 5.2 5.7 6.5 5.7
 [55] 7.1 4.8 5.7 7.7 6.3 4.5 6.8 6.0 6.8 6.0 6.9 5.9 5.2 6.2 7.1 6.8 7.1 6.0
 [73] 6.7 5.3 6.9 7.4 8.2 7.4 6.2 6.5 5.6 4.8 5.2 6.1 6.3 7.5 6.8 6.5 6.6 4.8
 [91] 5.4 4.9 6.5 6.1 6.1 5.7 6.3 4.7 7.4 6.7

3.7 Votes

Vote data

votes_data <- webpage %>%
  html_nodes('.sort-num_votes-visible span:nth-child(2)') %>%
  html_text() %>% 
  # remove every thousands separator, not just the first one
  str_replace_all(",", "") %>% 
  as.numeric()
votes_data

  [1] 515489  38498  16675    270  91529 272157   8263 159260 197916 178206
 [11]  97436 236703 138538 130874  29406  89879 165513  56282 163610  61483
 [21] 243352  31985 164982 163411  28171 207560 335694  33277 168794  14942
 [31]  43590 120188 117075  84778  57701 150284  85425 100136  13420  31591
 [41]  51289  82674  56616 142842  33628  15440 232283  64139  22520  81971
 [51]  61116  77062  95946  10150  70509  14094  41019 180349 135068  14453
 [61]  70237  70399  76914  90844  45677  17716   7133  90566  23480    918
 [71]  34879  43284  51128  42569  10117  83854  60368 153020  21933  32248
 [81]  66091  10625  52405  24394   6264  52929  88692  33399 143296   9987
 [91]   5019  53978  19403    766  70136   9729  41618   4742   4809  43296
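Another way to clean such strings, sketched here, is `readr::parse_number()`, which drops grouping marks and any non-numeric text around the number. The third vote count is hypothetical and only illustrates a value with two commas.

```r
library(readr)

# Vote counts as text; the last value is made up for illustration
votes_text <- c("515,489", "270", "1,234,567")
votes <- parse_number(votes_text)

# The same function also strips a trailing unit such as " min"
runtime <- parse_number(c("150 min", "89 min"))
```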

3.8 Director

Director information

directors_data <- webpage %>% 
  html_nodes('.text-muted+ p a:nth-child(1)') %>% 
  html_text()
directors_data

  [1] "Christopher Nolan"           "Brandon Cronenberg"         
  [3] "Vaughn Stein"                "Silvio Nacucchi"            
  [5] "Barbara Bialowas"            "Patty Jenkins"              
  [7] "Rodrigo García"              "Florian Zeller"             
  [9] "Harry Bradbeer"              "Emerald Fennell"            
 [11] "Thomas Kail"                 "John Krasinski"             
 [13] "Antonio Campos"              "Michael Matthews"           
 [15] "Thomas Bezucha"              "Charlie Kaufman"            
 [17] "Thomas Vinterberg"           "Autumn de Wilde"            
 [19] "Adil El Arbi"                "Christopher Landon"         
 [21] "Cathy Yan"                   "Matthijs van Heijningen Jr."
 [23] "Chloé Zhao"                  "Max Barbakow"               
 [25] "Robert Connolly"             "Sam Hargrave"               
 [27] "Pete Docter"                 "Roger Kumble"               
 [29] "Gina Prince-Bythewood"       "Jacob Chase"                
 [31] "McG"                         "Ric Roman Waugh"            
 [33] "Craig Zobel"                 "George Clooney"             
 [35] "Tate Taylor"                 "Niki Caro"                  
 [37] "William Eubank"              "Aaron Schneider"            
 [39] "Lawrence Michael Levine"     "David Prior"                
 [41] "Kornél Mundruczó"            "Aneesh Chaganty"            
 [43] "David Bruckner"              "Jeff Fowler"                
 [45] "Dave Franco"                 "Robert Rodriguez"           
 [47] "Leigh Whannell"              "Dominic Cooke"              
 [49] "David Koepp"                 "Josh Boone"                 
 [51] "Paul W.S. Anderson"          "Joe Wright"                 
 [53] "David Dobkin"                "John Patrick Shanley"       
 [55] "Judd Apatow"                 "David Ayer"                 
 [57] "Tyler Spindel"               "Aaron Sorkin"               
 [59] "J Blakeson"                  "Zoe Lister-Jones"           
 [61] "Joe Carnahan"                "Derrick Borte"              
 [63] "David Fincher"               "Henry Joost"                
 [65] "Joel Crawford"               "Jonathan Milott"            
 [67] "Michael Duggan"              "Peter Berg"                 
 [69] "Emma Seligman"               "Steve Crowhurst"            
 [71] "Chung-Hyun Lee"              "Ben Wheatley"               
 [73] "Chris Sanders"               "Robert Zemeckis"            
 [75] "Roger Michell"               "Lee Isaac Chung"            
 [77] "Haruo Sotozaki"              "Dan Scanlon"                
 [79] "John Hyams"                  "Rob Savage"                 
 [81] "Stephen Gaghan"              "Adam Mason"                 
 [83] "Steven Brill"                "Walt Dohrn"                 
 [85] "Lance Hool"                  "Anders Thomas Jensen"       
 [87] "Paul Greengrass"             "Brett Haley"                
 [89] "Jason Woliner"               "Mikael Marcimain"           
 [91] "Shana Feste"                 "Jeff Wadlow"                
 [93] "Francis Lee"                 "Scott Abramovitch"          
 [95] "John Whitesell"              "George Gallo"               
 [97] "Il Cho"                      "William Olsson"             
 [99] "Uberto Pasolini"             "Ron Howard"

3.9 Actor

Only the first actor

actors_data <- webpage %>%
  html_nodes('.lister-item-content .ghost+ a') %>%
  html_text()
actors_data

  [1] "John David Washington" "Andrea Riseborough"    "Lily Collins"         
  [4] "Sadie Katz"            "Anna-Maria Sieklucka"  "Gal Gadot"            
  [7] "Mila Kunis"            "Anthony Hopkins"       "Millie Bobby Brown"   
 [10] "Carey Mulligan"        "Lin-Manuel Miranda"    "Emily Blunt"          
 [13] "Bill Skarsgård"        "Dylan O'Brien"         "Diane Lane"           
 [16] "Jesse Plemons"         "Mads Mikkelsen"        "Anya Taylor-Joy"      
 [19] "Will Smith"            "Vince Vaughn"          "Margot Robbie"        
 [22] "Gijs Blom"             "Frances McDormand"     "Andy Samberg"         
 [25] "Eric Bana"             "Chris Hemsworth"       "Jamie Foxx"           
 [28] "Josephine Langford"    "Charlize Theron"       "Azhy Robertson"       
 [31] "Judah Lewis"           "Gerard Butler"         "Betty Gilpin"         
 [34] "George Clooney"        "Jessica Chastain"      "Liu Yifei"            
 [37] "Kristen Stewart"       "Tom Hanks"             "Aubrey Plaza"         
 [40] "James Badge Dale"      "Vanessa Kirby"         "Sarah Paulson"        
 [43] "Rebecca Hall"          "Ben Schwartz"          "Dan Stevens"          
 [46] "YaYa Gosselin"         "Elisabeth Moss"        "Benedict Cumberbatch" 
 [49] "Kevin Bacon"           "Maisie Williams"       "Milla Jovovich"       
 [52] "Amy Adams"             "Will Ferrell"          "Emily Blunt"          
 [55] "Pete Davidson"         "Bobby Soto"            "David Spade"          
 [58] "Eddie Redmayne"        "Rosamund Pike"         "Cailee Spaeny"        
 [61] "Frank Grillo"          "Russell Crowe"         "Gary Oldman"          
 [64] "Jamie Foxx"            "Nicolas Cage"          "Lulu Wilson"          
 [67] "Eli Brown"             "Mark Wahlberg"         "Rachel Sennott"       
 [70] "Daniel Adegboyega"     "Park Shin-Hye"         "Lily James"           
 [73] "Harrison Ford"         "Anne Hathaway"         "Jim Broadbent"        
 [76] "Steven Yeun"           "Natsuki Hanae"         "Tom Holland"          
 [79] "Jules Willcox"         "Haley Bishop"          "Robert Downey Jr."    
 [82] "K.J. Apa"              "Adam Sandler"          "Anna Kendrick"        
 [85] "Jacob Elordi"          "Mads Mikkelsen"        "Tom Hanks"            
 [88] "Elle Fanning"          "Sacha Baron Cohen"     "Allison Williams"     
 [91] "Ella Balinska"         "Michael Peña"          "Kate Winslet"         
 [94] "Tony Hale"             "Emma Roberts"          "Robert De Niro"       
 [97] "Yoo Ah-in"             "Alexandra Daddario"    "James Norton"         
[100] "Amy Adams"

3.10 Metascore

We encounter the issue of missing data when scraping the metascore.
We see there are only 89 metascores; 11 movies don’t have one. We could find the movies without a metascore by hand, but that’s tedious and not reproducible.

# Using CSS selectors to scrape the metascore section
ms_data_html <- html_nodes(webpage, '.metascore')
# Converting the metascore data to text
ms_data <- html_text(ms_data_html)
# Strip trailing whitespace and convert to integers
ms_data <- str_replace(ms_data, "\\s*$", "") %>% as.integer()
ms_data

 [1] 69 72 31 60 52 88 68 73 89 71 55 63 63 78 79 71 59 67 60 91 83 69 56 83 14
[26] 70 58 22 64 50 58 39 66 48 64 79 66 67 68 47 62 51 72 65 46 43 47 41 50 42
[51] 67 22 33 76 66 54 56 40 79 51 56 54 49 79 46 48 47 74 89 72 61 70 73 26 27
[76] 53 51 29 81 73 61 68 51 22 72 40 44 57 38

First, let’s list each rank together with its metascore (if present).

rank_and_metascore <- webpage %>%
  html_nodes('.unfavorable , .text-primary , .favorable , .mixed') %>%
  html_text() %>%
  str_replace("\\s*$", "") %>%
  print()

  [1] "1."   "69"   "2."   "72"   "3."   "31"   "4."   "5."   "6."   "60"  
 [11] "7."   "52"   "8."   "88"   "9."   "68"   "10."  "73"   "11."  "89"  
 [21] "12."  "71"   "13."  "55"   "14."  "63"   "15."  "63"   "16."  "78"  
 [31] "17."  "79"   "18."  "71"   "19."  "59"   "20."  "67"   "21."  "60"  
 [41] "22."  "23."  "91"   "24."  "83"   "25."  "69"   "26."  "56"   "27." 
 [51] "83"   "28."  "14"   "29."  "70"   "30."  "58"   "31."  "22"   "32." 
 [61] "64"   "33."  "50"   "34."  "58"   "35."  "39"   "36."  "66"   "37." 
 [71] "48"   "38."  "64"   "39."  "79"   "40."  "41."  "66"   "42."  "67"  
 [81] "43."  "68"   "44."  "47"   "45."  "62"   "46."  "51"   "47."  "72"  
 [91] "48."  "65"   "49."  "46"   "50."  "43"   "51."  "47"   "52."  "41"  
[101] "53."  "50"   "54."  "42"   "55."  "67"   "56."  "22"   "57."  "33"  
[111] "58."  "76"   "59."  "66"   "60."  "54"   "61."  "56"   "62."  "40"  
[121] "63."  "79"   "64."  "51"   "65."  "56"   "66."  "54"   "67."  "68." 
[131] "49"   "69."  "79"   "70."  "71."  "72."  "46"   "73."  "48"   "74." 
[141] "47"   "75."  "74"   "76."  "89"   "77."  "72"   "78."  "61"   "79." 
[151] "70"   "80."  "73"   "81."  "26"   "82."  "27"   "83."  "53"   "84." 
[161] "51"   "85."  "29"   "86."  "81"   "87."  "73"   "88."  "61"   "89." 
[171] "68"   "90."  "91."  "51"   "92."  "22"   "93."  "72"   "94."  "40"  
[181] "95."  "44"   "96."  "97."  "98."  "57"   "99."  "100." "38"

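# Rank entries end with a period, e.g. "12."
isrank <- str_detect(rank_and_metascore, "\\.$")
# A rank followed directly by another rank has no metascore
ismissing <- isrank[1:(length(rank_and_metascore) - 1)] & isrank[2:(length(rank_and_metascore))]
# The last entry has no successor; it lacks a metascore exactly when it is a rank
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
# Ranks of the movies without a metascore
missingpos <- as.integer(rank_and_metascore[ismissing])
# Fill the non-missing positions with the scraped metascores
metascore_data <- rep(NA, 100)
metascore_data[-missingpos] <- ms_data
metascore_data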

  [1] 69 72 31 NA NA 60 52 88 68 73 89 71 55 63 63 78 79 71 59 67 60 NA 91 83 69
 [26] 56 83 14 70 58 22 64 50 58 39 66 48 64 79 NA 66 67 68 47 62 51 72 65 46 43
 [51] 47 41 50 42 67 22 33 76 66 54 56 40 79 51 56 54 NA 49 79 NA NA 46 48 47 74
 [76] 89 72 61 70 73 26 27 53 51 29 81 73 61 68 NA 51 22 72 40 44 NA NA 57 NA 38
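A different route to the same result, sketched below on a made-up fragment, is to select one node per movie first and then call `html_element()` (singular) inside each of them. Unlike `html_nodes()`, it returns exactly one result per input node, with `NA` where nothing matches, so the output stays aligned with the movies. The sketch assumes rvest 1.0.0 or later and that each movie sits in its own `.lister-item-content` block.

```r
library(rvest)

# Hypothetical fragment: three movies, the second without a metascore
page <- minimal_html('
  <div class="lister-item-content">
    <span class="text-primary">1.</span>
    <span class="metascore favorable">69        </span>
  </div>
  <div class="lister-item-content">
    <span class="text-primary">2.</span>
  </div>
  <div class="lister-item-content">
    <span class="text-primary">3.</span>
    <span class="metascore mixed">52        </span>
  </div>
')

# One node per movie
items <- html_elements(page, ".lister-item-content")
# One metascore node per movie; missing where the movie has none
ms_nodes <- html_element(items, ".metascore")
# Missing nodes become NA after text extraction and conversion
metascore <- as.integer(html_text(ms_nodes, trim = TRUE))
```

Here `metascore` has one entry per movie, and no bookkeeping of missing positions is required.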

3.11 Gross

Be careful with missing data.

# Using CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
# Converting the gross revenue data to text
gross_data <- html_text(gross_data_html)
# Let's have a look at the gross data
gross_data

 [1] "$58.46M"  "$46.37M"  "$160.07M" "$1.07M"   "$206.31M" "$84.16M" 
 [7] "$2.39M"   "$0.50M"   "$17.29M"  "$148.97M" "$70.41M"  "$15.16M" 
[13] "$58.57M"  "$62.34M"  "$47.70M"  "$61.56M"  "$77.05M"  "$27.31M"

# Data-Preprocessing: removing '$' and 'M' signs
gross_data <- gross_data %>%
  str_replace("M", "") %>%
  str_sub(2, 10) %>%
  as.numeric()
# Let's have a look at the cleaned gross data
gross_data

 [1]  58.46  46.37 160.07   1.07 206.31  84.16   2.39   0.50  17.29 148.97
[11]  70.41  15.16  58.57  62.34  47.70  61.56  77.05  27.31

82 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.

(rank_and_gross <- webpage %>%
   # retrieve rank and gross
   html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
   html_text() %>%
   str_replace("\\s+", "") %>%
   str_replace_all("[$M]", ""))

  [1] "1."     "58.46"  "2."     "3."     "4."     "5."     "6."     "46.37" 
  [9] "7."     "8."     "9."     "10."    "11."    "12."    "160.07" "13."   
 [17] "14."    "1.07"   "15."    "16."    "17."    "18."    "19."    "206.31"
 [25] "20."    "21."    "84.16"  "22."    "23."    "24."    "25."    "26."   
 [33] "27."    "28."    "2.39"   "29."    "30."    "31."    "32."    "33."   
 [41] "34."    "35."    "0.50"   "36."    "37."    "17.29"  "38."    "39."   
 [49] "40."    "41."    "42."    "43."    "44."    "148.97" "45."    "46."   
 [57] "47."    "70.41"  "48."    "49."    "50."    "51."    "15.16"  "52."   
 [65] "53."    "54."    "55."    "56."    "57."    "58."    "59."    "60."   
 [73] "61."    "62."    "63."    "64."    "65."    "58.57"  "66."    "67."   
 [81] "68."    "69."    "70."    "71."    "72."    "73."    "62.34"  "74."   
 [89] "75."    "76."    "77."    "47.70"  "78."    "61.56"  "79."    "80."   
 [97] "81."    "77.05"  "82."    "83."    "84."    "85."    "86."    "87."   
[105] "88."    "89."    "90."    "91."    "92."    "27.31"  "93."    "94."   
[113] "95."    "96."    "97."    "98."    "99."    "100."

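# Ranks are the entries ending in "."
isrank <- str_detect(rank_and_gross, "\\.$")
# A rank immediately followed by another rank has no gross value
ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:(length(rank_and_gross))]
# The last entry is missing its gross value if it is itself a rank
ismissing[length(ismissing)+1] <- isrank[length(isrank)]
missingpos <- as.integer(rank_and_gross[ismissing])
# Fill the scraped values into the non-missing positions
gs_data <- rep(NA, 100)
gs_data[-missingpos] <- gross_data
(gross_data <- gs_data)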

  [1]  58.46     NA     NA     NA     NA  46.37     NA     NA     NA     NA
 [11]     NA 160.07     NA   1.07     NA     NA     NA     NA 206.31     NA
 [21]  84.16     NA     NA     NA     NA     NA     NA   2.39     NA     NA
 [31]     NA     NA     NA     NA   0.50     NA  17.29     NA     NA     NA
 [41]     NA     NA     NA 148.97     NA     NA  70.41     NA     NA     NA
 [51]  15.16     NA     NA     NA     NA     NA     NA     NA     NA     NA
 [61]     NA     NA     NA     NA  58.57     NA     NA     NA     NA     NA
 [71]     NA     NA  62.34     NA     NA     NA  47.70  61.56     NA     NA
 [81]  77.05     NA     NA     NA     NA     NA     NA     NA     NA     NA
 [91]     NA  27.31     NA     NA     NA     NA     NA     NA     NA     NA
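The adjacent-rank trick used above can be checked on a small example. The following is a minimal sketch in base R on a made-up seven-element vector (the toy values and all variable names are assumptions for illustration, not part of the scraped data). It also replaces the negative index with `setdiff()`, since `x[-integer(0)]` selects nothing and the negative-index version would therefore fail if no entry happened to be missing.

```r
# Toy interleaved vector: ranks end in ".", values follow their rank when present
rank_and_value <- c("1.", "58.46", "2.", "3.", "4.", "46.37", "5.")

# Which entries are ranks?
is_rank <- grepl("\\.$", rank_and_value)

# A rank lacks a value when the next entry is also a rank,
# or when it is the last entry of the vector
next_is_rank <- c(is_rank[-1], TRUE)
missing_rank <- as.integer(rank_and_value[is_rank & next_is_rank])

# Place the observed values at the ranks that are not missing
values <- as.numeric(rank_and_value[!is_rank])
out <- rep(NA_real_, sum(is_rank))
out[setdiff(seq_along(out), missing_rank)] <- values
out
```

Here ranks 2, 3, and 5 are flagged as missing, so the two observed values land at positions 1 and 4.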

3.12 Visualizing movie data

Form a tibble:

# Combining all the vectors to form a tibble
movies <- tibble(Rank = rank_data, 
                 Title = title_data,
                 Description = description_data, 
                 Runtime = runtime_data,
                 Genre = genre_data, 
                 Rating = rating_data,
                 Metascore = metascore_data, 
                 Votes = votes_data,
                 Gross_Earning_in_Mil = gross_data,
                 Director = directors_data, 
                 Actor = actors_data)
movies %>% print(width=Inf)

# A tibble: 100 × 11
    Rank Title                
   <int> <chr>                
 1     1 Tenet                
 2     2 Possessor            
 3     3 Inheritance          
 4     4 Megan                
 5     5 365 Days             
 6     6 Wonder Woman 1984    
 7     7 Four Good Days       
 8     8 The Father           
 9     9 Enola Holmes         
10    10 Promising Young Woman
   Description                                                                  
   <chr>                                                                        
 1 Armed with only one word, Tenet, and fighting for the survival of the entire…
 2 An agent works for a secretive organization that uses brain-implant technolo…
 3 The patriarch of a wealthy and powerful family suddenly passes away, leaving…
 4 A hiker finds shelter in a mountain lodge inhabited by two strange women.    
 5 Massimo is a member of the Sicilian Mafia family and Laura is a sales direct…
 6 Diana must contend with a work colleague, and with a businessman whose desir…
 7 A mother helps her daughter work through four crucial days of recovery from …
 8 A man refuses all assistance from his daughter as he ages. As he tries to ma…
 9 When Enola Holmes (Sherlock's teen sister) discovers her mother is missing, …
10 A young woman, traumatized by a tragic event in her past, seeks out vengeanc…
   Runtime Genre    Rating Metascore  Votes Gross_Earning_in_Mil
     <int> <chr>     <dbl>     <int>  <dbl>                <dbl>
 1     150 Action      7.3        69 515489                 58.5
 2     103 Horror      6.5        72  38498                 NA  
 3     111 Drama       5.5        31  16675                 NA  
 4      89 Thriller    4          NA    270                 NA  
 5     114 Drama       3.3        NA  91529                 NA  
 6     151 Action      5.4        60 272157                 46.4
 7     100 Drama       6.5        52   8263                 NA  
 8      97 Drama       8.2        88 159260                 NA  
 9     123 Action      6.6        68 197916                 NA  
10     113 Crime       7.5        73 178206                 NA  
   Director           Actor                
   <chr>              <chr>                
 1 Christopher Nolan  John David Washington
 2 Brandon Cronenberg Andrea Riseborough   
 3 Vaughn Stein       Lily Collins         
 4 Silvio Nacucchi    Sadie Katz           
 5 Barbara Bialowas   Anna-Maria Sieklucka 
 6 Patty Jenkins      Gal Gadot            
 7 Rodrigo García     Mila Kunis           
 8 Florian Zeller     Anthony Hopkins      
 9 Harry Bradbeer     Millie Bobby Brown   
10 Emerald Fennell    Carey Mulligan       
# … with 90 more rows

How many top 100 movies are in each genre? (Be careful with interpretation.)

movies %>%
  ggplot() +
  geom_bar(mapping = aes(x = Genre))

Which genre has the highest average gross earnings? (Gross earnings are revenue, not profit.)

movies %>%
  group_by(Genre) %>%
  summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm = TRUE)) %>%
  ggplot() +
  geom_col(mapping = aes(x = Genre, y = avg_earning)) + 
  labs(y = "avg earning in millions")

Warning: Removed 6 rows containing missing values (`position_stack()`).
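The warning above is what one would expect if 6 genres have no gross figures at all: with `na.rm = TRUE`, the mean of a group whose values are all `NA` is `NaN`, and such rows cannot be drawn. A minimal sketch in base R on a made-up toy data frame (the data and names are assumptions for illustration) shows the effect and reports how many movies actually contribute to each average.

```r
# Toy data: one genre has no gross figures at all
toy <- data.frame(
  Genre = c("Action", "Action", "Drama", "Drama", "Horror"),
  Gross = c(58.46, 46.37, 160.07, NA, NA)
)

# Group means with missing values dropped; an all-NA group yields NaN
avg <- tapply(toy$Gross, toy$Genre, mean, na.rm = TRUE)

# Number of movies that actually contribute to each mean
n_obs <- tapply(!is.na(toy$Gross), toy$Genre, sum)

data.frame(Genre = names(avg),
           avg_earning = as.numeric(avg),
           n_with_gross = as.integer(n_obs))
```

Reporting the count next to each average makes it easier to see when a bar rests on only one or two movies.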

ggplot(data = movies) +
  geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) + 
  labs(y = "Gross earning in millions")

Warning: Removed 82 rows containing non-finite values (`stat_boxplot()`).

Is there a relationship between gross earnings and rating? Find the best-selling movie (by gross earnings) in each genre.

library("ggrepel")
(best_in_genre <- movies %>%
    group_by(Genre) %>%
    filter(row_number(desc(Gross_Earning_in_Mil)) == 1)) %>%
  print(width = Inf)

# A tibble: 5 × 11
# Groups:   Genre [5]
   Rank Title                
  <int> <chr>                
1    12 A Quiet Place Part II
2    19 Bad Boys for Life    
3    78 Onward               
4    81 Dolittle             
5    92 Fantasy Island       
  Description                                                                   
  <chr>                                                                         
1 Following the events at home, the Abbott family now face the terrors of the o…
2 Miami detectives Mike Lowrey and Marcus Burnett must face off against a mothe…
3 Two elven brothers embark on a quest to bring their father back for one day.  
4 A physician who can talk to animals embarks on an adventure to find a legenda…
5 When the owner and operator of a luxurious island invites a collection of gue…
  Runtime Genre     Rating Metascore  Votes Gross_Earning_in_Mil Director      
    <int> <chr>      <dbl>     <int>  <dbl>                <dbl> <chr>         
1      97 Drama        7.2        71 236703                160.  John Krasinski
2     124 Action       6.5        59 163610                206.  Adil El Arbi  
3     102 Animation    7.4        61 153020                 61.6 Dan Scanlon   
4     101 Adventure    5.6        26  66091                 77.0 Stephen Gaghan
5     109 Fantasy      4.9        22  53978                 27.3 Jeff Wadlow   
  Actor            
  <chr>            
1 Emily Blunt      
2 Will Smith       
3 Tom Holland      
4 Robert Downey Jr.
5 Michael Peña

ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
  geom_point(mapping = aes(size = Votes, color = Genre)) + 
  ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
  labs(y = "Gross earning in millions")

Warning: Removed 82 rows containing missing values (`geom_point()`).
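A scatter plot can be complemented by a numerical summary. The sketch below, in base R, computes Pearson and Spearman correlations while dropping incomplete pairs. It uses only six rating and gross pairs copied from the tibble outputs above (ranks 1, 2, 3, 6, 12, and 19), so it illustrates the mechanics rather than supporting any conclusion. With the full data one would pass `movies$Rating` and `movies$Gross_Earning_in_Mil` instead.

```r
# Six (rating, gross) pairs copied from the printed tibbles above
rating <- c(7.3, 6.5, 5.5, 5.4, 7.2, 6.5)
gross  <- c(58.46, NA, NA, 46.37, 160.07, 206.31)

# Number of movies with both values observed
n_complete <- sum(complete.cases(rating, gross))

# Correlations computed on complete pairs only
r   <- cor(rating, gross, use = "complete.obs")
rho <- cor(rating, gross, use = "complete.obs", method = "spearman")

c(n = n_complete, pearson = r, spearman = rho)
```

With 82 of 100 gross values missing, any such estimate describes only the movies that report gross earnings.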

4 Example: Scraping finance data

The quantmod package contains many utility functions for retrieving and plotting financial data, e.g.,

library(quantmod)
stock <- getSymbols("TSLA", src = "yahoo", auto.assign = FALSE, from = "2020-01-01")
head(stock)

           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02  28.30000  28.71333 28.11400   28.68400   142981500      28.68400
2020-01-03  29.36667  30.26667 29.12800   29.53400   266677500      29.53400
2020-01-06  29.36467  30.10400 29.33333   30.10267   151995000      30.10267
2020-01-07  30.76000  31.44200 30.22400   31.27067   268231500      31.27067
2020-01-08  31.58000  33.23267 31.21533   32.80933   467164500      32.80933
2020-01-09  33.14000  33.25333 31.52467   32.08933   426606000      32.08933

chartSeries(stock, theme = chartTheme("white"),
            type = "line", log.scale = FALSE, TA = NULL)
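Once prices are in R, a common next step is to convert them to returns. The following minimal sketch uses base R and the six closing prices printed by `head(stock)` above, so it runs without a network connection. The variable names are introduced here for illustration.

```r
# Closing prices copied from the head(stock) output above
close <- c(28.68400, 29.53400, 30.10267, 31.27067, 32.80933, 32.08933)

# Arithmetic (simple) daily returns: P_t / P_{t-1} - 1
simple_ret <- close[-1] / close[-length(close)] - 1

# Log daily returns: log(P_t) - log(P_{t-1})
log_ret <- diff(log(close))

round(cbind(simple_ret, log_ret), 4)
```

On the full `stock` object, quantmod's own helpers such as `Cl()` and `dailyReturn()` serve the same purpose.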

5 Example: Pull tweets into R

Read the blog post: https://towardsdatascience.com/pulling-tweets-into-r-e17d4981cfe2
The twitteR package is useful for pulling tweet text data into R.

library(twitteR) #load package

Step 1: Apply for a Twitter developer account. It takes some time to get approved.
Step 2: Generate and copy the Twitter App Keys.

consumer_key <- 'XXXXXXXXXX'
consumer_secret <- 'XXXXXXXXXX'
access_token <- 'XXXXXXXXXX'
access_secret <- 'XXXXXXXXXX'

Step 3: Set up authentication.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Step 4: Pull tweets

virus <- searchTwitter('#China + #Coronavirus', 
                       n = 1000, 
                       since = '2020-01-01', 
                       retryOnRateLimit = 1e3)
virus_df <- as_tibble(twListToDF(virus))
virus_df %>% print(width = Inf)
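The keys in Step 2 need not be hard-coded in a script that may end up in version control. One option, sketched below in base R, is to read them from environment variables. The helper `get_key()` and the variable name `TWITTER_CONSUMER_KEY` are hypothetical choices for illustration and are not part of twitteR.

```r
# Hypothetical helper: read a key from an environment variable,
# failing loudly if the variable is unset or empty
get_key <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# For illustration only: normally the variable is defined outside the
# script (e.g., in ~/.Renviron), not set in the code itself
Sys.setenv(TWITTER_CONSUMER_KEY = "XXXXXXXXXX")
consumer_key <- get_key("TWITTER_CONSUMER_KEY")
```

The same pattern applies to the other three keys, after which they can be passed to `setup_twitter_oauth()` as in Step 3.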