Web Scraping

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 7, 2023

Display machine information for reproducibility.

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.30     lifecycle_1.0.3   jsonlite_1.8.4    magrittr_2.0.3   
 [5] evaluate_0.18     rlang_1.0.6       stringi_1.7.8     cli_3.4.1        
 [9] rstudioapi_0.14   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[13] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[17] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[21] knitr_1.41       

Load tidyverse and other packages for this lecture.

library("tidyverse")
library("rvest")
library("quantmod")

1 Web scraping

There is a wealth of data on the internet. How do we scrape and analyze it?

2 rvest

rvest is an R package written by Hadley Wickham which makes web scraping easy.
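
As a minimal sketch of the usual rvest workflow (parse HTML, select nodes with a CSS selector, extract their text), the following uses a small inline HTML string instead of a live website. The markup and the class name `title` are made up for illustration.

```r
library(rvest)

# Parse a small, made-up HTML string (it stands in for a downloaded page)
page <- read_html('<html><body>
  <ul>
    <li class="title">Tenet</li>
    <li class="title">Possessor</li>
  </ul>
</body></html>')

# Select all nodes matching a CSS selector
title_nodes <- html_nodes(page, ".title")

# Extract the text inside each selected node
titles <- html_text(title_nodes)
titles
```

The same three steps are applied to a real page in the example that follows.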

3 Example: Scraping from webpage

# Specifying the url for desired website to be scraped
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2020-01-01,2020-12-31&count=100"
# Reading the HTML code from the website
(webpage <- read_html(url))
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...
  • Suppose we want to scrape the following 11 features from this page:
    • Rank
    • Title
    • Description
    • Runtime
    • Genre
    • Rating
    • Metascore
    • Votes
    • Gross_Earning_in_Mil
    • Director
    • Actor

3.1 Rank

  • Use SelectorGadget to highlight the element we want to scrape

  • Use the CSS selector to get the rankings
# Use CSS selectors to scrape the rankings section
(rank_data_html <- html_nodes(webpage, '.text-primary'))
{xml_nodeset (100)}
 [1] <span class="lister-item-index unbold text-primary">1.</span>
 [2] <span class="lister-item-index unbold text-primary">2.</span>
 [3] <span class="lister-item-index unbold text-primary">3.</span>
 [4] <span class="lister-item-index unbold text-primary">4.</span>
 [5] <span class="lister-item-index unbold text-primary">5.</span>
 [6] <span class="lister-item-index unbold text-primary">6.</span>
 [7] <span class="lister-item-index unbold text-primary">7.</span>
 [8] <span class="lister-item-index unbold text-primary">8.</span>
 [9] <span class="lister-item-index unbold text-primary">9.</span>
[10] <span class="lister-item-index unbold text-primary">10.</span>
[11] <span class="lister-item-index unbold text-primary">11.</span>
[12] <span class="lister-item-index unbold text-primary">12.</span>
[13] <span class="lister-item-index unbold text-primary">13.</span>
[14] <span class="lister-item-index unbold text-primary">14.</span>
[15] <span class="lister-item-index unbold text-primary">15.</span>
[16] <span class="lister-item-index unbold text-primary">16.</span>
[17] <span class="lister-item-index unbold text-primary">17.</span>
[18] <span class="lister-item-index unbold text-primary">18.</span>
[19] <span class="lister-item-index unbold text-primary">19.</span>
[20] <span class="lister-item-index unbold text-primary">20.</span>
...
# Convert the ranking data to text
(rank_data <- html_text(rank_data_html))
  [1] "1."   "2."   "3."   "4."   "5."   "6."   "7."   "8."   "9."   "10." 
 [11] "11."  "12."  "13."  "14."  "15."  "16."  "17."  "18."  "19."  "20." 
 [21] "21."  "22."  "23."  "24."  "25."  "26."  "27."  "28."  "29."  "30." 
 [31] "31."  "32."  "33."  "34."  "35."  "36."  "37."  "38."  "39."  "40." 
 [41] "41."  "42."  "43."  "44."  "45."  "46."  "47."  "48."  "49."  "50." 
 [51] "51."  "52."  "53."  "54."  "55."  "56."  "57."  "58."  "59."  "60." 
 [61] "61."  "62."  "63."  "64."  "65."  "66."  "67."  "68."  "69."  "70." 
 [71] "71."  "72."  "73."  "74."  "75."  "76."  "77."  "78."  "79."  "80." 
 [81] "81."  "82."  "83."  "84."  "85."  "86."  "87."  "88."  "89."  "90." 
 [91] "91."  "92."  "93."  "94."  "95."  "96."  "97."  "98."  "99."  "100."
# Turn into numerical values
(rank_data <- as.integer(rank_data))
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

3.2 Title

  • Use SelectorGadget to find the CSS selector .lister-item-header a.
# Using CSS selectors to scrape the title section
(title_data_html <- html_nodes(webpage, '.lister-item-header a'))
{xml_nodeset (100)}
 [1] <a href="/title/tt6723592/?ref_=adv_li_tt">Tenet</a>
 [2] <a href="/title/tt5918982/?ref_=adv_li_tt">Possessor</a>
 [3] <a href="/title/tt7923220/?ref_=adv_li_tt">Inheritance</a>
 [4] <a href="/title/tt7628504/?ref_=adv_li_tt">Megan</a>
 [5] <a href="/title/tt10886166/?ref_=adv_li_tt">365 Days</a>
 [6] <a href="/title/tt7126948/?ref_=adv_li_tt">Wonder Woman 1984</a>
 [7] <a href="/title/tt10344522/?ref_=adv_li_tt">Four Good Days</a>
 [8] <a href="/title/tt10272386/?ref_=adv_li_tt">The Father</a>
 [9] <a href="/title/tt7846844/?ref_=adv_li_tt">Enola Holmes</a>
[10] <a href="/title/tt9620292/?ref_=adv_li_tt">Promising Young Woman</a>
[11] <a href="/title/tt8503618/?ref_=adv_li_tt">Hamilton</a>
[12] <a href="/title/tt8332922/?ref_=adv_li_tt">A Quiet Place Part II</a>
[13] <a href="/title/tt7395114/?ref_=adv_li_tt">The Devil All the Time</a>
[14] <a href="/title/tt2222042/?ref_=adv_li_tt">Love and Monsters</a>
[15] <a href="/title/tt9340860/?ref_=adv_li_tt">Let Him Go</a>
[16] <a href="/title/tt7939766/?ref_=adv_li_tt">I'm Thinking of Ending Things ...
[17] <a href="/title/tt10288566/?ref_=adv_li_tt">Another Round</a>
[18] <a href="/title/tt9214832/?ref_=adv_li_tt">Emma.</a>
[19] <a href="/title/tt1502397/?ref_=adv_li_tt">Bad Boys for Life</a>
[20] <a href="/title/tt10919380/?ref_=adv_li_tt">Freaky</a>
...
# Converting the title data to text
(title_data <- html_text(title_data_html))
  [1] "Tenet"                                          
  [2] "Possessor"                                      
  [3] "Inheritance"                                    
  [4] "Megan"                                          
  [5] "365 Days"                                       
  [6] "Wonder Woman 1984"                              
  [7] "Four Good Days"                                 
  [8] "The Father"                                     
  [9] "Enola Holmes"                                   
 [10] "Promising Young Woman"                          
 [11] "Hamilton"                                       
 [12] "A Quiet Place Part II"                          
 [13] "The Devil All the Time"                         
 [14] "Love and Monsters"                              
 [15] "Let Him Go"                                     
 [16] "I'm Thinking of Ending Things"                  
 [17] "Another Round"                                  
 [18] "Emma."                                          
 [19] "Bad Boys for Life"                              
 [20] "Freaky"                                         
 [21] "Birds of Prey"                                  
 [22] "The Forgotten Battle"                           
 [23] "Nomadland"                                      
 [24] "Palm Springs"                                   
 [25] "The Dry"                                        
 [26] "Extraction"                                     
 [27] "Soul"                                           
 [28] "After We Collided"                              
 [29] "The Old Guard"                                  
 [30] "Come Play"                                      
 [31] "The Babysitter: Killer Queen"                   
 [32] "Greenland"                                      
 [33] "The Hunt"                                       
 [34] "The Midnight Sky"                               
 [35] "Ava"                                            
 [36] "Mulan"                                          
 [37] "Underwater"                                     
 [38] "Greyhound"                                      
 [39] "Black Bear"                                     
 [40] "The Empty Man"                                  
 [41] "Pieces of a Woman"                              
 [42] "Run"                                            
 [43] "The Night House"                                
 [44] "Sonic the Hedgehog"                             
 [45] "The Rental"                                     
 [46] "We Can Be Heroes"                               
 [47] "The Invisible Man"                              
 [48] "The Courier"                                    
 [49] "You Should Have Left"                           
 [50] "The New Mutants"                                
 [51] "Monster Hunter"                                 
 [52] "The Woman in the Window"                        
 [53] "Eurovision Song Contest: The Story of Fire Saga"
 [54] "Wild Mountain Thyme"                            
 [55] "The King of Staten Island"                      
 [56] "The Tax Collector"                              
 [57] "The Wrong Missy"                                
 [58] "The Trial of the Chicago 7"                     
 [59] "I Care a Lot"                                   
 [60] "The Craft: Legacy"                              
 [61] "Boss Level"                                     
 [62] "Unhinged"                                       
 [63] "Mank"                                           
 [64] "Project Power"                                  
 [65] "The Croods: A New Age"                          
 [66] "Becky"                                          
 [67] "The F**k-It List"                               
 [68] "Spenser Confidential"                           
 [69] "Shiva Baby"                                     
 [70] "Big Boys Don't Cry"                             
 [71] "The Call"                                       
 [72] "Rebecca"                                        
 [73] "The Call of the Wild"                           
 [74] "The Witches"                                    
 [75] "The Duke"                                       
 [76] "Minari"                                         
 [77] "Demon Slayer the Movie: Mugen Train"            
 [78] "Onward"                                         
 [79] "Alone"                                          
 [80] "Host"                                           
 [81] "Dolittle"                                       
 [82] "Songbird"                                       
 [83] "Hubie Halloween"                                
 [84] "Trolls World Tour"                              
 [85] "2 Hearts"                                       
 [86] "Riders of Justice"                              
 [87] "News of the World"                              
 [88] "All the Bright Places"                          
 [89] "Borat Subsequent Moviefilm"                     
 [90] "Horizon Line"                                   
 [91] "Run Sweetheart Run"                             
 [92] "Fantasy Island"                                 
 [93] "Ammonite"                                       
 [94] "Eat Wheaties!"                                  
 [95] "Holidate"                                       
 [96] "The Comeback Trail"                             
 [97] "#Alive"                                         
 [98] "Lost Girls and Love Hotels"                     
 [99] "Nowhere Special"                                
[100] "Hillbilly Elegy"                                

3.3 Description

# Using CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))
{xml_nodeset (100)}
 [1] <p class="text-muted">\nArmed with only one word, Tenet, and fighting fo ...
 [2] <p class="text-muted">\nAn agent works for a secretive organization that ...
 [3] <p class="text-muted">\nThe patriarch of a wealthy and powerful family s ...
 [4] <p class="text-muted">\nA hiker finds shelter in a mountain lodge inhabi ...
 [5] <p class="text-muted">\nMassimo is a member of the Sicilian Mafia family ...
 [6] <p class="text-muted">\nDiana must contend with a work colleague, and wi ...
 [7] <p class="text-muted">\nA mother helps her daughter work through four cr ...
 [8] <p class="text-muted">\nA man refuses all assistance from his daughter a ...
 [9] <p class="text-muted">\nWhen Enola Holmes (Sherlock's teen sister) disco ...
[10] <p class="text-muted">\nA young woman, traumatized by a tragic event in  ...
[11] <p class="text-muted">\nThe real life of one of America's foremost found ...
[12] <p class="text-muted">\nFollowing the events at home, the Abbott family  ...
[13] <p class="text-muted">\nSinister characters converge around a young man  ...
[14] <p class="text-muted">\nSeven years after he survived the monster apocal ...
[15] <p class="text-muted">\nA retired sheriff and his wife, grieving over th ...
[16] <p class="text-muted">\nFull of misgivings, a young woman travels with h ...
[17] <p class="text-muted">\nFour high-school teachers consume alcohol on a d ...
[18] <p class="text-muted">\nIn 1800s England, a well meaning but selfish you ...
[19] <p class="text-muted">\nMiami detectives Mike Lowrey and Marcus Burnett  ...
[20] <p class="text-muted">\nAfter swapping bodies with a deranged serial kil ...
...
# Converting the description data to text
description_data <- html_text(description_data_html)
# take a look at first few
head(description_data)
[1] "\nArmed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
[2] "\nAn agent works for a secretive organization that uses brain-implant technology to inhabit other people's bodies - ultimately driving them to commit assassinations for high-paying clients."                                          
[3] "\nThe patriarch of a wealthy and powerful family suddenly passes away, leaving his daughter with a shocking secret inheritance that threatens to unravel and destroy the family."                                                       
[4] "\nA hiker finds shelter in a mountain lodge inhabited by two strange women."                                                                                                                                                            
[5] "\nMassimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
[6] "\nDiana must contend with a work colleague, and with a businessman whose desire for extreme wealth sends the world down a path of destruction, after an ancient artifact that grants wishes goes missing."                              
# strip the '\n'
description_data <- str_replace(description_data, "^\\n", "")
head(description_data)
[1] "Armed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
[2] "An agent works for a secretive organization that uses brain-implant technology to inhabit other people's bodies - ultimately driving them to commit assassinations for high-paying clients."                                          
[3] "The patriarch of a wealthy and powerful family suddenly passes away, leaving his daughter with a shocking secret inheritance that threatens to unravel and destroy the family."                                                       
[4] "A hiker finds shelter in a mountain lodge inhabited by two strange women."                                                                                                                                                            
[5] "Massimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
[6] "Diana must contend with a work colleague, and with a businessman whose desire for extreme wealth sends the world down a path of destruction, after an ancient artifact that grants wishes goes missing."                              
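
As an alternative sketch, `html_text()` has a `trim` argument that removes leading and trailing whitespace while extracting, so no separate replacement step is needed. The HTML string below is made up to imitate one description paragraph.

```r
library(rvest)

# Made-up fragment imitating a description paragraph with a leading newline
page <- read_html('<html><body><p class="text-muted">
A hiker finds shelter in a mountain lodge.</p></body></html>')

desc_node <- html_nodes(page, ".text-muted")

# trim = TRUE drops leading and trailing whitespace, including the newline
desc <- html_text(desc_node, trim = TRUE)
desc
```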

3.4 Runtime

  • Retrieve runtime data
# Using CSS selectors to scrape the movie runtime section
(runtime_data <- webpage %>%
  html_nodes('.runtime') %>%
  html_text() %>%
  str_replace(" min", "") %>%
  as.integer())
  [1] 150 103 111  89 114 151 100  97 123 113 160  97 138 109 113 134 117 124
 [19] 124 102 109 124 107  90 117 116 100 105 125  96 101 119  90 118  96 115
 [37]  95  91 104 137 126  90 107  99  88 100 124 112  93  94 103 100 123 102
 [55] 136  95  90 129 118  97 100  90 131 113  95  93 103 111  77  90 112 123
 [73] 100 106  95 115 117 102  98  57 101  84 103  91 101 116 118 107  95  92
 [91] 104 109 120  88 104 104  98  97  96 116

3.5 Genre

  • Collect the (first) genre of each movie:
genre_data <- webpage %>%
  # Using CSS selectors to scrape the movie genre section
  html_nodes('.genre') %>%
  # Converting the genre data to text
  html_text() %>%
  # Data-Preprocessing: retrieve the first word
  str_extract("[:alpha:]+")
genre_data
  [1] "Action"    "Horror"    "Drama"     "Thriller"  "Drama"     "Action"   
  [7] "Drama"     "Drama"     "Action"    "Crime"     "Biography" "Drama"    
 [13] "Crime"     "Action"    "Crime"     "Drama"     "Comedy"    "Comedy"   
 [19] "Action"    "Comedy"    "Action"    "Drama"     "Drama"     "Comedy"   
 [25] "Crime"     "Action"    "Animation" "Drama"     "Action"    "Drama"    
 [31] "Comedy"    "Action"    "Action"    "Adventure" "Action"    "Action"   
 [37] "Action"    "Action"    "Comedy"    "Drama"     "Drama"     "Mystery"  
 [43] "Horror"    "Action"    "Drama"     "Action"    "Drama"     "Drama"    
 [49] "Horror"    "Action"    "Action"    "Crime"     "Comedy"    "Comedy"   
 [55] "Comedy"    "Action"    "Comedy"    "Drama"     "Comedy"    "Drama"    
 [61] "Action"    "Action"    "Biography" "Action"    "Animation" "Action"   
 [67] "Comedy"    "Action"    "Comedy"    "Biography" "Crime"     "Drama"    
 [73] "Adventure" "Adventure" "Biography" "Drama"     "Animation" "Animation"
 [79] "Drama"     "Horror"    "Adventure" "Drama"     "Comedy"    "Animation"
 [85] "Drama"     "Action"    "Action"    "Drama"     "Comedy"    "Action"   
 [91] "Horror"    "Fantasy"   "Biography" "Comedy"    "Comedy"    "Comedy"   
 [97] "Action"    "Drama"     "Drama"     "Drama"    
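
The sketch below shows what the pattern `[:alpha:]+` does on raw genre text, and how all genres of a movie could be kept instead of only the first. The two input strings are made up; they assume the raw text is a comma-separated list padded with whitespace.

```r
library(stringr)

# Made-up raw genre strings in the style returned by html_text()
raw_genre <- c("\nAction, Sci-Fi, Thriller            ", "\nDrama            ")

# First genre only: the first run of letters in each string
first_genre <- str_extract(raw_genre, "[:alpha:]+")

# All genres: trim surrounding whitespace, then split on commas
all_genres <- str_split(str_trim(raw_genre), ",\\s*")

first_genre
all_genres
```

`str_split()` returns a list with one character vector per movie, which fits a list-column if every genre should be kept.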

3.6 Rating

  • Rating data:
rating_data <- webpage %>%
  html_nodes('.ratings-imdb-rating strong') %>%
  html_text() %>%
  as.numeric()
rating_data
  [1] 7.3 6.5 5.5 4.0 3.3 5.4 6.5 8.2 6.6 7.5 8.4 7.2 7.1 6.9 6.7 6.6 7.7 6.7
 [19] 6.5 6.3 6.0 7.1 7.3 7.4 6.8 6.7 8.0 5.0 6.6 5.7 5.8 6.4 6.5 5.6 5.4 5.7
 [37] 5.8 7.0 6.5 6.2 7.0 6.7 6.5 6.5 5.7 4.7 7.1 7.2 5.4 5.3 5.2 5.7 6.5 5.7
 [55] 7.1 4.8 5.7 7.7 6.3 4.5 6.8 6.0 6.8 6.0 6.9 5.9 5.2 6.2 7.1 6.8 7.1 6.0
 [73] 6.7 5.3 6.9 7.4 8.2 7.4 6.2 6.5 5.6 4.8 5.2 6.1 6.3 7.5 6.8 6.5 6.6 4.8
 [91] 5.4 4.9 6.5 6.1 6.1 5.7 6.3 4.7 7.4 6.7

3.7 Votes

  • Vote data
votes_data <- webpage %>%
  html_nodes('.sort-num_votes-visible span:nth-child(2)') %>%
  html_text() %>% 
  str_replace_all(",", "") %>% 
  as.numeric()
votes_data
  [1] 515489  38498  16675    270  91529 272157   8263 159260 197916 178206
 [11]  97436 236703 138538 130874  29406  89879 165513  56282 163610  61483
 [21] 243352  31985 164982 163411  28171 207560 335694  33277 168794  14942
 [31]  43590 120188 117075  84778  57701 150284  85425 100136  13420  31591
 [41]  51289  82674  56616 142842  33628  15440 232283  64139  22520  81971
 [51]  61116  77062  95946  10150  70509  14094  41019 180349 135068  14453
 [61]  70237  70399  76914  90844  45677  17716   7133  90566  23480    918
 [71]  34879  43284  51128  42569  10117  83854  60368 153020  21933  32248
 [81]  66091  10625  52405  24394   6264  52929  88692  33399 143296   9987
 [91]   5019  53978  19403    766  70136   9729  41618   4742   4809  43296
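
Vote counts of a million or more contain more than one comma. The sketch below compares three ways of converting such strings; the second input value is made up for illustration.

```r
library(stringr)
library(readr)

votes_text <- c("515,489", "1,234,567")  # the second value is made up

# str_replace() removes only the first comma, so the second value becomes NA
first_only <- suppressWarnings(as.numeric(str_replace(votes_text, ",", "")))

# str_replace_all() removes every comma
all_commas <- as.numeric(str_replace_all(votes_text, ",", ""))

# parse_number() from readr treats "," as a grouping mark by default
parsed <- parse_number(votes_text)

first_only
all_commas
parsed
```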

3.8 Director

  • Director information
directors_data <- webpage %>% 
  html_nodes('.text-muted+ p a:nth-child(1)') %>% 
  html_text()
directors_data
  [1] "Christopher Nolan"           "Brandon Cronenberg"         
  [3] "Vaughn Stein"                "Silvio Nacucchi"            
  [5] "Barbara Bialowas"            "Patty Jenkins"              
  [7] "Rodrigo García"              "Florian Zeller"             
  [9] "Harry Bradbeer"              "Emerald Fennell"            
 [11] "Thomas Kail"                 "John Krasinski"             
 [13] "Antonio Campos"              "Michael Matthews"           
 [15] "Thomas Bezucha"              "Charlie Kaufman"            
 [17] "Thomas Vinterberg"           "Autumn de Wilde"            
 [19] "Adil El Arbi"                "Christopher Landon"         
 [21] "Cathy Yan"                   "Matthijs van Heijningen Jr."
 [23] "Chloé Zhao"                  "Max Barbakow"               
 [25] "Robert Connolly"             "Sam Hargrave"               
 [27] "Pete Docter"                 "Roger Kumble"               
 [29] "Gina Prince-Bythewood"       "Jacob Chase"                
 [31] "McG"                         "Ric Roman Waugh"            
 [33] "Craig Zobel"                 "George Clooney"             
 [35] "Tate Taylor"                 "Niki Caro"                  
 [37] "William Eubank"              "Aaron Schneider"            
 [39] "Lawrence Michael Levine"     "David Prior"                
 [41] "Kornél Mundruczó"            "Aneesh Chaganty"            
 [43] "David Bruckner"              "Jeff Fowler"                
 [45] "Dave Franco"                 "Robert Rodriguez"           
 [47] "Leigh Whannell"              "Dominic Cooke"              
 [49] "David Koepp"                 "Josh Boone"                 
 [51] "Paul W.S. Anderson"          "Joe Wright"                 
 [53] "David Dobkin"                "John Patrick Shanley"       
 [55] "Judd Apatow"                 "David Ayer"                 
 [57] "Tyler Spindel"               "Aaron Sorkin"               
 [59] "J Blakeson"                  "Zoe Lister-Jones"           
 [61] "Joe Carnahan"                "Derrick Borte"              
 [63] "David Fincher"               "Henry Joost"                
 [65] "Joel Crawford"               "Jonathan Milott"            
 [67] "Michael Duggan"              "Peter Berg"                 
 [69] "Emma Seligman"               "Steve Crowhurst"            
 [71] "Chung-Hyun Lee"              "Ben Wheatley"               
 [73] "Chris Sanders"               "Robert Zemeckis"            
 [75] "Roger Michell"               "Lee Isaac Chung"            
 [77] "Haruo Sotozaki"              "Dan Scanlon"                
 [79] "John Hyams"                  "Rob Savage"                 
 [81] "Stephen Gaghan"              "Adam Mason"                 
 [83] "Steven Brill"                "Walt Dohrn"                 
 [85] "Lance Hool"                  "Anders Thomas Jensen"       
 [87] "Paul Greengrass"             "Brett Haley"                
 [89] "Jason Woliner"               "Mikael Marcimain"           
 [91] "Shana Feste"                 "Jeff Wadlow"                
 [93] "Francis Lee"                 "Scott Abramovitch"          
 [95] "John Whitesell"              "George Gallo"               
 [97] "Il Cho"                      "William Olsson"             
 [99] "Uberto Pasolini"             "Ron Howard"                 

3.9 Actor

  • Only the first actor
actors_data <- webpage %>%
  html_nodes('.lister-item-content .ghost+ a') %>%
  html_text()
actors_data
  [1] "John David Washington" "Andrea Riseborough"    "Lily Collins"         
  [4] "Sadie Katz"            "Anna-Maria Sieklucka"  "Gal Gadot"            
  [7] "Mila Kunis"            "Anthony Hopkins"       "Millie Bobby Brown"   
 [10] "Carey Mulligan"        "Lin-Manuel Miranda"    "Emily Blunt"          
 [13] "Bill Skarsgård"        "Dylan O'Brien"         "Diane Lane"           
 [16] "Jesse Plemons"         "Mads Mikkelsen"        "Anya Taylor-Joy"      
 [19] "Will Smith"            "Vince Vaughn"          "Margot Robbie"        
 [22] "Gijs Blom"             "Frances McDormand"     "Andy Samberg"         
 [25] "Eric Bana"             "Chris Hemsworth"       "Jamie Foxx"           
 [28] "Josephine Langford"    "Charlize Theron"       "Azhy Robertson"       
 [31] "Judah Lewis"           "Gerard Butler"         "Betty Gilpin"         
 [34] "George Clooney"        "Jessica Chastain"      "Liu Yifei"            
 [37] "Kristen Stewart"       "Tom Hanks"             "Aubrey Plaza"         
 [40] "James Badge Dale"      "Vanessa Kirby"         "Sarah Paulson"        
 [43] "Rebecca Hall"          "Ben Schwartz"          "Dan Stevens"          
 [46] "YaYa Gosselin"         "Elisabeth Moss"        "Benedict Cumberbatch" 
 [49] "Kevin Bacon"           "Maisie Williams"       "Milla Jovovich"       
 [52] "Amy Adams"             "Will Ferrell"          "Emily Blunt"          
 [55] "Pete Davidson"         "Bobby Soto"            "David Spade"          
 [58] "Eddie Redmayne"        "Rosamund Pike"         "Cailee Spaeny"        
 [61] "Frank Grillo"          "Russell Crowe"         "Gary Oldman"          
 [64] "Jamie Foxx"            "Nicolas Cage"          "Lulu Wilson"          
 [67] "Eli Brown"             "Mark Wahlberg"         "Rachel Sennott"       
 [70] "Daniel Adegboyega"     "Park Shin-Hye"         "Lily James"           
 [73] "Harrison Ford"         "Anne Hathaway"         "Jim Broadbent"        
 [76] "Steven Yeun"           "Natsuki Hanae"         "Tom Holland"          
 [79] "Jules Willcox"         "Haley Bishop"          "Robert Downey Jr."    
 [82] "K.J. Apa"              "Adam Sandler"          "Anna Kendrick"        
 [85] "Jacob Elordi"          "Mads Mikkelsen"        "Tom Hanks"            
 [88] "Elle Fanning"          "Sacha Baron Cohen"     "Allison Williams"     
 [91] "Ella Balinska"         "Michael Peña"          "Kate Winslet"         
 [94] "Tony Hale"             "Emma Roberts"          "Robert De Niro"       
 [97] "Yoo Ah-in"             "Alexandra Daddario"    "James Norton"         
[100] "Amy Adams"            

3.10 Metascore

  • We encounter the issue of missing data when scraping metascore.

  • We see there are only 89 metascores; 11 movies don’t have one. We could find the movies without a metascore by hand, but that is tedious and not reproducible.

# Using CSS selectors to scrape the metascore section
ms_data_html <- html_nodes(webpage, '.metascore')
# Converting the metascore data to text
ms_data <- html_text(ms_data_html)
# Strip trailing whitespace and convert to integers
ms_data <- str_replace(ms_data, "\\s*$", "") %>% as.integer()
ms_data
 [1] 69 72 31 60 52 88 68 73 89 71 55 63 63 78 79 71 59 67 60 91 83 69 56 83 14
[26] 70 58 22 64 50 58 39 66 48 64 79 66 67 68 47 62 51 72 65 46 43 47 41 50 42
[51] 67 22 33 76 66 54 56 40 79 51 56 54 49 79 46 48 47 74 89 72 61 70 73 26 27
[76] 53 51 29 81 73 61 68 51 22 72 40 44 57 38
  • First, let’s list each rank followed by its metascore (if present).
rank_and_metascore <- webpage %>%
  html_nodes('.unfavorable , .text-primary , .favorable , .mixed') %>%
  html_text() %>%
  str_replace("\\s*$", "") %>%
  print()
  [1] "1."   "69"   "2."   "72"   "3."   "31"   "4."   "5."   "6."   "60"  
 [11] "7."   "52"   "8."   "88"   "9."   "68"   "10."  "73"   "11."  "89"  
 [21] "12."  "71"   "13."  "55"   "14."  "63"   "15."  "63"   "16."  "78"  
 [31] "17."  "79"   "18."  "71"   "19."  "59"   "20."  "67"   "21."  "60"  
 [41] "22."  "23."  "91"   "24."  "83"   "25."  "69"   "26."  "56"   "27." 
 [51] "83"   "28."  "14"   "29."  "70"   "30."  "58"   "31."  "22"   "32." 
 [61] "64"   "33."  "50"   "34."  "58"   "35."  "39"   "36."  "66"   "37." 
 [71] "48"   "38."  "64"   "39."  "79"   "40."  "41."  "66"   "42."  "67"  
 [81] "43."  "68"   "44."  "47"   "45."  "62"   "46."  "51"   "47."  "72"  
 [91] "48."  "65"   "49."  "46"   "50."  "43"   "51."  "47"   "52."  "41"  
[101] "53."  "50"   "54."  "42"   "55."  "67"   "56."  "22"   "57."  "33"  
[111] "58."  "76"   "59."  "66"   "60."  "54"   "61."  "56"   "62."  "40"  
[121] "63."  "79"   "64."  "51"   "65."  "56"   "66."  "54"   "67."  "68." 
[131] "49"   "69."  "79"   "70."  "71."  "72."  "46"   "73."  "48"   "74." 
[141] "47"   "75."  "74"   "76."  "89"   "77."  "72"   "78."  "61"   "79." 
[151] "70"   "80."  "73"   "81."  "26"   "82."  "27"   "83."  "53"   "84." 
[161] "51"   "85."  "29"   "86."  "81"   "87."  "73"   "88."  "61"   "89." 
[171] "68"   "90."  "91."  "51"   "92."  "22"   "93."  "72"   "94."  "40"  
[181] "95."  "44"   "96."  "97."  "98."  "57"   "99."  "100." "38"  
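# An entry is a rank if it ends with a period
isrank <- str_detect(rank_and_metascore, "\\.$")
# A rank immediately followed by another rank has no metascore
ismissing <- isrank[1:(length(rank_and_metascore) - 1)] & isrank[2:(length(rank_and_metascore))]
# A rank in the last position has no metascore either
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
# Ranks of the movies without a metascore
missingpos <- as.integer(rank_and_metascore[ismissing])
# Fill in the scraped metascores, leaving NA at the missing ranks
metascore_data <- rep(NA, 100)
metascore_data[-missingpos] <- ms_data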
metascore_data
  [1] 69 72 31 NA NA 60 52 88 68 73 89 71 55 63 63 78 79 71 59 67 60 NA 91 83 69
 [26] 56 83 14 70 58 22 64 50 58 39 66 48 64 79 NA 66 67 68 47 62 51 72 65 46 43
 [51] 47 41 50 42 67 22 33 76 66 54 56 40 79 51 56 54 NA 49 79 NA NA 46 48 47 74
 [76] 89 72 61 70 73 26 27 53 51 29 81 73 61 68 NA 51 22 72 40 44 NA NA 57 NA 38
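
To make the adjacency logic easier to follow, here is a toy version on a made-up vector of four movies, two of which have no value. The names `rank_and_value`, `values`, and `n` are introduced only for this sketch. It fills positions with `setdiff()` rather than negative indexing, because negative indexing with an empty vector selects nothing when no entry is missing.

```r
library(stringr)

# Made-up interleaved vector: ranks end with ".", other entries are values
rank_and_value <- c("1.", "69", "2.", "3.", "31", "4.")
values <- c(69L, 31L)  # the values as scraped on their own
n <- 4                 # number of movies

isrank <- str_detect(rank_and_value, "\\.$")
# A rank followed directly by another rank has no value
ismissing <- isrank[-length(isrank)] & isrank[-1]
# A rank in the last position has no value either
ismissing <- c(ismissing, isrank[length(isrank)])
# Ranks of the movies without a value
missingpos <- as.integer(rank_and_value[ismissing])

result <- rep(NA_integer_, n)
# Assign values to every rank that is not missing
result[setdiff(seq_len(n), missingpos)] <- values
result
```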

3.11 Gross

  • Be careful with missing data.
# Using CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
# Converting the gross revenue data to text
gross_data <- html_text(gross_data_html)
# Let's have a look at the gross data
gross_data
 [1] "$58.46M"  "$46.37M"  "$160.07M" "$1.07M"   "$206.31M" "$84.16M" 
 [7] "$2.39M"   "$0.50M"   "$17.29M"  "$148.97M" "$70.41M"  "$15.16M" 
[13] "$58.57M"  "$62.34M"  "$47.70M"  "$61.56M"  "$77.05M"  "$27.31M" 
# Data-Preprocessing: removing '$' and 'M' signs
gross_data <- gross_data %>%
  str_replace("M", "") %>%
  str_sub(2, 10) %>%
  as.numeric()
# Let's have a look at the cleaned gross data
gross_data
 [1]  58.46  46.37 160.07   1.07 206.31  84.16   2.39   0.50  17.29 148.97
[11]  70.41  15.16  58.57  62.34  47.70  61.56  77.05  27.31

82 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.

(rank_and_gross <- webpage %>%
   # retrieve rank and gross
   html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
   html_text() %>%
   str_replace("\\s+", "") %>%
   str_replace_all("[$M]", ""))
  [1] "1."     "58.46"  "2."     "3."     "4."     "5."     "6."     "46.37" 
  [9] "7."     "8."     "9."     "10."    "11."    "12."    "160.07" "13."   
 [17] "14."    "1.07"   "15."    "16."    "17."    "18."    "19."    "206.31"
 [25] "20."    "21."    "84.16"  "22."    "23."    "24."    "25."    "26."   
 [33] "27."    "28."    "2.39"   "29."    "30."    "31."    "32."    "33."   
 [41] "34."    "35."    "0.50"   "36."    "37."    "17.29"  "38."    "39."   
 [49] "40."    "41."    "42."    "43."    "44."    "148.97" "45."    "46."   
 [57] "47."    "70.41"  "48."    "49."    "50."    "51."    "15.16"  "52."   
 [65] "53."    "54."    "55."    "56."    "57."    "58."    "59."    "60."   
 [73] "61."    "62."    "63."    "64."    "65."    "58.57"  "66."    "67."   
 [81] "68."    "69."    "70."    "71."    "72."    "73."    "62.34"  "74."   
 [89] "75."    "76."    "77."    "47.70"  "78."    "61.56"  "79."    "80."   
 [97] "81."    "77.05"  "82."    "83."    "84."    "85."    "86."    "87."   
[105] "88."    "89."    "90."    "91."    "92."    "27.31"  "93."    "94."   
[113] "95."    "96."    "97."    "98."    "99."    "100."  
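# Rank entries end with a period, e.g. "12."
isrank <- str_detect(rank_and_gross, "\\.$")
# A rank followed directly by another rank has no gross entry
ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:(length(rank_and_gross))]
# The last entry lacks a gross value if it is itself a rank
ismissing[length(ismissing)+1] <- isrank[length(isrank)]
# Ranks (= row positions) of the movies without gross data
missingpos <- as.integer(rank_and_gross[ismissing])
gs_data <- rep(NA, 100)
# Fill the scraped values into the non-missing positions
gs_data[-missingpos] <- gross_data
(gross_data <- gs_data)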
  [1]  58.46     NA     NA     NA     NA  46.37     NA     NA     NA     NA
 [11]     NA 160.07     NA   1.07     NA     NA     NA     NA 206.31     NA
 [21]  84.16     NA     NA     NA     NA     NA     NA   2.39     NA     NA
 [31]     NA     NA     NA     NA   0.50     NA  17.29     NA     NA     NA
 [41]     NA     NA     NA 148.97     NA     NA  70.41     NA     NA     NA
 [51]  15.16     NA     NA     NA     NA     NA     NA     NA     NA     NA
 [61]     NA     NA     NA     NA  58.57     NA     NA     NA     NA     NA
 [71]     NA     NA  62.34     NA     NA     NA  47.70  61.56     NA     NA
 [81]  77.05     NA     NA     NA     NA     NA     NA     NA     NA     NA
 [91]     NA  27.31     NA     NA     NA     NA     NA     NA     NA     NA
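
The alignment above can also be done one movie at a time. A minimal sketch, assuming rvest 1.0 or later: html_element() (singular) returns exactly one result per input node, with a missing node where the selector matches nothing, so html_text() yields NA in the right position. The toy HTML and the class names movie, rank, and gross are made up for illustration; on the real page one would substitute the selector of each movie's container.

```r
library(rvest)

# Toy HTML standing in for a listing page (hypothetical class names).
toy_page <- read_html('
  <div class="movie"><span class="rank">1.</span><span class="gross">$58.46M</span></div>
  <div class="movie"><span class="rank">2.</span></div>
  <div class="movie"><span class="rank">3.</span><span class="gross">$160.07M</span></div>
')

# One node per movie
items <- html_elements(toy_page, ".movie")

# html_element() keeps one slot per movie; movies without a gross give NA
gross_txt <- html_text(html_element(items, ".gross"))

# Strip the '$' and 'M' signs and convert to numeric (NA stays NA)
toy_gross <- as.numeric(gsub("[$M]", "", gross_txt))
toy_gross
```

Because the result has the same length as items, it can go straight into a tibble column without any bookkeeping of missing positions.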

3.12 Visualizing movie data

  • Form a tibble:
# Combine all the vectors into a tibble
movies <- tibble(Rank = rank_data, 
                 Title = title_data,
                 Description = description_data, 
                 Runtime = runtime_data,
                 Genre = genre_data, 
                 Rating = rating_data,
                 Metascore = metascore_data, 
                 Votes = votes_data,
                 Gross_Earning_in_Mil = gross_data,
                 Director = directors_data, 
                 Actor = actors_data)
movies %>% print(width=Inf)
# A tibble: 100 × 11
    Rank Title                
   <int> <chr>                
 1     1 Tenet                
 2     2 Possessor            
 3     3 Inheritance          
 4     4 Megan                
 5     5 365 Days             
 6     6 Wonder Woman 1984    
 7     7 Four Good Days       
 8     8 The Father           
 9     9 Enola Holmes         
10    10 Promising Young Woman
   Description                                                                  
   <chr>                                                                        
 1 Armed with only one word, Tenet, and fighting for the survival of the entire…
 2 An agent works for a secretive organization that uses brain-implant technolo…
 3 The patriarch of a wealthy and powerful family suddenly passes away, leaving…
 4 A hiker finds shelter in a mountain lodge inhabited by two strange women.    
 5 Massimo is a member of the Sicilian Mafia family and Laura is a sales direct…
 6 Diana must contend with a work colleague, and with a businessman whose desir…
 7 A mother helps her daughter work through four crucial days of recovery from …
 8 A man refuses all assistance from his daughter as he ages. As he tries to ma…
 9 When Enola Holmes (Sherlock's teen sister) discovers her mother is missing, …
10 A young woman, traumatized by a tragic event in her past, seeks out vengeanc…
   Runtime Genre    Rating Metascore  Votes Gross_Earning_in_Mil
     <int> <chr>     <dbl>     <int>  <dbl>                <dbl>
 1     150 Action      7.3        69 515489                 58.5
 2     103 Horror      6.5        72  38498                 NA  
 3     111 Drama       5.5        31  16675                 NA  
 4      89 Thriller    4          NA    270                 NA  
 5     114 Drama       3.3        NA  91529                 NA  
 6     151 Action      5.4        60 272157                 46.4
 7     100 Drama       6.5        52   8263                 NA  
 8      97 Drama       8.2        88 159260                 NA  
 9     123 Action      6.6        68 197916                 NA  
10     113 Crime       7.5        73 178206                 NA  
   Director           Actor                
   <chr>              <chr>                
 1 Christopher Nolan  John David Washington
 2 Brandon Cronenberg Andrea Riseborough   
 3 Vaughn Stein       Lily Collins         
 4 Silvio Nacucchi    Sadie Katz           
 5 Barbara Bialowas   Anna-Maria Sieklucka 
 6 Patty Jenkins      Gal Gadot            
 7 Rodrigo García     Mila Kunis           
 8 Florian Zeller     Anthony Hopkins      
 9 Harry Bradbeer     Millie Bobby Brown   
10 Emerald Fennell    Carey Mulligan       
# … with 90 more rows
  • How many top 100 movies are in each genre? (Be careful with interpretation.)
movies %>%
  ggplot() +
  geom_bar(mapping = aes(x = Genre))

  • Which genre is most profitable in terms of average gross earnings?
movies %>%
  group_by(Genre) %>%
  summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm = TRUE)) %>%
  ggplot() +
  geom_col(mapping = aes(x = Genre, y = avg_earning)) + 
  labs(y = "avg earning in millions")
Warning: Removed 6 rows containing missing values (`position_stack()`).

ggplot(data = movies) +
  geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) + 
  labs(y = "Gross earning in millions")
Warning: Removed 82 rows containing non-finite values (`stat_boxplot()`).
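
The "Removed 6 rows" warning from geom_col() most likely arises because a genre whose movies all lack gross data has mean NaN once na.rm = TRUE drops every value. The sketch below reproduces this on four rows copied from the table above (the column name Gross is shortened for illustration) and also counts how many movies contribute to each average, which helps when interpreting the bar heights.

```r
library(dplyr)

# Four rows taken from the movies table above
toy <- tibble(
  Title = c("Tenet", "Possessor", "Inheritance", "Wonder Woman 1984"),
  Genre = c("Action", "Horror", "Drama", "Action"),
  Gross = c(58.46, NA, NA, 46.37)
)

toy_summary <- toy %>%
  group_by(Genre) %>%
  summarise(
    n_with_gross = sum(!is.na(Gross)),        # movies that actually report a gross
    avg_earning  = mean(Gross, na.rm = TRUE)  # NaN when n_with_gross is 0
  )
toy_summary
```

Genres with n_with_gross equal to 0 are the ones ggplot2 drops, and an average based on one or two movies should be read with caution.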

  • Is there a relationship between gross earnings and rating? First find the best-selling movie (by gross earnings) in each genre.
library("ggrepel")
(best_in_genre <- movies %>%
    group_by(Genre) %>%
    filter(row_number(desc(Gross_Earning_in_Mil)) == 1)) %>%
  print(width = Inf)
# A tibble: 5 × 11
# Groups:   Genre [5]
   Rank Title                
  <int> <chr>                
1    12 A Quiet Place Part II
2    19 Bad Boys for Life    
3    78 Onward               
4    81 Dolittle             
5    92 Fantasy Island       
  Description                                                                   
  <chr>                                                                         
1 Following the events at home, the Abbott family now face the terrors of the o…
2 Miami detectives Mike Lowrey and Marcus Burnett must face off against a mothe…
3 Two elven brothers embark on a quest to bring their father back for one day.  
4 A physician who can talk to animals embarks on an adventure to find a legenda…
5 When the owner and operator of a luxurious island invites a collection of gue…
  Runtime Genre     Rating Metascore  Votes Gross_Earning_in_Mil Director      
    <int> <chr>      <dbl>     <int>  <dbl>                <dbl> <chr>         
1      97 Drama        7.2        71 236703                160.  John Krasinski
2     124 Action       6.5        59 163610                206.  Adil El Arbi  
3     102 Animation    7.4        61 153020                 61.6 Dan Scanlon   
4     101 Adventure    5.6        26  66091                 77.0 Stephen Gaghan
5     109 Fantasy      4.9        22  53978                 27.3 Jeff Wadlow   
  Actor            
  <chr>            
1 Emily Blunt      
2 Will Smith       
3 Tom Holland      
4 Robert Downey Jr.
5 Michael Peña     
ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
  geom_point(mapping = aes(size = Votes, color = Genre)) + 
  ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
  labs(y = "Gross earning in millions")
Warning: Removed 82 rows containing missing values (`geom_point()`).
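
The best-selling movie per genre was found above with filter(row_number(desc(...)) == 1). An alternative sketch, assuming dplyr 1.0 or later, uses slice_max(). Rows without gross data are filtered out first so that the result does not depend on how slice_max() orders missing values. The five rows are copied from the tables above, with the column name Gross shortened for illustration.

```r
library(dplyr)

# Five rows taken from the tables above
toy <- tibble(
  Title = c("Tenet", "Wonder Woman 1984", "Bad Boys for Life",
            "A Quiet Place Part II", "The Father"),
  Genre = c("Action", "Action", "Action", "Drama", "Drama"),
  Gross = c(58.46, 46.37, 206.31, 160.07, NA)
)

toy_best <- toy %>%
  filter(!is.na(Gross)) %>%                       # drop movies without gross data
  group_by(Genre) %>%
  slice_max(Gross, n = 1, with_ties = FALSE) %>%  # top-grossing row per genre
  ungroup()
toy_best
```

Setting with_ties = FALSE guarantees one row per genre, matching the behaviour of the row_number() version.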

4 Example: Scraping finance data

  • The quantmod package contains many utility functions for retrieving and plotting financial data. For example:
library(quantmod)
stock <- getSymbols("TSLA", src = "yahoo", auto.assign = FALSE, from = "2020-01-01")
head(stock)
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02  28.30000  28.71333 28.11400   28.68400   142981500      28.68400
2020-01-03  29.36667  30.26667 29.12800   29.53400   266677500      29.53400
2020-01-06  29.36467  30.10400 29.33333   30.10267   151995000      30.10267
2020-01-07  30.76000  31.44200 30.22400   31.27067   268231500      31.27067
2020-01-08  31.58000  33.23267 31.21533   32.80933   467164500      32.80933
2020-01-09  33.14000  33.25333 31.52467   32.08933   426606000      32.08933
chartSeries(stock, theme = chartTheme("white"),
            type = "line", log.scale = FALSE, TA = NULL)

5 Example: Pull tweets into R

library(twitteR) # load package
  • Step 1: Apply for a Twitter developer account. Approval takes some time.

  • Step 2: Generate and copy the Twitter app keys.

consumer_key <- 'XXXXXXXXXX'
consumer_secret <- 'XXXXXXXXXX'
access_token <- 'XXXXXXXXXX'
access_secret <- 'XXXXXXXXXX'
  • Step 3: Set up authentication.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
  • Step 4: Pull tweets.
virus <- searchTwitter('#China + #Coronavirus', 
                       n = 1000, 
                       since = '2020-01-01', 
                       retryOnRateLimit = 1e3)
virus_df <- as_tibble(twListToDF(virus))
virus_df %>% print(width = Inf)
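
The keys in Step 2 need not be typed into the script. A minimal sketch, assuming the keys are stored in environment variables (for example in ~/.Renviron): the helper get_key() and the variable name TWITTER_CONSUMER_KEY are hypothetical and not part of twitteR.

```r
# Read a credential from an environment variable, failing loudly if it is unset.
# get_key() is a hypothetical helper written for this illustration.
get_key <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# For demonstration only: normally this is set outside R, e.g. in ~/.Renviron
Sys.setenv(TWITTER_CONSUMER_KEY = "XXXXXXXXXX")

consumer_key <- get_key("TWITTER_CONSUMER_KEY")
```

The other three keys can be read the same way and then passed to setup_twitter_oauth() as in Step 3, which keeps secrets out of version control.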