Web Scraping

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 9, 2024

Display machine information for reproducibility.

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.2       evaluate_0.23    

Load tidyverse and other packages for this lecture.

library("tidyverse")
library("rvest")
library("quantmod")

1 Web scraping

There is a wealth of data on the internet. How do we scrape and analyze it?

2 rvest

rvest is an R package written by Hadley Wickham which makes web scraping easy.
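
As a first taste of the workflow, here is a minimal sketch that parses a small, made-up HTML string instead of a live website. The page content and the `title` class are hypothetical. The pattern is the same one used in the rest of this lecture: read the HTML, select nodes with a CSS selector, then extract the text.

```r
library(rvest)

# A tiny, made-up HTML document (hypothetical content, not from a real site)
page <- read_html(
  "<html><body><h1>Movies</h1><p class='title'>Tenet</p><p class='title'>Soul</p></body></html>"
)

# Select every element with class "title" and extract its text
titles <- page |>
  html_nodes(".title") |>
  html_text()
print(titles)
```

Working on an inline string like this is a convenient way to test selectors without hitting a server repeatedly.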

3 Example: Scraping from webpage

# Specifying the url for desired website to be scraped
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2020-01-01,2020-12-31&count=100"
# Reading the HTML code from the website
(webpage <- read_html(url))
{html_document}
<html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<div>    <img height="1" width="1" style="display:none;visibility ...
  • Suppose we want to scrape the following 8 features from this page:
    • Rank (popularity)
    • Title
    • Description
    • Runtime
    • Film rating
    • User rating
    • Metascore
    • Votes

3.1 Rank and title

  • Use SelectorGadget to find the CSS selector .ipc-title-link-wrapper .ipc-title__text.
# Using CSS selectors to scrape the title section
(title_data_html <- html_nodes(webpage, '.ipc-title-link-wrapper .ipc-title__text'))
{xml_nodeset (100)}
 [1] <h3 class="ipc-title__text">1. The Postcard Killings</h3>
 [2] <h3 class="ipc-title__text">2. Promising Young Woman</h3>
 [3] <h3 class="ipc-title__text">3. 365 Days</h3>
 [4] <h3 class="ipc-title__text">4. Arkansas</h3>
 [5] <h3 class="ipc-title__text">5. Tenet</h3>
 [6] <h3 class="ipc-title__text">6. The Nest</h3>
 [7] <h3 class="ipc-title__text">7. Greyhound</h3>
 [8] <h3 class="ipc-title__text">8. The Hunt</h3>
 [9] <h3 class="ipc-title__text">9. Hamilton</h3>
[10] <h3 class="ipc-title__text">10. The Dry</h3>
[11] <h3 class="ipc-title__text">11. Greenland</h3>
[12] <h3 class="ipc-title__text">12. Another Round</h3>
[13] <h3 class="ipc-title__text">13. Soul</h3>
[14] <h3 class="ipc-title__text">14. The Devil All the Time</h3>
[15] <h3 class="ipc-title__text">15. The Father</h3>
[16] <h3 class="ipc-title__text">16. Birds of Prey</h3>
[17] <h3 class="ipc-title__text">17. Emma.</h3>
[18] <h3 class="ipc-title__text">18. Sonic the Hedgehog</h3>
[19] <h3 class="ipc-title__text">19. After We Collided</h3>
[20] <h3 class="ipc-title__text">20. Love and Monsters</h3>
...
# Converting the title data to text
(ranktitle_data <- html_text(title_data_html))
  [1] "1. The Postcard Killings"                                   
  [2] "2. Promising Young Woman"                                   
  [3] "3. 365 Days"                                                
  [4] "4. Arkansas"                                                
  [5] "5. Tenet"                                                   
  [6] "6. The Nest"                                                
  [7] "7. Greyhound"                                               
  [8] "8. The Hunt"                                                
  [9] "9. Hamilton"                                                
 [10] "10. The Dry"                                                
 [11] "11. Greenland"                                              
 [12] "12. Another Round"                                          
 [13] "13. Soul"                                                   
 [14] "14. The Devil All the Time"                                 
 [15] "15. The Father"                                             
 [16] "16. Birds of Prey"                                          
 [17] "17. Emma."                                                  
 [18] "18. Sonic the Hedgehog"                                     
 [19] "19. After We Collided"                                      
 [20] "20. Love and Monsters"                                      
 [21] "21. Trolls World Tour"                                      
 [22] "22. Palm Springs"                                           
 [23] "23. Enola Holmes"                                           
 [24] "24. I Care a Lot"                                           
 [25] "25. I'm Thinking of Ending Things"                          
 [26] "26. Nomadland"                                              
 [27] "27. The Old Guard"                                          
 [28] "28. Underwater"                                             
 [29] "29. Ava"                                                    
 [30] "30. Alone"                                                  
 [31] "31. Extraction"                                             
 [32] "32. A Quiet Place Part II"                                  
 [33] "33. Minari"                                                 
 [34] "34. The Invisible Man"                                      
 [35] "35. The Unhealer"                                           
 [36] "36. Wonder Woman 1984"                                      
 [37] "37. Run"                                                    
 [38] "38. Run Hide Fight"                                         
 [39] "39. Pieces of a Woman"                                      
 [40] "40. Relic"                                                  
 [41] "41. The Empty Man"                                          
 [42] "42. The New Mutants"                                        
 [43] "43. The Call of the Wild"                                   
 [44] "44. Mulan"                                                  
 [45] "45. The Silencing"                                          
 [46] "46. The Courier"                                            
 [47] "47. Shiva Baby"                                             
 [48] "48. Onward"                                                 
 [49] "49. The Call"                                               
 [50] "50. The Last Champion"                                      
 [51] "51. Eurovision Song Contest: The Story of Fire Saga"        
 [52] "52. Riders of Justice"                                      
 [53] "53. The Trial of the Chicago 7"                             
 [54] "54. The Witches"                                            
 [55] "55. Dolittle"                                               
 [56] "56. Possessor"                                              
 [57] "57. Freaky"                                                 
 [58] "58. Spenser Confidential"                                   
 [59] "59. Bad Boys for Life"                                      
 [60] "60. The Midnight Sky"                                       
 [61] "61. Rebecca"                                                
 [62] "62. Zola"                                                   
 [63] "63. The Wrong Missy"                                        
 [64] "64. The Forgotten Battle"                                   
 [65] "65. You Should Have Left"                                   
 [66] "66. Unhinged"                                               
 [67] "67. The Secret: Dare to Dream"                              
 [68] "68. The Tax Collector"                                      
 [69] "69. Lost Girls and Love Hotels"                             
 [70] "70. The Croods: A New Age"                                  
 [71] "71. The Banker"                                             
 [72] "72. The Night House"                                        
 [73] "73. Mank"                                                   
 [74] "74. Fantasy Island"                                         
 [75] "75. The Rental"                                             
 [76] "76. The King of Staten Island"                              
 [77] "77. The Dark and the Wicked"                                
 [78] "78. Borat Subsequent Moviefilm"                             
 [79] "79. The World to Come"                                      
 [80] "80. Monster Hunter"                                         
 [81] "81. We Can Be Heroes"                                       
 [82] "82. Peninsula"                                              
 [83] "83. Boss Level"                                             
 [84] "84. Palm Swings"                                            
 [85] "85. Hillbilly Elegy"                                        
 [86] "86. News of the World"                                      
 [87] "87. Gretel & Hansel"                                        
 [88] "88. Finding You"                                            
 [89] "89. Nocturne"                                               
 [90] "90. Ammonite"                                               
 [91] "91. Body Cam"                                               
 [92] "92. Inheritance"                                            
 [93] "93. The Babysitter: Killer Queen"                           
 [94] "94. Bloodshot"                                              
 [95] "95. Demon Slayer: Kimetsu no Yaiba - The Movie: Mugen Train"
 [96] "96. Project Power"                                          
 [97] "97. Becky"                                                  
 [98] "98. French Exit"                                            
 [99] "99. Simple Passion"                                         
[100] "100. #Alive"                                                
# rank
rank_data <- str_extract(ranktitle_data, "^[0-9]+") |> as.integer() |> print()
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
# title
title_data <- str_remove(ranktitle_data, "^[0-9]+\\. ") |> print()
  [1] "The Postcard Killings"                                  
  [2] "Promising Young Woman"                                  
  [3] "365 Days"                                               
  [4] "Arkansas"                                               
  [5] "Tenet"                                                  
  [6] "The Nest"                                               
  [7] "Greyhound"                                              
  [8] "The Hunt"                                               
  [9] "Hamilton"                                               
 [10] "The Dry"                                                
 [11] "Greenland"                                              
 [12] "Another Round"                                          
 [13] "Soul"                                                   
 [14] "The Devil All the Time"                                 
 [15] "The Father"                                             
 [16] "Birds of Prey"                                          
 [17] "Emma."                                                  
 [18] "Sonic the Hedgehog"                                     
 [19] "After We Collided"                                      
 [20] "Love and Monsters"                                      
 [21] "Trolls World Tour"                                      
 [22] "Palm Springs"                                           
 [23] "Enola Holmes"                                           
 [24] "I Care a Lot"                                           
 [25] "I'm Thinking of Ending Things"                          
 [26] "Nomadland"                                              
 [27] "The Old Guard"                                          
 [28] "Underwater"                                             
 [29] "Ava"                                                    
 [30] "Alone"                                                  
 [31] "Extraction"                                             
 [32] "A Quiet Place Part II"                                  
 [33] "Minari"                                                 
 [34] "The Invisible Man"                                      
 [35] "The Unhealer"                                           
 [36] "Wonder Woman 1984"                                      
 [37] "Run"                                                    
 [38] "Run Hide Fight"                                         
 [39] "Pieces of a Woman"                                      
 [40] "Relic"                                                  
 [41] "The Empty Man"                                          
 [42] "The New Mutants"                                        
 [43] "The Call of the Wild"                                   
 [44] "Mulan"                                                  
 [45] "The Silencing"                                          
 [46] "The Courier"                                            
 [47] "Shiva Baby"                                             
 [48] "Onward"                                                 
 [49] "The Call"                                               
 [50] "The Last Champion"                                      
 [51] "Eurovision Song Contest: The Story of Fire Saga"        
 [52] "Riders of Justice"                                      
 [53] "The Trial of the Chicago 7"                             
 [54] "The Witches"                                            
 [55] "Dolittle"                                               
 [56] "Possessor"                                              
 [57] "Freaky"                                                 
 [58] "Spenser Confidential"                                   
 [59] "Bad Boys for Life"                                      
 [60] "The Midnight Sky"                                       
 [61] "Rebecca"                                                
 [62] "Zola"                                                   
 [63] "The Wrong Missy"                                        
 [64] "The Forgotten Battle"                                   
 [65] "You Should Have Left"                                   
 [66] "Unhinged"                                               
 [67] "The Secret: Dare to Dream"                              
 [68] "The Tax Collector"                                      
 [69] "Lost Girls and Love Hotels"                             
 [70] "The Croods: A New Age"                                  
 [71] "The Banker"                                             
 [72] "The Night House"                                        
 [73] "Mank"                                                   
 [74] "Fantasy Island"                                         
 [75] "The Rental"                                             
 [76] "The King of Staten Island"                              
 [77] "The Dark and the Wicked"                                
 [78] "Borat Subsequent Moviefilm"                             
 [79] "The World to Come"                                      
 [80] "Monster Hunter"                                         
 [81] "We Can Be Heroes"                                       
 [82] "Peninsula"                                              
 [83] "Boss Level"                                             
 [84] "Palm Swings"                                            
 [85] "Hillbilly Elegy"                                        
 [86] "News of the World"                                      
 [87] "Gretel & Hansel"                                        
 [88] "Finding You"                                            
 [89] "Nocturne"                                               
 [90] "Ammonite"                                               
 [91] "Body Cam"                                               
 [92] "Inheritance"                                            
 [93] "The Babysitter: Killer Queen"                           
 [94] "Bloodshot"                                              
 [95] "Demon Slayer: Kimetsu no Yaiba - The Movie: Mugen Train"
 [96] "Project Power"                                          
 [97] "Becky"                                                  
 [98] "French Exit"                                            
 [99] "Simple Passion"                                         
[100] "#Alive"                                                 
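
The two steps above can also be done in a single pass with capture groups. This is a sketch on a small hypothetical sample with the same shape as `ranktitle_data`. The names `x`, `parts`, `rank`, and `title` are illustrative only. The dot after the rank is escaped so that it matches a literal period rather than any character.

```r
library(stringr)

# Hypothetical sample of "rank. title" strings, shaped like ranktitle_data
x <- c("1. The Postcard Killings", "3. 365 Days", "100. #Alive")

# Group 1 captures the rank digits, group 2 everything after "<rank>. "
parts <- str_match(x, "^(\\d+)\\.\\s+(.*)$")

rank  <- as.integer(parts[, 2])  # column 1 of the matrix holds the full match
title <- parts[, 3]
print(rank)
print(title)
```

Anchoring the pattern at the start keeps digits inside a title, such as "365 Days", out of the rank.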

3.2 Description

# Using CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.ipc-html-content-inner-div'))
{xml_nodeset (100)}
 [1] <div class="ipc-html-content-inner-div">A New York detective investigate ...
 [2] <div class="ipc-html-content-inner-div">A young woman, traumatized by a  ...
 [3] <div class="ipc-html-content-inner-div">Massimo is a member of the Sicil ...
 [4] <div class="ipc-html-content-inner-div">Kyle and Swin live by the orders ...
 [5] <div class="ipc-html-content-inner-div">Armed with only one word, Tenet, ...
 [6] <div class="ipc-html-content-inner-div">Life for an entrepreneur and his ...
 [7] <div class="ipc-html-content-inner-div">Several months after the U.S. en ...
 [8] <div class="ipc-html-content-inner-div">Twelve strangers wake up in a cl ...
 [9] <div class="ipc-html-content-inner-div">The real life of one of America' ...
[10] <div class="ipc-html-content-inner-div">Aaron Falk returns to his drough ...
[11] <div class="ipc-html-content-inner-div">A family struggles for survival  ...
[12] <div class="ipc-html-content-inner-div">Four high-school teachers consum ...
[13] <div class="ipc-html-content-inner-div">After landing the gig of a lifet ...
[14] <div class="ipc-html-content-inner-div">Sinister characters converge aro ...
[15] <div class="ipc-html-content-inner-div">A man refuses all assistance fro ...
[16] <div class="ipc-html-content-inner-div">After splitting with the Joker,  ...
[17] <div class="ipc-html-content-inner-div">In 1800s England, a well meaning ...
[18] <div class="ipc-html-content-inner-div">After discovering a small, blue, ...
[19] <div class="ipc-html-content-inner-div">Based on the 2014 romance novel  ...
[20] <div class="ipc-html-content-inner-div">Seven years after he survived th ...
...
# Converting the description data to text
description_data <- html_text(description_data_html)
# take a look at first few
head(description_data)
[1] "A New York detective investigates the death of his daughter who was murdered while on her honeymoon in London; he recruits the help of a Scandinavian journalist when other couples throughout Europe suffer a similar fate."         
[2] "A young woman, traumatized by a tragic event in her past, seeks out vengeance against those who crossed her path."                                                                                                                    
[3] "Massimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
[4] "Kyle and Swin live by the orders of an Arkansas-based drug kingpin named Frog, whom they've never met. But when a deal goes horribly wrong, the consequences are deadly."                                                             
[5] "Armed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
[6] "Life for an entrepreneur and his American family begins to take a twisted turn after moving into an English country manor."                                                                                                           

3.3 Runtime

  • Retrieve runtime strings
# Using CSS selectors to scrape the movie runtime section
runtime_text <- webpage |>
  html_nodes('.dli-title-metadata-item:nth-child(2)') |>
  html_text() |>
  print()
  [1] "1h 44m" "1h 53m" "1h 54m" "1h 57m" "2h 30m" "1h 47m" "1h 31m" "1h 30m"
  [9] "2h 40m" "1h 57m" "1h 59m" "1h 57m" "1h 40m" "2h 18m" "1h 37m" "1h 49m"
 [17] "2h 4m"  "1h 39m" "1h 45m" "1h 49m" "1h 31m" "1h 30m" "2h 3m"  "1h 58m"
 [25] "2h 14m" "1h 47m" "2h 5m"  "1h 35m" "1h 36m" "1h 38m" "1h 56m" "1h 37m"
 [33] "1h 55m" "2h 4m"  "1h 34m" "2h 31m" "1h 30m" "1h 49m" "2h 6m"  "1h 29m"
 [41] "2h 17m" "1h 34m" "1h 40m" "1h 55m" "1h 33m" "1h 52m" "1h 17m" "1h 42m"
 [49] "1h 52m" "2h 2m"  "2h 3m"  "1h 56m" "2h 9m"  "1h 46m" "1h 41m" "1h 43m"
 [57] "1h 42m" "1h 51m" "2h 4m"  "1h 58m" "2h 3m"  "1h 26m" "1h 30m" "2h 4m" 
 [65] "1h 33m" "1h 30m" "1h 47m" "1h 35m" "1h 37m" "1h 35m" "2h"     "1h 47m"
 [73] "2h 11m" "1h 49m" "1h 28m" "2h 16m" "1h 35m" "1h 35m" "1h 45m" "1h 43m"
 [81] "1h 40m" "1h 56m" "1h 40m" "1h 35m" "1h 56m" "1h 58m" "1h 27m" "1h 59m"
 [89] "1h 30m" "1h 57m" "1h 36m" "1h 51m" "1h 41m" "1h 49m" "1h 57m" "1h 53m"
 [97] "1h 33m" "1h 53m" "1h 39m" "1h 38m"
  • Hours and minutes:
# hours
runtime_hour <- runtime_text |>
  str_extract("\\d+(?=h)") |>
  as.integer() |>
  print()
  [1] 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 1 2 1
 [38] 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1 1 1 2 1 2 1
 [75] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# minutes
runtime_min <- runtime_text |>
  str_extract("\\d+(?=m)") |>
  # replace NA by 0
  str_replace_na("0") |>
  as.integer() |>
  print()
  [1] 44 53 54 57 30 47 31 30 40 57 59 57 40 18 37 49  4 39 45 49 31 30  3 58 14
 [26] 47  5 35 36 38 56 37 55  4 34 31 30 49  6 29 17 34 40 55 33 52 17 42 52  2
 [51]  3 56  9 46 41 43 42 51  4 58  3 26 30  4 33 30 47 35 37 35  0 47 11 49 28
 [76] 16 35 35 45 43 40 56 40 35 56 58 27 59 30 57 36 51 41 49 57 53 33 53 39 38
  • Runtime in minutes
runtime_data <- (runtime_hour * 60 + runtime_min) |> print()
  [1] 104 113 114 117 150 107  91  90 160 117 119 117 100 138  97 109 124  99
 [19] 105 109  91  90 123 118 134 107 125  95  96  98 116  97 115 124  94 151
 [37]  90 109 126  89 137  94 100 115  93 112  77 102 112 122 123 116 129 106
 [55] 101 103 102 111 124 118 123  86  90 124  93  90 107  95  97  95 120 107
 [73] 131 109  88 136  95  95 105 103 100 116 100  95 116 118  87 119  90 117
 [91]  96 111 101 109 117 113  93 113  99  98
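
The hour and minute steps can be wrapped into one helper. The sketch below uses a hypothetical function name, `runtime_to_minutes()`. It also covers a case the code above does not: a runtime with no hour part, such as "45m". That input is an assumption for illustration, as none appears in this page's data.

```r
library(stringr)

# Hypothetical helper: convert strings such as "1h 44m", "2h", or "45m" to minutes
runtime_to_minutes <- function(x) {
  hours   <- as.integer(str_extract(x, "\\d+(?=h)"))
  minutes <- as.integer(str_extract(x, "\\d+(?=m)"))
  # a missing component means zero hours or zero minutes
  hours[is.na(hours)] <- 0L
  minutes[is.na(minutes)] <- 0L
  out <- hours * 60L + minutes
  # keep genuinely missing runtimes as NA rather than 0
  out[is.na(x)] <- NA_integer_
  out
}

runtime_to_minutes(c("1h 44m", "2h", "45m", NA))
```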

3.4 Film rating

  • Film rating:
filmrating_data <- webpage |>
  html_nodes('.dli-title-metadata-item:nth-child(3)') |>
  html_text() |>
  str_replace("Unrated", "Not Rated") |>
  print()
  [1] "Not Rated" "R"         "TV-MA"     "R"         "PG-13"     "R"        
  [7] "PG-13"     "R"         "PG-13"     "R"         "PG-13"     "Not Rated"
 [13] "PG"        "R"         "PG-13"     "R"         "PG"        "PG"       
 [19] "R"         "PG-13"     "PG"        "R"         "PG-13"     "R"        
 [25] "R"         "R"         "R"         "PG-13"     "R"         "R"        
 [31] "R"         "PG-13"     "PG-13"     "R"         "Not Rated" "PG-13"    
 [37] "PG-13"     "TV-MA"     "R"         "R"         "R"         "PG-13"    
 [43] "PG"        "PG-13"     "R"         "PG-13"     "Not Rated" "PG"       
 [49] "TV-MA"     "PG-13"     "PG-13"     "Not Rated" "R"         "PG"       
 [55] "PG"        "R"         "R"         "R"         "R"         "PG-13"    
 [61] "PG-13"     "R"         "TV-MA"     "TV-MA"     "R"         "R"        
 [67] "PG"        "Not Rated" "R"         "PG"        "PG-13"     "R"        
 [73] "R"         "PG-13"     "R"         "R"         "Not Rated" "R"        
 [79] "R"         "PG-13"     "PG"        "Not Rated" "TV-MA"     "Not Rated"
 [85] "R"         "PG-13"     "PG-13"     "PG"        "Not Rated" "R"        
 [91] "R"         "TV-MA"     "TV-MA"     "PG-13"     "TV-MA"     "R"        
 [97] "R"         "R"         "Not Rated" "TV-MA"    

3.5 Votes

  • Vote data
votes_data <- webpage |>
  html_nodes('.cPpOqU') |>
  html_text() |>
  str_remove("Votes") |>
  str_remove_all(",") |>
  as.numeric() |>
  print()
  [1]  14773 207750  97303  14677 579532  17658 111867 126710 110869  31996
 [11] 130963 190323 372388 150567 187570 261927  61465 156777  37501 143110
 [21]  28179 179394 215377 144000  98611 180085 181328  95466  61911  26367
 [31] 258912 267260  94833 251474   3374 289146  93941  26829  55611  29743
 [41]  37468  88080  55663 157830  32942  72092  31838 167311  42391    718
 [51] 101354  61975 191339  46055  70390  44851  71040  96540 173184  88598
 [61]  46054  17034  44660  36455  25288  75799  17226  15199   5270  52042
 [71]  34551  65665  81926  57227  37437  75873  21830 152043  10619  67178
 [81]  16902  37237  78034   1242  45878  94757  33651   8349  11331  22625
 [91]   7283  18786  49915  83711  70992  96719  24031   8644   1042  46885
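
When cleaning counts like these, note that `str_remove()` deletes only the first match, while `str_remove_all()` deletes every match. The difference matters once a count has more than one thousands separator. The input strings below are hypothetical, chosen to show the two behaviors side by side.

```r
library(stringr)

# Hypothetical raw strings; the second has two thousands separators
raw <- c("Votes14,773", "Votes1,234,567")

first_only <- raw |> str_remove("Votes") |> str_remove(",")
all_commas <- raw |> str_remove("Votes") |> str_remove_all(",")

print(first_only)  # the second element still contains a comma
votes <- as.numeric(all_commas)
print(votes)
```

Calling `as.numeric()` on `first_only` would give `NA` (with a warning) for the second element, because "1234,567" is not a valid number.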

3.6 User rating

  • User rating:
userrating_data <- webpage |>
  html_nodes('.ratingGroup--imdb-rating') |>
  html_text() |>
  str_extract("^\\d+\\.\\d+") |>
  as.numeric() |>
  print()
  [1] 5.8 7.5 3.3 6.0 7.3 6.3 7.0 6.5 8.3 6.9 6.4 7.7 8.0 7.1 8.2 6.1 6.7 6.5
 [19] 5.0 6.9 6.1 7.4 6.6 6.4 6.6 7.3 6.7 5.9 5.5 6.2 6.8 7.2 7.4 7.1 5.6 5.4
 [37] 6.7 6.3 7.0 6.0 6.2 5.3 6.7 5.8 6.3 7.2 7.1 7.4 7.1 6.9 6.5 7.5 7.7 5.4
 [55] 5.6 6.5 6.3 6.2 6.5 5.7 6.0 6.5 5.7 7.1 5.4 6.0 6.5 4.8 4.7 6.9 7.3 6.5
 [73] 6.8 4.9 5.7 7.1 6.1 6.6 6.4 5.2 4.7 5.5 6.8 4.4 6.7 6.8 5.5 6.4 5.7 6.5
 [91] 5.3 5.6 5.8 5.7 8.2 6.0 6.0 5.9 5.4 6.3

3.7 Metascore

  • We encounter the issue of missing data when scraping metascore.

  • The scrape below returns only 91 metascores, so 9 movies have none. We could find those movies manually, but that is tedious and not reproducible.

# Using CSS selectors to scrape the metascore section
ms_data <- html_nodes(webpage, '.metacritic-score-box') |>
  html_text() |>
  as.integer() |>
  print()
 [1] 29 72 55 69 80 64 50 88 69 64 79 83 55 88 60 71 47 14 63 51 83 68 66 78 88
[26] 70 48 39 70 56 71 89 72 60 67 13 66 77 43 48 66 65 79 61 50 81 76 47 26 72
[51] 67 49 59 58 46 76 33 46 40 32 22 57 56 59 68 79 22 62 67 72 68 73 47 51 51
[76] 56 38 73 64 41 58 72 37 31 22 44 72 51 54 56 57
  • First, scrape the rank-and-title strings (never missing) together with the metascores (when present), in page order.
rank_and_metascore <- webpage |>
  html_nodes('.ipc-title-link-wrapper .ipc-title__text , .metacritic-score-box') |>
  html_text() |>
  # keep only the text before the first space
  str_remove(" .*") |>
  print()
  [1] "1."   "29"   "2."   "72"   "3."   "4."   "55"   "5."   "69"   "6."  
 [11] "80"   "7."   "64"   "8."   "50"   "9."   "88"   "10."  "69"   "11." 
 [21] "64"   "12."  "79"   "13."  "83"   "14."  "55"   "15."  "88"   "16." 
 [31] "60"   "17."  "71"   "18."  "47"   "19."  "14"   "20."  "63"   "21." 
 [41] "51"   "22."  "83"   "23."  "68"   "24."  "66"   "25."  "78"   "26." 
 [51] "88"   "27."  "70"   "28."  "48"   "29."  "39"   "30."  "70"   "31." 
 [61] "56"   "32."  "71"   "33."  "89"   "34."  "72"   "35."  "36."  "60"  
 [71] "37."  "67"   "38."  "13"   "39."  "66"   "40."  "77"   "41."  "42." 
 [81] "43"   "43."  "48"   "44."  "66"   "45."  "46."  "65"   "47."  "79"  
 [91] "48."  "61"   "49."  "50."  "51."  "50"   "52."  "81"   "53."  "76"  
[101] "54."  "47"   "55."  "26"   "56."  "72"   "57."  "67"   "58."  "49"  
[111] "59."  "59"   "60."  "58"   "61."  "46"   "62."  "76"   "63."  "33"  
[121] "64."  "65."  "46"   "66."  "40"   "67."  "32"   "68."  "22"   "69." 
[131] "57"   "70."  "56"   "71."  "59"   "72."  "68"   "73."  "79"   "74." 
[141] "22"   "75."  "62"   "76."  "67"   "77."  "72"   "78."  "68"   "79." 
[151] "73"   "80."  "47"   "81."  "51"   "82."  "51"   "83."  "56"   "84." 
[161] "85."  "38"   "86."  "73"   "87."  "64"   "88."  "41"   "89."  "58"  
[171] "90."  "72"   "91."  "37"   "92."  "31"   "93."  "22"   "94."  "44"  
[181] "95."  "72"   "96."  "51"   "97."  "54"   "98."  "56"   "99."  "57"  
[191] "100."
# logical vector indicating if the element is a rank
isrank <- str_detect(rank_and_metascore, "\\.$")
# a rank followed by another rank is a missing metascore
ismissing <- isrank[1:(length(rank_and_metascore) - 1)] & isrank[2:(length(rank_and_metascore))]
# last entry is missing or not
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
# which ranks are missing metascore
missingpos <- as.integer(rank_and_metascore[ismissing])
metascore_data <- rep(NA, 100)
# `|>` binds tighter than `<-`, so print() displays ms_data (91 values), not metascore_data
metascore_data[-missingpos] <- ms_data |> print()
 [1] 29 72 55 69 80 64 50 88 69 64 79 83 55 88 60 71 47 14 63 51 83 68 66 78 88
[26] 70 48 39 70 56 71 89 72 60 67 13 66 77 43 48 66 65 79 61 50 81 76 47 26 72
[51] 67 49 59 58 46 76 33 46 40 32 22 57 56 59 68 79 22 62 67 72 68 73 47 51 51
[76] 56 38 73 64 41 58 72 37 31 22 44 72 51 54 56 57
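
The position bookkeeping above can be avoided by scraping one movie at a time. This is a minimal sketch, assuming rvest >= 1.0.0 (where `html_elements()` and `html_element()` are the current names for `html_nodes()` and `html_node()`). It runs on a made-up HTML snippet. The class names `.movie`, `.title`, and `.score` are hypothetical stand-ins, not IMDb's real selectors. `html_elements()` selects each movie's container. `html_element()` (singular) then returns exactly one result per container, with a missing node where there is no match, so the scores stay aligned with the titles.

```r
library(rvest)

# Made-up page: the second movie has no score element
toy <- read_html('<ul>
  <li class="movie"><h3 class="title">1. Alpha</h3><span class="score">70</span></li>
  <li class="movie"><h3 class="title">2. Beta</h3></li>
  <li class="movie"><h3 class="title">3. Gamma</h3><span class="score">55</span></li>
</ul>')

# One node per movie
cards <- html_elements(toy, ".movie")

# One result per card; a card without a match yields NA
titles <- cards |> html_element(".title") |> html_text()
scores <- cards |> html_element(".score") |> html_text() |> as.integer()

print(titles)
print(scores)
```

On a real page, the container selector has to be found first, for example with SelectorGadget or the browser's inspector.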

3.8 Visualizing movie data

  • Form a tibble:
# Combine all the vectors into a tibble
movies <- tibble(
  poprank = rank_data, 
  title = title_data,
  description = description_data, 
  runtime = runtime_data,
  filmrating = filmrating_data,
  userrating = userrating_data,
  metascore = metascore_data, 
  votes = votes_data,
) |>
  print(width=Inf)
# A tibble: 100 × 8
   poprank title                
     <int> <chr>                
 1       1 The Postcard Killings
 2       2 Promising Young Woman
 3       3 365 Days             
 4       4 Arkansas             
 5       5 Tenet                
 6       6 The Nest             
 7       7 Greyhound            
 8       8 The Hunt             
 9       9 Hamilton             
10      10 The Dry              
   description                                                                  
   <chr>                                                                        
 1 A New York detective investigates the death of his daughter who was murdered…
 2 A young woman, traumatized by a tragic event in her past, seeks out vengeanc…
 3 Massimo is a member of the Sicilian Mafia family and Laura is a sales direct…
 4 Kyle and Swin live by the orders of an Arkansas-based drug kingpin named Fro…
 5 Armed with only one word, Tenet, and fighting for the survival of the entire…
 6 Life for an entrepreneur and his American family begins to take a twisted tu…
 7 Several months after the U.S. entry into World War II, an inexperienced U.S.…
 8 Twelve strangers wake up in a clearing. They don't know where they are, or h…
 9 The real life of one of America's foremost founding fathers and first Secret…
10 Aaron Falk returns to his drought-stricken hometown to attend a tragic funer…
   runtime filmrating userrating metascore  votes
     <dbl> <chr>           <dbl>     <int>  <dbl>
 1     104 Not Rated         5.8        29  14773
 2     113 R                 7.5        72 207750
 3     114 TV-MA             3.3        NA  97303
 4     117 R                 6          55  14677
 5     150 PG-13             7.3        69 579532
 6     107 R                 6.3        80  17658
 7      91 PG-13             7          64 111867
 8      90 R                 6.5        50 126710
 9     160 PG-13             8.3        88 110869
10     117 R                 6.9        69  31996
# ℹ 90 more rows
  • Top 5 popular movies:
movies |>
  slice_min(order_by = poprank, n = 5) |>
  print(width=Inf)
# A tibble: 5 × 8
  poprank title                
    <int> <chr>                
1       1 The Postcard Killings
2       2 Promising Young Woman
3       3 365 Days             
4       4 Arkansas             
5       5 Tenet                
  description                                                                   
  <chr>                                                                         
1 A New York detective investigates the death of his daughter who was murdered …
2 A young woman, traumatized by a tragic event in her past, seeks out vengeance…
3 Massimo is a member of the Sicilian Mafia family and Laura is a sales directo…
4 Kyle and Swin live by the orders of an Arkansas-based drug kingpin named Frog…
5 Armed with only one word, Tenet, and fighting for the survival of the entire …
  runtime filmrating userrating metascore  votes
    <dbl> <chr>           <dbl>     <int>  <dbl>
1     104 Not Rated         5.8        29  14773
2     113 R                 7.5        72 207750
3     114 TV-MA             3.3        NA  97303
4     117 R                 6          55  14677
5     150 PG-13             7.3        69 579532
  • Top 5 user-rated movies (slice_max() keeps ties by default, so 6 rows are returned):
movies |>
  slice_max(order_by = userrating, n = 5) |>
  print(width = Inf)
# A tibble: 6 × 8
  poprank title                                                  
    <int> <chr>                                                  
1       9 Hamilton                                               
2      15 The Father                                             
3      95 Demon Slayer: Kimetsu no Yaiba - The Movie: Mugen Train
4      13 Soul                                                   
5      12 Another Round                                          
6      53 The Trial of the Chicago 7                             
  description                                                                   
  <chr>                                                                         
1 The real life of one of America's foremost founding fathers and first Secreta…
2 A man refuses all assistance from his daughter as he ages. As he tries to mak…
3 After his family was brutally murdered and his sister turned into a demon, Ta…
4 After landing the gig of a lifetime, a New York jazz pianist suddenly finds h…
5 Four high-school teachers consume alcohol on a daily basis to see how it affe…
6 The story of 7 people on trial stemming from various charges surrounding the …
  runtime filmrating userrating metascore  votes
    <dbl> <chr>           <dbl>     <int>  <dbl>
1     160 PG-13             8.3        88 110869
2      97 PG-13             8.2        88 187570
3     117 TV-MA             8.2        72  70992
4     100 PG                8          83 372388
5     117 Not Rated         7.7        79 190323
6     129 R                 7.7        76 191339
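The output above has 6 rows rather than 5 because slice_max() keeps tied values at the cutoff by default. A minimal sketch on a small made-up tibble (the toy data and the names toy, kept, and exact are hypothetical, not part of the scraped data) shows how the with_ties argument controls this:

```r
library(dplyr)

# Hypothetical toy data with a tie at the cutoff (not the scraped movies)
toy <- tibble(
  title = c("A", "B", "C", "D"),
  userrating = c(8.3, 8.2, 7.7, 7.7)
)

# Default (with_ties = TRUE): both rows tied at 7.7 are kept
kept <- toy |> slice_max(order_by = userrating, n = 3)

# with_ties = FALSE returns exactly n rows
exact <- toy |> slice_max(order_by = userrating, n = 3, with_ties = FALSE)

nrow(kept)
nrow(exact)
```

Whether to keep ties depends on the question: keeping them avoids an arbitrary choice among equally rated movies, while dropping them guarantees a fixed table size.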
  • Top 5 movies by metascore (ties are kept, so 6 rows are returned):
movies |>
  slice_max(order_by = metascore, n = 5) |>
  print(width = Inf)
# A tibble: 6 × 8
  poprank title       
    <int> <chr>       
1      33 Minari      
2       9 Hamilton    
3      15 The Father  
4      26 Nomadland   
5      13 Soul        
6      22 Palm Springs
  description                                                                   
  <chr>                                                                         
1 A Korean American family moves to an Arkansas farm in search of its own Ameri…
2 The real life of one of America's foremost founding fathers and first Secreta…
3 A man refuses all assistance from his daughter as he ages. As he tries to mak…
4 A woman in her sixties, after losing everything in the Great Recession, embar…
5 After landing the gig of a lifetime, a New York jazz pianist suddenly finds h…
6 Stuck in a time loop, two wedding guests develop a budding romance while livi…
  runtime filmrating userrating metascore  votes
    <dbl> <chr>           <dbl>     <int>  <dbl>
1     115 PG-13             7.4        89  94833
2     160 PG-13             8.3        88 110869
3      97 PG-13             8.2        88 187570
4     107 R                 7.3        88 180085
5     100 PG                8          83 372388
6      90 R                 7.4        83 179394
  • How many top 100 movies are in each film rating category?
movies %>%
  count(filmrating)
# A tibble: 5 × 2
  filmrating     n
  <chr>      <int>
1 Not Rated     11
2 PG            12
3 PG-13         25
4 R             42
5 TV-MA         10
# bar plot
ggplot(data = movies) +
  geom_bar(mapping = aes(x = fct_infreq(filmrating))) + 
  labs(x = "Film rating", y = "Count")

  • Is there a relationship between user rating and metascore (critics' rating)? How can we convey the number of votes? Should we stratify by film rating?
ggplot(data = movies, mapping = aes(x = userrating, y = metascore)) +
  geom_point(mapping = aes(size = votes, color = filmrating)) + 
  geom_smooth() +
  labs(y = "Metascore", x = "User rating")
Warning: Removed 9 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 9 rows containing missing values (`geom_point()`).
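The warnings above appear because 9 movies have a missing metascore, and ggplot2 drops those rows silently. As a rough sketch, the missing rows can be dropped explicitly and the association summarized with a correlation. The small tibble below copies the userrating and metascore values of the first five movies printed earlier; the names first5 and complete are hypothetical.

```r
library(dplyr)
library(tidyr)

# userrating and metascore of the first five movies shown earlier
first5 <- tibble(
  userrating = c(5.8, 7.5, 3.3, 6.0, 7.3),
  metascore  = c(29L, 72L, NA, 55L, 69L)
)

# Drop rows with a missing metascore explicitly instead of relying on ggplot2
complete <- first5 |> drop_na(metascore)
nrow(complete)

# Pearson correlation between user rating and metascore on the complete rows
cor(complete$userrating, complete$metascore)
```

On the full data the same two steps would apply to movies; five rows are far too few to draw conclusions from.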

4 Example: Scraping finance data

  • The quantmod package contains many utility functions for retrieving and plotting financial data. For example:
library(quantmod)
stock <- getSymbols("TSLA", src = "yahoo", auto.assign = FALSE, from = "2020-01-01")
head(stock)
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02  28.30000  28.71333 28.11400   28.68400   142981500      28.68400
2020-01-03  29.36667  30.26667 29.12800   29.53400   266677500      29.53400
2020-01-06  29.36467  30.10400 29.33333   30.10267   151995000      30.10267
2020-01-07  30.76000  31.44200 30.22400   31.27067   268231500      31.27067
2020-01-08  31.58000  33.23267 31.21533   32.80933   467164500      32.80933
2020-01-09  33.14000  33.25333 31.52467   32.08933   426606000      32.08933
chartSeries(stock, theme = chartTheme("white"),
            type = "line", log.scale = FALSE, TA = NULL)
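Prices are often converted to returns before further analysis. The sketch below computes daily log returns from the first four TSLA closing prices copied from the head(stock) output above, so it runs without a network connection. The object names close and log_ret are hypothetical; on the downloaded object one would presumably start from Cl(stock) instead.

```r
library(xts)

# First four TSLA closing prices copied from the head(stock) output
close <- xts(
  c(28.68400, 29.53400, 30.10267, 31.27067),
  order.by = as.Date(c("2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"))
)
colnames(close) <- "TSLA.Close"

# Daily log returns; the first entry is NA because there is no previous day
log_ret <- diff(log(close))
log_ret
```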

5 Example: Pull tweets into R

library(twitteR) #load package
  • Step 1: Apply for a Twitter developer account. Approval can take some time.

  • Step 2: Generate and copy the Twitter App Keys.

consumer_key <- 'XXXXXXXXXX'
consumer_secret <- 'XXXXXXXXXX'
access_token <- 'XXXXXXXXXX'
access_secret <- 'XXXXXXXXXX'
  • Step 3: Set up authentication.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
  • Step 4: Pull tweets
virus <- searchTwitter('#China + #Coronavirus', 
                       n = 1000, 
                       since = '2020-01-01', 
                       retryOnRateLimit = 1e3)
virus_df <- as_tibble(twListToDF(virus))
virus_df %>% print(width = Inf)
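
Hard-coding keys in a script, as in Step 2, makes them easy to leak through version control. One common alternative, sketched below, is to read them from environment variables. The variable name TWITTER_CONSUMER_KEY and the helper get_key() are hypothetical choices for illustration, not something twitteR requires.

```r
# Hypothetical helper: read a key from an environment variable, fail early if unset
get_key <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Placeholder value set here only so the example runs; in practice the
# variable would be set outside the script, e.g. in ~/.Renviron
Sys.setenv(TWITTER_CONSUMER_KEY = "XXXXXXXXXX")
consumer_key <- get_key("TWITTER_CONSUMER_KEY")
```

The other three keys can be read the same way and then passed to setup_twitter_oauth() as in Step 3.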