# Specifying the URL of the website to be scraped
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2020-01-01,2020-12-31&count=100"
# Reading the HTML code from the website
(webpage <- read_html(url))
Suppose we want to scrape the following 8 features from this page:
Rank (popularity)
Title
Description
Runtime
Film rating
User rating
Metascore
Votes
3.1 Rank and title
Use SelectorGadget to find the CSS selector .ipc-title-link-wrapper .ipc-title__text.
# Using CSS selectors to scrape the title section
(title_data_html <- html_nodes(webpage, '.ipc-title-link-wrapper .ipc-title__text'))
{xml_nodeset (100)}
[1] <h3 class="ipc-title__text">1. The Postcard Killings</h3>
[2] <h3 class="ipc-title__text">2. Promising Young Woman</h3>
[3] <h3 class="ipc-title__text">3. 365 Days</h3>
[4] <h3 class="ipc-title__text">4. Arkansas</h3>
[5] <h3 class="ipc-title__text">5. Tenet</h3>
[6] <h3 class="ipc-title__text">6. The Nest</h3>
[7] <h3 class="ipc-title__text">7. Greyhound</h3>
[8] <h3 class="ipc-title__text">8. The Hunt</h3>
[9] <h3 class="ipc-title__text">9. Hamilton</h3>
[10] <h3 class="ipc-title__text">10. The Dry</h3>
[11] <h3 class="ipc-title__text">11. Greenland</h3>
[12] <h3 class="ipc-title__text">12. Another Round</h3>
[13] <h3 class="ipc-title__text">13. Soul</h3>
[14] <h3 class="ipc-title__text">14. The Devil All the Time</h3>
[15] <h3 class="ipc-title__text">15. The Father</h3>
[16] <h3 class="ipc-title__text">16. Birds of Prey</h3>
[17] <h3 class="ipc-title__text">17. Emma.</h3>
[18] <h3 class="ipc-title__text">18. Sonic the Hedgehog</h3>
[19] <h3 class="ipc-title__text">19. After We Collided</h3>
[20] <h3 class="ipc-title__text">20. Love and Monsters</h3>
...
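To see what the descendant selector matches without requesting the live page, here is a minimal sketch on an inline HTML snippet. The snippet and the toy_ names are made up for illustration; minimal_html(), html_elements() (the current name for html_nodes() in rvest 1.0 and later) and html_text2() are rvest functions.

```r
library(rvest)

# Hypothetical inline page that mimics the class names used above
toy_page <- minimal_html('
  <div class="ipc-title-link-wrapper"><h3 class="ipc-title__text">1. First Film</h3></div>
  <h3 class="ipc-title__text">Not inside the wrapper</h3>
  <div class="ipc-title-link-wrapper"><h3 class="ipc-title__text">2. Second Film</h3></div>
')

# "A B" selects elements matching B that sit anywhere inside an element matching A
toy_titles <- toy_page |>
  html_elements(".ipc-title-link-wrapper .ipc-title__text") |>
  html_text2()

print(toy_titles)
```

The heading outside the wrapper is skipped, which is why the two-part selector is preferable to .ipc-title__text alone when a page reuses that class elsewhere.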
# Converting the title data to text
(ranktitle_data <- html_text(title_data_html))
[1] "1. The Postcard Killings"
[2] "2. Promising Young Woman"
[3] "3. 365 Days"
[4] "4. Arkansas"
[5] "5. Tenet"
[6] "6. The Nest"
[7] "7. Greyhound"
[8] "8. The Hunt"
[9] "9. Hamilton"
[10] "10. The Dry"
[11] "11. Greenland"
[12] "12. Another Round"
[13] "13. Soul"
[14] "14. The Devil All the Time"
[15] "15. The Father"
[16] "16. Birds of Prey"
[17] "17. Emma."
[18] "18. Sonic the Hedgehog"
[19] "19. After We Collided"
[20] "20. Love and Monsters"
[21] "21. Trolls World Tour"
[22] "22. Palm Springs"
[23] "23. Enola Holmes"
[24] "24. I Care a Lot"
[25] "25. I'm Thinking of Ending Things"
[26] "26. Nomadland"
[27] "27. The Old Guard"
[28] "28. Underwater"
[29] "29. Ava"
[30] "30. Alone"
[31] "31. Extraction"
[32] "32. A Quiet Place Part II"
[33] "33. Minari"
[34] "34. The Invisible Man"
[35] "35. The Unhealer"
[36] "36. Wonder Woman 1984"
[37] "37. Run"
[38] "38. Run Hide Fight"
[39] "39. Pieces of a Woman"
[40] "40. Relic"
[41] "41. The Empty Man"
[42] "42. The New Mutants"
[43] "43. The Call of the Wild"
[44] "44. Mulan"
[45] "45. The Silencing"
[46] "46. The Courier"
[47] "47. Shiva Baby"
[48] "48. Onward"
[49] "49. The Call"
[50] "50. The Last Champion"
[51] "51. Eurovision Song Contest: The Story of Fire Saga"
[52] "52. Riders of Justice"
[53] "53. The Trial of the Chicago 7"
[54] "54. The Witches"
[55] "55. Dolittle"
[56] "56. Possessor"
[57] "57. Freaky"
[58] "58. Spenser Confidential"
[59] "59. Bad Boys for Life"
[60] "60. The Midnight Sky"
[61] "61. Rebecca"
[62] "62. Zola"
[63] "63. The Wrong Missy"
[64] "64. The Forgotten Battle"
[65] "65. You Should Have Left"
[66] "66. Unhinged"
[67] "67. The Secret: Dare to Dream"
[68] "68. The Tax Collector"
[69] "69. Lost Girls and Love Hotels"
[70] "70. The Croods: A New Age"
[71] "71. The Banker"
[72] "72. The Night House"
[73] "73. Mank"
[74] "74. Fantasy Island"
[75] "75. The Rental"
[76] "76. The King of Staten Island"
[77] "77. The Dark and the Wicked"
[78] "78. Borat Subsequent Moviefilm"
[79] "79. The World to Come"
[80] "80. Monster Hunter"
[81] "81. We Can Be Heroes"
[82] "82. Peninsula"
[83] "83. Boss Level"
[84] "84. Palm Swings"
[85] "85. Hillbilly Elegy"
[86] "86. News of the World"
[87] "87. Gretel & Hansel"
[88] "88. Finding You"
[89] "89. Nocturne"
[90] "90. Ammonite"
[91] "91. Body Cam"
[92] "92. Inheritance"
[93] "93. The Babysitter: Killer Queen"
[94] "94. Bloodshot"
[95] "95. Demon Slayer: Kimetsu no Yaiba - The Movie: Mugen Train"
[96] "96. Project Power"
[97] "97. Becky"
[98] "98. French Exit"
[99] "99. Simple Passion"
[100] "100. #Alive"
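Each string combines the rank and the title, so the two need to be separated. The code for that step is not shown here; a minimal sketch of one way to do it, assuming every entry has the form "rank. title", uses base R regular expressions. The toy_ names are hypothetical and the input is a three-element excerpt of the output above.

```r
# Excerpt in the same "rank. title" shape as ranktitle_data
ranktitle_toy <- c("1. The Postcard Killings", "17. Emma.", "100. #Alive")

# Rank: the leading digits in front of the first ". "
rank_toy <- as.integer(sub("^([0-9]+)\\. .*$", "\\1", ranktitle_toy))

# Title: whatever remains after removing the leading "<digits>. " prefix
title_toy <- sub("^[0-9]+\\. +", "", ranktitle_toy)

print(rank_toy)
print(title_toy)
```

Anchoring both patterns at the start of the string keeps titles that themselves contain digits or periods (such as "Emma." or "365 Days") intact.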
[1] "The Postcard Killings"
[2] "Promising Young Woman"
[3] "365 Days"
[4] "Arkansas"
[5] "Tenet"
[6] "The Nest"
[7] "Greyhound"
[8] "The Hunt"
[9] "Hamilton"
[10] "The Dry"
[11] "Greenland"
[12] "Another Round"
[13] "Soul"
[14] "The Devil All the Time"
[15] "The Father"
[16] "Birds of Prey"
[17] "Emma."
[18] "Sonic the Hedgehog"
[19] "After We Collided"
[20] "Love and Monsters"
[21] "Trolls World Tour"
[22] "Palm Springs"
[23] "Enola Holmes"
[24] "I Care a Lot"
[25] "I'm Thinking of Ending Things"
[26] "Nomadland"
[27] "The Old Guard"
[28] "Underwater"
[29] "Ava"
[30] "Alone"
[31] "Extraction"
[32] "A Quiet Place Part II"
[33] "Minari"
[34] "The Invisible Man"
[35] "The Unhealer"
[36] "Wonder Woman 1984"
[37] "Run"
[38] "Run Hide Fight"
[39] "Pieces of a Woman"
[40] "Relic"
[41] "The Empty Man"
[42] "The New Mutants"
[43] "The Call of the Wild"
[44] "Mulan"
[45] "The Silencing"
[46] "The Courier"
[47] "Shiva Baby"
[48] "Onward"
[49] "The Call"
[50] "The Last Champion"
[51] "Eurovision Song Contest: The Story of Fire Saga"
[52] "Riders of Justice"
[53] "The Trial of the Chicago 7"
[54] "The Witches"
[55] "Dolittle"
[56] "Possessor"
[57] "Freaky"
[58] "Spenser Confidential"
[59] "Bad Boys for Life"
[60] "The Midnight Sky"
[61] "Rebecca"
[62] "Zola"
[63] "The Wrong Missy"
[64] "The Forgotten Battle"
[65] "You Should Have Left"
[66] "Unhinged"
[67] "The Secret: Dare to Dream"
[68] "The Tax Collector"
[69] "Lost Girls and Love Hotels"
[70] "The Croods: A New Age"
[71] "The Banker"
[72] "The Night House"
[73] "Mank"
[74] "Fantasy Island"
[75] "The Rental"
[76] "The King of Staten Island"
[77] "The Dark and the Wicked"
[78] "Borat Subsequent Moviefilm"
[79] "The World to Come"
[80] "Monster Hunter"
[81] "We Can Be Heroes"
[82] "Peninsula"
[83] "Boss Level"
[84] "Palm Swings"
[85] "Hillbilly Elegy"
[86] "News of the World"
[87] "Gretel & Hansel"
[88] "Finding You"
[89] "Nocturne"
[90] "Ammonite"
[91] "Body Cam"
[92] "Inheritance"
[93] "The Babysitter: Killer Queen"
[94] "Bloodshot"
[95] "Demon Slayer: Kimetsu no Yaiba - The Movie: Mugen Train"
[96] "Project Power"
[97] "Becky"
[98] "French Exit"
[99] "Simple Passion"
[100] "#Alive"
3.2 Description
# Using CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.ipc-html-content-inner-div'))
{xml_nodeset (100)}
[1] <div class="ipc-html-content-inner-div">A New York detective investigate ...
[2] <div class="ipc-html-content-inner-div">A young woman, traumatized by a ...
[3] <div class="ipc-html-content-inner-div">Massimo is a member of the Sicil ...
[4] <div class="ipc-html-content-inner-div">Kyle and Swin live by the orders ...
[5] <div class="ipc-html-content-inner-div">Armed with only one word, Tenet, ...
[6] <div class="ipc-html-content-inner-div">Life for an entrepreneur and his ...
[7] <div class="ipc-html-content-inner-div">Several months after the U.S. en ...
[8] <div class="ipc-html-content-inner-div">Twelve strangers wake up in a cl ...
[9] <div class="ipc-html-content-inner-div">The real life of one of America' ...
[10] <div class="ipc-html-content-inner-div">Aaron Falk returns to his drough ...
[11] <div class="ipc-html-content-inner-div">A family struggles for survival ...
[12] <div class="ipc-html-content-inner-div">Four high-school teachers consum ...
[13] <div class="ipc-html-content-inner-div">After landing the gig of a lifet ...
[14] <div class="ipc-html-content-inner-div">Sinister characters converge aro ...
[15] <div class="ipc-html-content-inner-div">A man refuses all assistance fro ...
[16] <div class="ipc-html-content-inner-div">After splitting with the Joker, ...
[17] <div class="ipc-html-content-inner-div">In 1800s England, a well meaning ...
[18] <div class="ipc-html-content-inner-div">After discovering a small, blue, ...
[19] <div class="ipc-html-content-inner-div">Based on the 2014 romance novel ...
[20] <div class="ipc-html-content-inner-div">Seven years after he survived th ...
...
# Converting the description data to text
description_data <- html_text(description_data_html)
# Take a look at the first few
head(description_data)
[1] "A New York detective investigates the death of his daughter who was murdered while on her honeymoon in London; he recruits the help of a Scandinavian journalist when other couples throughout Europe suffer a similar fate."
[2] "A young woman, traumatized by a tragic event in her past, seeks out vengeance against those who crossed her path."
[3] "Massimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
[4] "Kyle and Swin live by the orders of an Arkansas-based drug kingpin named Frog, whom they've never met. But when a deal goes horribly wrong, the consequences are deadly."
[5] "Armed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."
[6] "Life for an entrepreneur and his American family begins to take a twisted turn after moving into an English country manor."
3.3 Runtime
Retrieve the runtime strings:
# Using CSS selectors to scrape the movie runtime section
runtime_text <- webpage |>
  html_nodes('.dli-title-metadata-item:nth-child(2)') |>
  html_text() |>
  print()
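The runtime column in the final table is numeric (minutes), so the scraped strings have to be converted. The sketch below assumes the strings look like "1h 44m", "2h" or "45m"; that format, the helper num_before() and the toy_ names are assumptions for illustration, not taken from the page output.

```r
# Hypothetical runtime strings in the assumed "<hours>h <minutes>m" format
runtime_toy <- c("1h 44m", "2h 30m", "2h", "45m")

# Return the number directly in front of a unit letter, or 0 if the unit is absent
num_before <- function(x, unit) {
  has_unit <- grepl(paste0("[0-9]+", unit), x)
  out <- numeric(length(x))
  out[has_unit] <- as.numeric(
    sub(paste0(".*?([0-9]+)", unit, ".*"), "\\1", x[has_unit], perl = TRUE)
  )
  out
}

# Total runtime in minutes
runtime_minutes_toy <- 60 * num_before(runtime_toy, "h") + num_before(runtime_toy, "m")
print(runtime_minutes_toy)
```

Handling the hour and minute parts separately means entries with only one of the two units still convert cleanly instead of becoming NA.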
Scraping the metascore runs into missing data.
The selector below returns only 91 metascores, so 9 of the 100 movies have none. Finding those movies by hand is tedious and not reproducible.
# Using CSS selectors to scrape the metascore section
ms_data <- html_nodes(webpage, '.metacritic-score-box') |>
  html_text() |>
  as.integer() |>
  print()
# logical vector indicating if the element is a rank
isrank <- str_detect(rank_and_metascore, "\\.$")
# a rank followed by another rank is a missing metascore
ismissing <- isrank[1:(length(rank_and_metascore) - 1)] & isrank[2:(length(rank_and_metascore))]
# last entry is missing or not
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
# which ranks are missing metascore
missingpos <- as.integer(rank_and_metascore[ismissing])
metascore_data <- rep(NA, 100)
metascore_data[-missingpos] <- ms_data |> print()
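The idea behind this step is that in an interleaved vector of ranks and metascores, a rank immediately followed by another rank (or a rank in the last position) has no metascore. The sketch below replays that logic on a made-up five-movie vector in base R; the _toy names and values are hypothetical, and the interleaved shape is an assumption about rank_and_metascore.

```r
# Toy interleaved vector: each rank is followed by its metascore when one exists
rank_and_metascore_toy <- c("1.", "29", "2.", "72", "3.", "4.", "55", "5.")
ms_toy <- c(29L, 72L, 55L)  # the metascores alone, in page order

# TRUE where the element is a rank (ends with a period)
isrank_toy <- grepl("\\.$", rank_and_metascore_toy)

# A rank lacks a metascore if the next element is also a rank,
# or if it is the last element of the vector
ismissing_toy <- isrank_toy & c(isrank_toy[-1], TRUE)
missingpos_toy <- as.integer(sub("\\.$", "", rank_and_metascore_toy[ismissing_toy]))

# Fill the known metascores into the non-missing positions
metascore_toy <- rep(NA_integer_, 5)
metascore_toy[setdiff(seq_along(metascore_toy), missingpos_toy)] <- ms_toy
print(metascore_toy)
```

Indexing with setdiff() rather than a negative index is a small design choice: when no metascore is missing, a negative index built from an empty vector selects nothing, whereas setdiff() still returns every position.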
# Combining all the lists to form a data frame
movies <- tibble(
  poprank = rank_data, title = title_data,
  description = description_data, runtime = runtime_data,
  filmrating = filmrating_data,
  userrating = userrating_data,
  metascore = metascore_data, votes = votes_data
) |>
  print(width = Inf)
# A tibble: 100 × 8
poprank title
<int> <chr>
1 1 The Postcard Killings
2 2 Promising Young Woman
3 3 365 Days
4 4 Arkansas
5 5 Tenet
6 6 The Nest
7 7 Greyhound
8 8 The Hunt
9 9 Hamilton
10 10 The Dry
description
<chr>
1 A New York detective investigates the death of his daughter who was murdered…
2 A young woman, traumatized by a tragic event in her past, seeks out vengeanc…
3 Massimo is a member of the Sicilian Mafia family and Laura is a sales direct…
4 Kyle and Swin live by the orders of an Arkansas-based drug kingpin named Fro…
5 Armed with only one word, Tenet, and fighting for the survival of the entire…
6 Life for an entrepreneur and his American family begins to take a twisted tu…
7 Several months after the U.S. entry into World War II, an inexperienced U.S.…
8 Twelve strangers wake up in a clearing. They don't know where they are, or h…
9 The real life of one of America's foremost founding fathers and first Secret…
10 Aaron Falk returns to his drought-stricken hometown to attend a tragic funer…
runtime filmrating userrating metascore votes
<dbl> <chr> <dbl> <int> <dbl>
1 104 Not Rated 5.8 29 14773
2 113 R 7.5 72 207750
3 114 TV-MA 3.3 NA 97303
4 117 R 6 55 14677
5 150 PG-13 7.3 69 579532
6 107 R 6.3 80 17658
7 91 PG-13 7 64 111867
8 90 R 6.5 50 126710
9 160 PG-13 8.3 88 110869
10 117 R 6.9 69 31996
# ℹ 90 more rows
Top 5 most popular movies:
movies |>
  slice_min(order_by = poprank, n = 5) |>
  print(width = Inf)
# A tibble: 5 × 8
poprank title
<int> <chr>
1 1 The Postcard Killings
2 2 Promising Young Woman
3 3 365 Days
4 4 Arkansas
5 5 Tenet
description
<chr>
1 A New York detective investigates the death of his daughter who was murdered …
2 A young woman, traumatized by a tragic event in her past, seeks out vengeance…
3 Massimo is a member of the Sicilian Mafia family and Laura is a sales directo…
4 Kyle and Swin live by the orders of an Arkansas-based drug kingpin named Frog…
5 Armed with only one word, Tenet, and fighting for the survival of the entire …
runtime filmrating userrating metascore votes
<dbl> <chr> <dbl> <int> <dbl>
1 104 Not Rated 5.8 29 14773
2 113 R 7.5 72 207750
3 114 TV-MA 3.3 NA 97303
4 117 R 6 55 14677
5 150 PG-13 7.3 69 579532
Top 5 user-rated movies (slice_max() keeps ties by default, so six rows are returned):
movies |>
  slice_max(order_by = userrating, n = 5) |>
  print(width = Inf)
# A tibble: 6 × 8
poprank title
<int> <chr>
1 9 Hamilton
2 15 The Father
3 95 Demon Slayer: Kimetsu no Yaiba - The Movie: Mugen Train
4 13 Soul
5 12 Another Round
6 53 The Trial of the Chicago 7
description
<chr>
1 The real life of one of America's foremost founding fathers and first Secreta…
2 A man refuses all assistance from his daughter as he ages. As he tries to mak…
3 After his family was brutally murdered and his sister turned into a demon, Ta…
4 After landing the gig of a lifetime, a New York jazz pianist suddenly finds h…
5 Four high-school teachers consume alcohol on a daily basis to see how it affe…
6 The story of 7 people on trial stemming from various charges surrounding the …
runtime filmrating userrating metascore votes
<dbl> <chr> <dbl> <int> <dbl>
1 160 PG-13 8.3 88 110869
2 97 PG-13 8.2 88 187570
3 117 TV-MA 8.2 72 70992
4 100 PG 8 83 372388
5 117 Not Rated 7.7 79 190323
6 129 R 7.7 76 191339
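The output has six rows rather than five because two movies share the fifth-highest rating of 7.7 and slice_max() keeps ties by default (with_ties = TRUE). A minimal sketch with made-up data shows the difference; ratings_toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical ratings with a tie at the fifth-highest value
ratings_toy <- tibble(
  title = c("A", "B", "C", "D", "E", "F", "G"),
  userrating = c(8.3, 8.2, 8.2, 8.0, 7.7, 7.7, 7.0)
)

# Default behaviour: rows tied with the fifth-highest value are all kept
top_with_ties <- ratings_toy |>
  slice_max(order_by = userrating, n = 5)

# with_ties = FALSE returns exactly n rows
top_exact <- ratings_toy |>
  slice_max(order_by = userrating, n = 5, with_ties = FALSE)

print(nrow(top_with_ties))
print(nrow(top_exact))
```

Keeping ties is usually the safer default for rankings, since dropping one of two equally rated movies would be arbitrary.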
Top 5 metascores (six rows are returned because slice_max() keeps the tie for fifth place):
movies |>
  slice_max(order_by = metascore, n = 5) |>
  print(width = Inf)
# A tibble: 6 × 8
poprank title
<int> <chr>
1 33 Minari
2 9 Hamilton
3 15 The Father
4 26 Nomadland
5 13 Soul
6 22 Palm Springs
description
<chr>
1 A Korean American family moves to an Arkansas farm in search of its own Ameri…
2 The real life of one of America's foremost founding fathers and first Secreta…
3 A man refuses all assistance from his daughter as he ages. As he tries to mak…
4 A woman in her sixties, after losing everything in the Great Recession, embar…
5 After landing the gig of a lifetime, a New York jazz pianist suddenly finds h…
6 Stuck in a time loop, two wedding guests develop a budding romance while livi…
runtime filmrating userrating metascore votes
<dbl> <chr> <dbl> <int> <dbl>
1 115 PG-13 7.4 89 94833
2 160 PG-13 8.3 88 110869
3 97 PG-13 8.2 88 187570
4 107 R 7.3 88 180085
5 100 PG 8 83 372388
6 90 R 7.4 83 179394
How many top 100 movies are in each film rating category?
movies |> count(filmrating)
# A tibble: 5 × 2
filmrating n
<chr> <int>
1 Not Rated 11
2 PG 12
3 PG-13 25
4 R 42
5 TV-MA 10
# bar plot
ggplot(data = movies) +
  geom_bar(mapping = aes(x = fct_infreq(filmrating))) +
  labs(x = "Film rating", y = "Count")
Is there a relationship between user rating and metascore (the critics' rating)? How can the plot also convey the number of votes? Should we stratify by film rating?
ggplot(data = movies, mapping = aes(x = userrating, y = metascore)) +
  geom_point(mapping = aes(size = votes, color = filmrating)) +
  geom_smooth() +
  labs(y = "Metascore", x = "User rating")
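A numeric summary can complement the plot. The sketch below computes the Pearson correlation on the ten rows printed earlier (values copied from the tibble output), skipping the movie whose metascore is NA. On the full data one would presumably pass movies$userrating and movies$metascore instead; ten rows are far too few to draw conclusions from.

```r
# First ten rows of the movies tibble as printed above
userrating_10 <- c(5.8, 7.5, 3.3, 6, 7.3, 6.3, 7, 6.5, 8.3, 6.9)
metascore_10 <- c(29, 72, NA, 55, 69, 80, 64, 50, 88, 69)

# use = "complete.obs" drops pairs in which either value is missing
r_10 <- cor(userrating_10, metascore_10, use = "complete.obs")
print(r_10)
```

Without the use argument, cor() returns NA as soon as one metascore is missing, which is the case for 9 of the 100 movies here.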