Web scraping

There is a wealth of data on the internet. How can we scrape and analyze it?


rvest is an R package written by Hadley Wickham which makes web scraping easy.
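A minimal sketch of the basic rvest workflow, assuming an inline toy page in place of a real site. The HTML, the class names .rank and .title, and the toy_* variables are all made up for illustration. The sketch parses the page, selects nodes with a CSS selector, and extracts their text. (In rvest 1.0 and later, html_nodes() is also available under the name html_elements().)

```r
library(rvest)

# Hypothetical inline page standing in for a real URL; with a live site
# you would pass the URL to read_html() instead.
toy_page <- read_html('<html><body>
  <div class="item"><span class="rank">1.</span> <a class="title">First Movie</a></div>
  <div class="item"><span class="rank">2.</span> <a class="title">Second Movie</a></div>
</body></html>')

# Select nodes by CSS selector, then extract their text
toy_ranks  <- toy_page %>% html_nodes(".rank") %>% html_text()
toy_titles <- toy_page %>% html_nodes(".item a.title") %>% html_text()

# Drop the trailing period before converting the ranks to integers
toy_ranks <- as.integer(sub("\\.$", "", toy_ranks))

print(toy_ranks)
print(toy_titles)
```

The same three steps (parse, select, extract) are what the examples below apply to a real page.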

Example: Scraping from a webpage


  • Use SelectorGadget to highlight the element we want to scrape

  • Use the CSS selector to get the rankings

    # Use CSS selectors to scrape the rankings section
    (rank_data_html <- html_nodes(webpage, '.text-primary'))
    # Convert the ranking data to text
    (rank_data <- html_text(rank_data_html))
    # Convert the rank strings to integers
    (rank_data <- as.integer(rank_data))
  • Use SelectorGadget to find the CSS selector .lister-item-header a.

    # Using CSS selectors to scrape the title section
    (title_data_html <- html_nodes(webpage, '.lister-item-header a'))
    # Converting the title data to text
    (title_data <- html_text(title_data_html))
  • Scrape the description section:

    # Using CSS selectors to scrape the description section
    (description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))
    # Converting the description data to text
    description_data <- html_text(description_data_html)
    # Take a look at the first few
    head(description_data)
    ## [1] "\nTormented by his past, a garbage man named Clean attempts a quiet life of redemption. But, soon finds himself forced to reconcile with the violence of his past."                                                                     
    ## [2] "\nMassimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
    ## [3] "\nA man refuses all assistance from his daughter as he ages. As he tries to make sense of his changing circumstances, he begins to doubt his loved ones, his own mind and even the fabric of his reality."                              
    ## [4] "\nArmed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
    ## [5] "\nA young woman, traumatized by a tragic event in her past, seeks out vengeance against those who crossed her path."                                                                                                                    
    ## [6] "\nIn 1800s England, a well meaning but selfish young woman meddles in the love lives of her friends."
    # Strip the leading '\n'
    description_data <- str_replace(description_data, "^\\n", "")
    head(description_data)
    ## [1] "Tormented by his past, a garbage man named Clean attempts a quiet life of redemption. But, soon finds himself forced to reconcile with the violence of his past."                                                                     
    ## [2] "Massimo is a member of the Sicilian Mafia family and Laura is a sales director. She does not expect that on a trip to Sicily trying to save her relationship, Massimo will kidnap her and give her 365 days to fall in love with him."
    ## [3] "A man refuses all assistance from his daughter as he ages. As he tries to make sense of his changing circumstances, he begins to doubt his loved ones, his own mind and even the fabric of his reality."                              
    ## [4] "Armed with only one word, Tenet, and fighting for the survival of the entire world, a Protagonist journeys through a twilight world of international espionage on a mission that will unfold in something beyond real time."          
    ## [5] "A young woman, traumatized by a tragic event in her past, seeks out vengeance against those who crossed her path."                                                                                                                    
    ## [6] "In 1800s England, a well meaning but selfish young woman meddles in the love lives of her friends."


  • Retrieve runtime data:

    # Using CSS selectors to scrape the movie runtime section
    (runtime_data <- webpage %>%
      html_nodes('.runtime') %>%
      html_text() %>%
      # Drop the " min" suffix and convert to integer
      str_replace(" min", "") %>%
      as.integer())
  • Collect the (first) genre of each movie:

    genre_data <- webpage %>%
      # Using CSS selectors to scrape the movie genre section
      html_nodes('.genre') %>%
      # Converting the genre data to text
      html_text() %>%
      # Data preprocessing: retrieve the first word
      str_extract("[[:alpha:]]+")
  • Rating data:

    rating_data <- webpage %>%
      html_nodes('.ratings-imdb-rating strong') %>%
      html_text() %>%
      # Convert the rating text to numeric
      as.numeric()
  • Vote data:

    votes_data <- webpage %>%
      html_nodes('.sort-num_votes-visible span:nth-child(2)') %>%
      html_text() %>%
      # Remove every thousands separator, then convert to numeric
      str_replace_all(",", "") %>%
      as.numeric()
  • Director information:

    directors_data <- webpage %>%
      html_nodes('.text-muted+ p a:nth-child(1)') %>%
      # Convert the director nodes to text
      html_text()
  • Only the first actor:

    actors_data <- webpage %>%
      html_nodes('.lister-item-content .ghost+ a') %>%
      # Convert the actor nodes to text
      html_text()
  • We run into missing data when scraping the metascore.

  • Only 90 metascores come back, so 10 movies don’t have one. We could find those movies by hand, but that’s tedious and not reproducible.

    # Using CSS selectors to scrape the metascore section
    ms_data_html <- html_nodes(webpage, '.metascore')
    # Converting the metascore data to text
    ms_data <- html_text(ms_data_html)
    # Strip trailing whitespace and convert to integer
    ms_data <- str_replace(ms_data, "\\s*$", "") %>% as.integer()
    # Check how many metascores were found
    length(ms_data)
  • First, let’s tally each rank together with its metascore (if present).

    rank_and_metascore <- webpage %>%
      html_nodes('.unfavorable , .text-primary , .favorable , .mixed') %>%
      html_text() %>%
      # Strip trailing whitespace
      str_replace("\\s*$", "")
  • Be careful with missing data.

    # Using CSS selectors to scrape the gross revenue section
    gross_data_html <- html_nodes(webpage, '.ghost~ .text-muted+ span')
    # Converting the gross revenue data to text
    gross_data <- html_text(gross_data_html)
    # Data preprocessing: remove the trailing 'M' and the leading '$'
    gross_data <- str_replace(gross_data, "M", "")
    gross_data <- str_sub(gross_data, 2, 10)
    #(gross_data <- str_extract(gross_data, "[:digit:]+\\.[:digit:]+"))
    gross_data <- as.numeric(gross_data)
    # Let's check the length of gross data
    length(gross_data)

  • 85 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.

    (rank_and_gross <- webpage %>%
      # retrieve rank and gross
      html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
      html_text() %>%
      str_replace("\\s+", "") %>%
      str_replace_all("[$M]", ""))
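
One reproducible way to locate the missing entries is to work from the interleaved vectors built above (rank_and_metascore and rank_and_gross): a rank that is immediately followed by another rank, or that ends the vector, has no value. The sketch below assumes that every rank string ends in a period (for example "1.") and that no value does. align_by_rank() is a hypothetical helper written for this illustration, not an rvest function, and the toy vector is invented.

```r
# Hypothetical helper: given an interleaved character vector of ranks
# ("1.", "2.", ...) and values, return one value per rank, NA if absent.
align_by_rank <- function(rank_and_value) {
  x <- trimws(rank_and_value)
  # Assumption: rank entries end in a period, values do not
  is_rank <- grepl("\\.$", x)
  # A rank has a value only if the next entry exists and is not a rank
  has_value <- is_rank & c(!is_rank[-1], FALSE)
  out <- rep(NA_character_, sum(is_rank))
  out[has_value[is_rank]] <- x[which(has_value) + 1]
  out
}

# Toy input: ranks 2 and 4 have no value
toy <- c("1.", "71", "2.", "3.", "59", "4.")
toy_aligned <- as.integer(align_by_rank(toy))
print(toy_aligned)
```

Under the same assumption, as.integer(align_by_rank(rank_and_metascore)) would give one metascore per movie, with NA where it is missing, and as.numeric(align_by_rank(rank_and_gross)) would do the same for gross earnings. Both results have one entry per movie, so they line up with the other columns.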
Visualizing movie data

  • Form a tibble:

    # Combining all the vectors to form a data frame
    movies <- tibble(Rank = rank_data, 
                     Title = title_data,
                     Description = description_data, 
                     Runtime = runtime_data,
                     Genre = genre_data, 
                     Rating = rating_data,
                     Metascore = metascore_data, 
                     Votes = votes_data,
                     Gross_Earning_in_Mil = gross_data,
                     Director = directors_data, 
                     Actor = actors_data)
    movies %>% print(width=Inf)
  • How many top 100 movies are in each genre? (Be careful with interpretation.)

    movies %>%
      ggplot() +
      geom_bar(mapping = aes(x = Genre))

  • Which genre is most profitable in terms of average gross earnings?

    movies %>%
      group_by(Genre) %>%
      summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm = TRUE)) %>%
      ggplot() +
        geom_col(mapping = aes(x = Genre, y = avg_earning)) + 
        labs(y = "avg earning in millions")
    ## Warning: Removed 6 rows containing missing values (position_stack).

    ggplot(data = movies) +
      geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) + 
      labs(y = "Gross earning in millions")
    ## Warning: Removed 85 rows containing non-finite values (stat_boxplot).

  • Is there a relationship between gross earnings and rating? First, find the best-selling movie (by gross earnings) in each genre:

    (best_in_genre <- movies %>%
        group_by(Genre) %>%
        filter(row_number(desc(Gross_Earning_in_Mil)) == 1)) %>%
        print(width = Inf)
    ## # A tibble: 4 × 11
    ## # Groups:   Genre [4]
    ##    Rank Title                
    ##   <int> <chr>                
    ## 1     8 A Quiet Place Part II
    ## 2    13 Dolittle             
    ## 3    57 Bad Boys for Life    
    ## 4    66 Onward               
    ##   Description                                                                   
    ##   <chr>                                                                         
    ## 1 Following the events at home, the Abbott family now face the terrors of the o…
    ## 2 A physician who can talk to animals embarks on an adventure to find a legenda…
    ## 3 Miami detectives Mike Lowrey and Marcus Burnett must face off against a mothe…
    ## 4 Two elven brothers embark on a quest to bring their father back for one day.  
    ##   Runtime Genre     Rating Metascore  Votes Gross_Earning_in_Mil Director      
    ##     <int> <chr>      <dbl>     <int>  <dbl>                <dbl> <chr>         
    ## 1      97 Drama        7.3        71 194104                160.  John Krasinski
    ## 2     101 Adventure    5.6        26  59472                 77.0 Stephen Gaghan
    ## 3     124 Action       6.5        59 151720                206.  Adil El Arbi  
    ## 4     102 Animation    7.4        61 134889                 61.6 Dan Scanlon   
    ##   Actor            
    ##   <chr>            
    ## 1 Emily Blunt      
    ## 2 Robert Downey Jr.
    ## 3 Will Smith       
    ## 4 Tom Holland
    ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
      geom_point(mapping = aes(size = Votes, color = Genre)) + 
      ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
      labs(y = "Gross earning in millions")
    ## Warning: Removed 85 rows containing missing values (geom_point).

Example: Scraping finance data

Example: Pull tweets into R

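    library(twitteR)   # Twitter API client
    library(tidyverse) # for as_tibble() and %>%
    # Replace the placeholders with your own API credentials
    consumer_key <- 'XXXXXXXXXX'
    consumer_secret <- 'XXXXXXXXXX'
    access_token <- 'XXXXXXXXXX'
    access_secret <- 'XXXXXXXXXX'
    # Authenticate with the Twitter API
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
    # Search for up to 1000 matching tweets posted since 2020-01-01
    virus <- searchTwitter('#China + #Coronavirus', 
                           n = 1000, 
                           since = '2020-01-01', 
                           retryOnRateLimit = 1e3)
    # Convert the list of tweets to a tibble
    virus_df <- as_tibble(twListToDF(virus))
    virus_df %>% print(width = Inf)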