You can download this .qmd file from here. Just hit the Download Raw File button.
Credit to Brianna Heggeseth and Leslie Myint from Macalester College for a few of these descriptions and examples.
Using rvest for web scraping
Please see 08_table_scraping.qmd for a preview of web scraping techniques when no API exists, along with ethical considerations when scraping data. In this file, we will turn to scenarios when the webpage contains data of interest, but it is not already in table form.
Recall the four steps to scraping data with functions in the rvest library (a compact sketch follows this list):
robotstxt::paths_allowed() Check if the website allows scraping, and then make sure we scrape “politely”
read_html(). Input the URL containing the data and turn the html code into an XML file (another markup format that’s easier to work with).
html_nodes(). Extract specific nodes from the XML file by using the CSS path that leads to the content of interest. (Use css = "table" for tables.)
html_text(). Extract content of interest from nodes. Might also use html_table() etc.
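As a preview, here is the whole pipeline in miniature. This is only a sketch: the URL and the .some-class selector are placeholders, not a real page.

```{r}
library(rvest)

robotstxt::paths_allowed("https://example.com/data")  # Step 0: check permission
page <- read_html("https://example.com/data")         # Step 1: download and parse
nodes <- html_nodes(page, ".some-class")              # Step 2: select nodes
html_text(nodes)                                      # Step 3: extract text
```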
More scraping ethics
robots.txt
robots.txt is a file that some websites publish to clarify what can and cannot be scraped, along with other constraints on scraping. When a website publishes this file, we need to comply with the information in it for moral and legal reasons.
From our investigation of the NIH robots.txt (the key entries are sketched after this list), we learn:
User-agent: *: Anyone is allowed to scrape
Crawl-delay: 2: Need to wait 2 seconds between each page scraped
No Visit-time entry: no restrictions on time of day that scraping is allowed
No Request-rate entry: no restrictions on simultaneous requests
No mention of ?page=, news-events, news-releases, or https://science.education.nih.gov/ in the Disallow sections. (This is what we want to scrape today.)
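Putting those entries together, the pertinent lines of a robots.txt file like NIH's look roughly like this (a sketch, not a verbatim copy of the file; the Disallow path is a placeholder):

```
User-agent: *
Crawl-delay: 2
Disallow: /some-path-that-cannot-be-scraped/
```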
robotstxt package
We can also use functions from the robotstxt package, which was built to download and parse robots.txt files (more info). Specifically, the paths_allowed() function can check if a bot has permission to access certain pages.
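For instance, here is a sketch of checking the page we will scrape below; paths_allowed() also accepts a bot argument, which defaults to "*" (any bot):

```{r}
library(robotstxt)

paths_allowed(
  paths = "/news-events/news-releases",
  domain = "www.nih.gov"
)
```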
A timeout to preview some technical ideas
HTML structure
HTML (hypertext markup language) is the formatting language used to create webpages. We can see the core parts of HTML from the rvest vignette.
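As a small illustration (a made-up page in the same spirit as the vignette's example), an HTML document nests tagged elements inside one another, and tags can carry attributes such as class and id:

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id="mainTitle">A heading</h1>
  <p class="description">Some text with <a href="https://example.com">a link</a>.</p>
</body>
</html>
```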
Finding CSS Selectors
In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the NIH News Releases page, we can see that the data is represented in a consistent pattern of image + title + abstract.
We will identify data in a web page using a pattern matching language called CSS Selectors that can refer to specific patterns in HTML, the language used to write web pages.
For example:
Selecting by tag:
"a" selects all hyperlinks in a webpage (“a” represents “anchor” links in HTML)
"p" selects all paragraph elements
Selecting by ID and class:
".description" selects all elements with class equal to “description”
The . at the beginning is what signifies class selection.
This is one of the most common CSS selectors for scraping because in HTML, the class attribute is extremely commonly used to format webpage elements. (Any number of HTML elements can have the same class, which is not true for the id attribute.)
"#mainTitle" selects the SINGLE element with id equal to “mainTitle”
The # at the beginning is what signifies id selection.
For example, the selector ".title" would match two elements in the following HTML, and ".description" would match the other two:

```html
<p class="title">Title of resource 1</p>
<p class="description">Description of resource 1</p>
<p class="title">Title of resource 2</p>
<p class="description">Description of resource 2</p>
```
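To see class selection in action, here is a small sketch that parses that snippet with rvest's minimal_html() helper and pulls out just the descriptions:

```{r}
library(rvest)

snippet <- minimal_html('
  <p class="title">Title of resource 1</p>
  <p class="description">Description of resource 1</p>
  <p class="title">Title of resource 2</p>
  <p class="description">Description of resource 2</p>
')

snippet |>
  html_nodes(".description") |>
  html_text()
#> [1] "Description of resource 1" "Description of resource 2"
```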
Warning: Websites change often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website. Otherwise, you may return after some time and your scraping code will include all of the wrong CSS selectors.
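One way to keep such a copy (a sketch; the file name is arbitrary):

```{r}
# Save a date-stamped copy of the page so selectors can be re-checked later
download.file(
  "https://www.nih.gov/news-events/news-releases",
  destfile = str_c("nih_news_releases_", as.character(Sys.Date()), ".html")
)
```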
SelectorGadget is a point-and-click tool for identifying CSS selectors. There is a version available for Chrome; add it to Chrome via the Chrome Web Store.
Make sure to pin the extension to the menu bar. (Click the 3 dots > Extensions > Manage extensions. Click the “Details” button under SelectorGadget and toggle the “Pin to toolbar” option.)
There is also a version that can be saved as a bookmark in the browser; see here.
Our goal is to build a data frame with the article title, publication date, and abstract text for the 50 most recent NIH news releases.
Head over to the NIH News Releases page. Click the Selector Gadget extension icon or bookmark button. As you mouse over the webpage, different parts will be highlighted in orange. Click on the title (but not the live link portion!) of the first news release. You’ll notice that the Selector Gadget information in the lower right describes what you clicked on. (If SelectorGadget ever highlights too much in green, you can click on portions that you do not want to turn them red.)
Scroll through the page to verify that only the information you intend (the title of each news release) is selected. The selector panel shows the CSS selector (.teaser-title) and the number of matches for that CSS selector (10). (You may have to be careful with your clicking; there are two overlapping boxes, and clicking on the live link portion of the title can lead to the CSS selector of "a".)
[Pause to Ponder:] Repeat the process above to find the correct selectors for the following fields. Make sure that each matches 10 results:
The publication date
.date-display-single
The article abstract paragraph (which will also include the publication date)
.teaser-description
Retrieving Data Using rvest and CSS Selectors
Now that we have identified CSS selectors for the information we need, let’s fetch the data using the rvest package similarly to our approach in 08_table_scraping.qmd.
```{r}
# check that scraping is allowed (Step 0)
robotstxt::paths_allowed("https://www.nih.gov/news-events/news-releases")
```
www.nih.gov
[1] TRUE
```{r}
# Step 1: Download the HTML and turn it into an XML file with read_html()
nih <- read_html("https://www.nih.gov/news-events/news-releases")
```
Finding the exact node (e.g. “.teaser-title”) is the tricky part. Among all the html code used to produce a webpage, where do you go to grab the content of interest? This is where SelectorGadget comes to the rescue!
```{r}
# Step 2: Extract specific nodes with html_nodes()
title_temp <- html_nodes(nih, ".teaser-title")
title_temp
```
```{r}
# Step 3: Extract content from nodes with html_text(), html_name(),
#   html_attrs(), html_children(), html_table(), etc.
# Usually will still need to do some stringr adjustments
title_vec <- html_text(title_temp)
title_vec
```
[1] "Many genes in male and female placentas expressed differently"
[2] "HHS, NIH launch next-generation universal vaccine platform for pandemic-prone viruses"
[3] "NIH to prioritize human-based research technologies"
[4] "NIH study reveals how inflammation makes touch painful"
[5] "NIH researchers supercharge ordinary clinical device to get a better look at the back of the eye"
[6] "Annual Report to the Nation: Cancer deaths continue to decline"
[7] "Repurposing a blood pressure drug may prevent vision loss in inherited blinding diseases"
[8] "Scientists map unprecedented detail of connections and visual perception in the mouse brain"
[9] "Twins grow more slowly in early pregnancy than previously thought"
[10] "AI screening for opioid use disorder associated with fewer hospital readmissions"
And finally we wrap the four steps above into the bow() and scrape() functions from the polite package:
```{r}
session <- bow("https://www.nih.gov/news-events/news-releases", force = TRUE)

nih_title <- scrape(session) |>
  html_nodes(".teaser-title") |>
  html_text()
nih_title
```
[1] "Many genes in male and female placentas expressed differently"
[2] "HHS, NIH launch next-generation universal vaccine platform for pandemic-prone viruses"
[3] "NIH to prioritize human-based research technologies"
[4] "NIH study reveals how inflammation makes touch painful"
[5] "NIH researchers supercharge ordinary clinical device to get a better look at the back of the eye"
[6] "Annual Report to the Nation: Cancer deaths continue to decline"
[7] "Repurposing a blood pressure drug may prevent vision loss in inherited blinding diseases"
[8] "Scientists map unprecedented detail of connections and visual perception in the mouse brain"
[9] "Twins grow more slowly in early pregnancy than previously thought"
[10] "AI screening for opioid use disorder associated with fewer hospital readmissions"
Putting multiple columns of data together
Now repeat the process above to extract the publication date and the abstract.
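Since the tibble built below expects vectors named nih_pubdate and nih_description, here is a sketch of those two extractions, reusing the selectors we found with SelectorGadget:

```{r}
nih_pubdate <- nih |>
  html_nodes(".date-display-single") |>
  html_text()

nih_description <- nih |>
  html_nodes(".teaser-description") |>
  html_text()
nih_description
```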
[1] "May 1, 2025 — \n NIH findings may lead to insights on pregnancy complications, adult health.\r\n\r\n "
[2] "May 1, 2025 — \n These vaccines aim to provide broad-spectrum protection against multiple strains of multiple viruses. "
[3] "April 29, 2025 — \n New initiative aims to reduce use of animals in NIH-funded research. "
[4] "April 23, 2025 — \n Researchers uncover the cellular and molecular basis for sensing heat and inflammatory pain. "
[5] "April 23, 2025 — \n New technique brings retina into sharper focus\r\n\r\n. "
[6] "April 21, 2025 — \n Overall death rates from cancer declined steadily among both men and women from 2001 through 2022. "
[7] "April 15, 2025 — \n NIH studies in animals show reserpine protects retinal-neurons necessary for vision, especially in females. "
[8] "April 9, 2025 — \n NIH-funded project helps unravel the brain’s wiring, giving clues to how we see the world. "
[9] "April 7, 2025 — \n NIH findings could lead to more efficient monitoring of twin pregnancies. "
[10] "April 3, 2025 — \n NIH-supported clinical trial shows AI tool as effective as healthcare providers in generating referrals to addiction specialists. "
Combine these extracted variables into a single tibble. Make sure the variables are formatted correctly - e.g. pubdate has date type, description does not contain the pubdate, etc.
```{r}
# use tibble() to put multiple columns together into a tibble
nih_top10 <- tibble(
  title = nih_title,
  pubdate = nih_pubdate,
  description = nih_description
)
nih_top10
```
# A tibble: 10 × 3
title pubdate description
<chr> <chr> <chr>
1 Many genes in male and female placentas expressed differ… May 1,… "May 1, 20…
2 HHS, NIH launch next-generation universal vaccine platfo… May 1,… "May 1, 20…
3 NIH to prioritize human-based research technologies April … "April 29,…
4 NIH study reveals how inflammation makes touch painful April … "April 23,…
5 NIH researchers supercharge ordinary clinical device to … April … "April 23,…
6 Annual Report to the Nation: Cancer deaths continue to d… April … "April 21,…
7 Repurposing a blood pressure drug may prevent vision los… April … "April 15,…
8 Scientists map unprecedented detail of connections and v… April … "April 9, …
9 Twins grow more slowly in early pregnancy than previousl… April … "April 7, …
10 AI screening for opioid use disorder associated with few… April … "April 3, …
```{r}
# now clean the data: mdy() from lubridate parses dates like "May 1, 2025",
# and the regex strips the date prefix (everything up through the first
# newline) from the description before trimming whitespace
nih_top10 <- nih_top10 |>
  mutate(
    pubdate = mdy(pubdate),
    description = str_trim(str_replace(description, ".*\\n", ""))
  )
nih_top10
```
# A tibble: 10 × 3
title pubdate description
<chr> <date> <chr>
1 Many genes in male and female placentas expressed dif… 2025-05-01 "NIH findi…
2 HHS, NIH launch next-generation universal vaccine pla… 2025-05-01 "These vac…
3 NIH to prioritize human-based research technologies 2025-04-29 "New initi…
4 NIH study reveals how inflammation makes touch painful 2025-04-23 "Researche…
5 NIH researchers supercharge ordinary clinical device … 2025-04-23 "New techn…
6 Annual Report to the Nation: Cancer deaths continue t… 2025-04-21 "Overall d…
7 Repurposing a blood pressure drug may prevent vision … 2025-04-15 "NIH studi…
8 Scientists map unprecedented detail of connections an… 2025-04-09 "NIH-funde…
9 Twins grow more slowly in early pregnancy than previo… 2025-04-07 "NIH findi…
10 AI screening for opioid use disorder associated with … 2025-04-03 "NIH-suppo…
NOW - continue this process to build a tibble with the most recent 50 NIH news releases, which will require that you iterate over 5 webpages! You should write at least one function, and you will need iteration; use both a for loop and appropriate map_() functions from purrr. Some additional hints:
Mouse over the page buttons at the very bottom of the news home page to see what the URLs look like.
Include Sys.sleep(2) in your function to respect the Crawl-delay: 2 in the NIH robots.txt file.
Recall that bind_rows() from dplyr takes a list of data frames and stacks them on top of each other.
[Pause to Ponder:] Create a function to scrape a single NIH press release page by filling missing pieces labeled ???:
```{r}
# Helper function to reduce html_nodes() |> html_text() code duplication
get_text_from_page <- function(page, css_selector) {
  page |>
    html_nodes(css_selector) |>
    html_text()
}

# Main function to scrape and tidy desired attributes
scrape_page <- function(url) {
  Sys.sleep(2)
  page <- read_html(url)
  article_titles <- get_text_from_page(page, ".teaser-title")
  article_dates <- get_text_from_page(page, ".date-display-single")
  article_dates <- mdy(article_dates)
  article_description <- get_text_from_page(page, ".teaser-description")
  article_description <- str_trim(
    str_replace(article_description, ".*\\n", "")
  )
  tibble(
    title = article_titles,
    pubdate = article_dates,
    description = article_description
  )
}

scrape_page("https://www.nih.gov/news-events/news-releases")
```
[Pause to Ponder:] Use a for loop over the first 5 pages:
"https://www.nih.gov/news-events/news-releases?2025&page=1&1="pages <-vector("list", length =6)pos <-0# for (i in 0:4) {# url <- str_c("https://www.nih.gov/news-events/news-releases?page=", i)# pages[[i + 1]] <- scrape_page(url)# }for (i in2025:2024) {for (j in0:2) { pos <- pos +1 url <-str_c("https://www.nih.gov/news-events/news-releases?", i,"&page=", j, "&1=") pages[[pos]] <-scrape_page(url) }}df_articles <-bind_rows(pages)head(df_articles)
[Pause to Ponder:] Use map functions in the purrr package:
```{r}
# Create a character vector of URLs for the first 5 pages
# (one possible completion of the ??? blanks, using the ?page= pattern
# from the commented-out loop above)
base_url <- "https://www.nih.gov/news-events/news-releases"
urls_all_pages <- c(base_url, str_c(base_url, "?page=", 1:4))

pages2 <- purrr::map(urls_all_pages, scrape_page)
df_articles2 <- bind_rows(pages2)
head(df_articles2)
```
On Your Own
Go to https://www.bestplaces.net and search for Minneapolis, Minnesota. This is a site some people use when comparing cities they might consider working in and/or moving to. Using SelectorGadget, extract the following pieces of information from the Minneapolis page:
property crime (on a scale from 0 to 100)
minimum income required for a single person to live comfortably
average monthly rent for a 2-bedroom apartment
the “about” paragraph (the very first paragraph above “Location Details”)
Write a function called scrape_bestplaces() with arguments for state and city. When you run, for example, scrape_bestplaces("minnesota", "minneapolis"), the output should be a 1 x 6 tibble with columns for state, city, crime, min_income_single, rent_2br, and about.
Create a 5 x 6 tibble by running scrape_bestplaces() 5 times with 5 cities you are interested in. You might have to combine tibbles using bind_rows(). Be sure you look at the URL at bestplaces.net for the various cities to make sure it works as you expect. For bonus points, create the same 5 x 6 tibble for the same 5 cities using purrr::map2!
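To get started, here is a skeleton, not a full solution: the URL pattern is an assumption you should verify in your browser, and the CSS selectors are left as ??? placeholders to fill in with SelectorGadget (this reuses get_text_from_page() from earlier):

```{r}
scrape_bestplaces <- function(state, city) {
  Sys.sleep(2)  # pause politely between requests
  url <- str_c("https://www.bestplaces.net/city/", state, "/", city)  # assumed URL pattern
  page <- read_html(url)
  tibble(
    state = state,
    city = city,
    crime = get_text_from_page(page, "???"),
    min_income_single = get_text_from_page(page, "???"),
    rent_2br = get_text_from_page(page, "???"),
    about = get_text_from_page(page, "???")
  )
}

# Bonus: map2() iterates over paired vectors of states and cities
# purrr::map2(states, cities, scrape_bestplaces) |> bind_rows()
```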