Mini-Project 2: Data Acquisition
Overview
In this project, you will find data on the web, acquire it through API or web scraping techniques, and create a tidy tibble suitable for future analysis
Groups
- Shen and Tenzin
- Gwynnie and Aria
- Ja Seng and Sujita
- Rachael and Olivia
- Morgan, Solveig, and Ela
- Parker and Josh
- Jean-Luc and Jordan
- Daniel and Cathal
Timeline
| Tentative due date | Points | ||
|---|---|---|---|
| Stage I | GitHub Collaboration | Thurs, Mar 13 | 10 |
| Stage II | Project Submission | Friday, Mar 21 | 90 |
| 100 |
Notes: a) we will not be doing an oral assessment piece, but we will be doing partner evaluations b) there will be some time provided in class, but you will be expected to work outside of class time with your partner as well
Stage I: GitHub Collaboration
You and your partner must (a) set up a GitHub repository where one person is the Owner and the other is a Collaborator, (b) connect the GitHub repository to an R project that you edit in your own RStudio on your laptop, and (c) certify that you and your partner can handle merge conflicts, should they arise.
To obtain your “certification” in (c), follow the instructions here while sitting with your partner. Read the entire Chapter 10, and then specifically follow Steps 1-6 in Section 10.2 (don’t worry about the Exercise box that follows Step 6) and then follow Steps 1-11 in Section 10.4 (again don’t worry about the Exercise box that follows Step 11). To receive your “certification”, you will upload screenshots as in Step 11 but with details from you and your partner.
The following notes may help as your working through the instructions in Chapter 10:
- You can enter
git config pull.rebase falsein the Terminal tab in RStudio - The Owner needs to check “set up ReadMe” when creating the initial repo
- Both the Owner and Collaborator need to connect to GitHub through RStudio (using File > New Project > Version Control) after the Owner sets up the initial GitHub repo. Keep checking that RStudio is the same at various stages.
- Before Step 4 (in Part 1), the Owner may have to remove the .Rproj and .gitignore folders, and they may also have to reload GitHub
- In Step 8 (Part 2), click Stage even though it looks like the blue box is filled
Stage II: Project Submission
For this project, you and your partner will find data on the web that’s not available as a neat .csv file or .xls spreadsheet (it maybe even be spread across multiple webpages), and you will acquire it using techniques from class and tidy it so it’s suitable for a future data analysis (including text analysis!).
You must use one (or multiple) of these approaches to data acquisition:
- APIs (using httr or httr2)
- An API wrapper package (that requires an API key)
- Harvesting data that appears in table form on a webpage (using rvest with html_table)
- Harvesting pieces of data that could appear anywhere on a webpage (using rvest with html_text)
Before you begin acquiring data, describe your motivations for obtaining the data you plan to gather, including questions you hope to answer. After collecting and tidying your data, describe how you might use your data to investigate questions of interest.
Your quarto file should contain at least one custom function and possibly iteration techniques as well. Be sure that it is well commented and that it conforms to style guidelines.
Submission and Rubric
Mini-Project 2 must be submitted on Moodle by 11:00 PM on Fri, Mar 21. You should submit a rendered pdf.
Check out this rubric for Mini-Project 2.