https://github.com/dcs-training/web-scraping-r-2024
# Scraping Websites with R
21 October 2024

A workshop covering the basics of collecting data from the Internet using web scraping in R.
## Getting ready
You'll need to install the SelectorGadget extension for the Chrome browser and make sure you have access to R/RStudio. You can use R/RStudio on the University's Noteable service, which requires no installation, or you can install R/RStudio locally on your computer.

### Installing SelectorGadget
1. Open the Chrome browser. If you don't have Chrome, you can install it [here](https://www.google.com/intl/en_uk/chrome/dr/download/?brand=GBSK&ds_kid=43700079594422197&gad_source=1&gclid=CjwKCAjw68K4BhAuEiwAylp3kjjhQbHyA1O0gttd5L0f0a48baOHb_zDWgZlWXotXEhTdnld7o1szRoCRSoQAvD_BwE&gclsrc=aw.ds).
2. Log in to the Chrome browser with your Google account. If you don't have a Google account, you'll need to sign up for one.
3. Follow [this link](https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en) to download the SelectorGadget browser extension.
4. Make sure SelectorGadget is enabled in your Chrome extension toolbar: click the 'Customize and control Google Chrome' icon in the top right-hand corner of your browser window, mouse over 'Extensions' in the dropdown menu, and select 'Manage Extensions'. Make sure SelectorGadget is toggled 'on'.

### Accessing R on Noteable
1. Open the login page [here](https://noteable.edina.ac.uk/login).
2. Log in with your EASE credentials.
3. Select RStudio as a personal notebook server and press start.
4. Go to File > New Project > Version Control > Git.
5. Copy and paste this repository URL, [https://github.com/DCS-training/web-scraping-R-2024](https://github.com/DCS-training/web-scraping-R-2024), into the Repository URL field. (The project directory name will be filled in automatically, but you can change it if you want your folder in Noteable to have a different name.)
6. Decide where you'd like to store the folder. By default, it will be located in your home directory.
7. Select 'Create Project'.
Congratulations! You have now pulled the content of this repository to your Noteable server space.

### Installing R/RStudio locally
1. Go to [https://www.r-project.org/](https://www.r-project.org/)
2. Go to the download link
3. Choose the closest CRAN mirror (if you're in Edinburgh, select either Bristol or Imperial College London)
4. Download the version compatible with your operating system (Windows/Mac/Linux)
   - For Windows: click on 'install R for the first time', then download R for Windows and follow the installer. If you get stuck, follow this [video tutorial](https://www.youtube.com/watch?v=GAGUDL-4aVw).
   - For Mac: download the most recent pkg file and follow the installer. If you get stuck, follow this [video tutorial](https://www.youtube.com/watch?v=EmZqlcKkJMM).
5. Once R is installed, you can install RStudio (an interface for R)
6. Go to [www.rstudio.com](https://www.rstudio.com)
7. Find the download page
8. Download the version compatible with your operating system

## Overview
This intermediate workshop will teach you how to scrape data from the Internet using R. We will start with a theoretical introduction to web scraping and specific approaches to scraping static websites with a focus on HTML tags. Then, we will practice using rvest to scrape data from a webpage. Finally, we will discuss methods for scraping information from webpages with a more complex structure.

This is an intermediate-level course. Students must have a basic background in R. This includes understanding the basic data types in R; how to install and load packages; and how to use functions, pipes, and apply/map functions. Alternatively, the 'Introduction to Programming with R and RStudio' course counts as a prerequisite.
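To give a feel for what scraping a static page with rvest looks like, here is a minimal sketch. It is not taken from the workshop materials: the HTML snippet, the class names (`book`, `year`), and the variable names are all invented for illustration, and the page is defined inline so the code runs without a network connection.

```r
# install.packages("rvest")  # run once if rvest is not installed
library(rvest)

# A made-up page (assumption: not a real website), built inline with
# minimal_html() so the example does not depend on any external site.
page <- minimal_html('
  <h1>Example reading list</h1>
  <ul>
    <li class="book"><a href="/books/a">Example Book A</a> <span class="year">2020</span></li>
    <li class="book"><a href="/books/b">Example Book B</a> <span class="year">2021</span></li>
  </ul>
')

# Select every <li> element with class "book" using a CSS selector.
books <- page %>% html_elements("li.book")

# Within each selected item, pull out the link text, the link target, and the year.
titles <- books %>% html_element("a") %>% html_text2()
links  <- books %>% html_element("a") %>% html_attr("href")
years  <- books %>% html_element(".year") %>% html_text2() %>% as.integer()

# Combine the pieces into a data frame, one row per list item.
reading_list <- data.frame(title = titles, link = links, year = years)
print(reading_list)
```

For a live page, the usual pattern is to replace the `minimal_html(...)` call with `read_html()` and the page's URL, then adjust the CSS selectors to match that page's structure.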
## Schedule
14:00-15:00
- Introduction & housekeeping
- Review of web scraping, HTML/CSS, and webpage structure
- Hands-on tutorial: scraping a static webpage

Break

15:10-16:00
- Hands-on tutorial: scraping content using an API
- Additional methods for web scraping and further resources

## Additional resources
- [Reddit community](https://www.reddit.com/r/webscraping/) focused on web scraping
- Programming Historian, [Fetching and Parsing Data from the Web with OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine) and [Automated Downloading with Wget](https://programminghistorian.org/en/lessons/automated-downloading-with-wget)
- University of Southampton [Web Data Research Assistant](https://www.southampton.ac.uk/~lac/WebDataResearchAssistant/)
- [Public APIs](https://github.com/public-apis/public-apis) GitHub repo
- [Tutorial for browser automation](https://www.youtube.com/watch?v=HpL6EX2kjq4) in Python using Selenium
## License
All material collected here is free to use, but it is covered by a [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.