# Web Scraping With R
[Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112) [Discord](https://discord.gg/GbxmdGhZjq) [R](https://github.com/topics/r) [Web Scraping](https://github.com/topics/web-scraping)
- [Installing requirements](#installing-requirements)
- [Web scraping with rvest](#web-scraping-with-rvest)
- [Web scraping with RSelenium](#web-scraping-with-rselenium)

This tutorial covers the basics of web scraping with R. We'll begin by scraping static pages and then shift focus to techniques for scraping data from dynamic websites that use JavaScript to render their content.
For a detailed explanation, see [this blog post](https://oxy.yt/1r8m).
## Installing requirements
For macOS, run the following:
```shell
brew install r
brew install --cask r-studio
```
For Windows, run the following:
```batch
choco install r.project
choco install r.studio
```

### Installing required libraries
```R
install.packages("rvest")
install.packages("dplyr")
```

## Web scraping with rvest
```R
library(rvest)

# read the Wikipedia page listing ISO 3166 country codes
link <- "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"
page <- read_html(link)
```
### Parsing HTML Content
```R
# select elements with a CSS selector
page %>% html_elements(css = "")

# select elements with an XPath expression
page %>% html_elements(xpath = "")
```
For the above page, use the following:
```R
htmlElement <- page %>% html_element("table.sortable")
```

### Saving data to a data frame
```R
df <- html_table(htmlElement, header = FALSE)

# the first two rows hold header information: use the second row as
# column names, then drop both from the data
names(df) <- df[2,]
df <- df[-1:-2,]
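
# preview the cleaned result (an optional check, not part of the original code)
head(df)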
```

### Exporting data frame to a CSV file
```R
write.csv(df, "iso_codes.csv")
```

### Downloading Images
```R
# hypothetical source page; any page containing the target image works
url <- "https://en.wikipedia.org/wiki/Paris"
page <- read_html(url)
image_element <- page %>% html_element(".thumbborder")
image_url <- image_element %>% html_attr("src")
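
# Wikipedia image src attributes are often protocol-relative
# ("//upload.wikimedia.org/..."); this guard, an addition to the original
# code, prepends a scheme so download.file() gets a complete URL
if (startsWith(image_url, "//")) {
  image_url <- paste0("https:", image_url)
}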

# mode = "wb" keeps the binary download intact on Windows
download.file(image_url, destfile = "paris.jpg", mode = "wb")
```

### Scrape Dynamic Pages with Rvest
Find the API endpoint and use it as follows:
```R
library(httr)

# api_url is the JSON endpoint discovered in the browser's network tab;
# GET() and timeout() come from httr
page <- read_html(GET(api_url, timeout(10)))
jsontext <- page %>% html_element("p") %>% html_text()
```
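If the endpoint returns JSON, the extracted string can be parsed into a native R object. A minimal sketch, assuming the `jsonlite` package is installed and `jsontext` holds valid JSON:

```R
library(jsonlite)

# convert the JSON string into a list or data frame
data <- fromJSON(jsontext)
```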
For a complete example, see [dynamic_rvest.R](src/dynamic_rvest.R).

## Web scraping with RSelenium
```R
install.package("RSelenium")
library(RSelenium)```
### Starting Selenium
#### Method 1
```R
# start a Selenium server and browser driver locally
rD <- rsDriver(browser = "chrome", port = 9515L, verbose = FALSE)
remDr <- rD[["client"]]
```
#### Method 2
```shell
docker run -d -p 4445:4444 selenium/standalone-firefox
```

```R
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()
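
# optional sanity check (an addition, not in the original code): confirm
# the remote server is responding
remDr$getStatus()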
```

### Working with elements in Selenium
```R
remDr$navigate("https://books.toscrape.com/catalogue/category/books/science-fiction_16")
```
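Content on dynamic pages may keep rendering after navigation. One option, sketched here with an arbitrary 5-second value, is RSelenium's implicit wait, which makes element lookups retry before failing:

```R
# retry element lookups for up to 5 seconds
remDr$setImplicitWaitTimeout(milliseconds = 5000)
```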
```R
# book titles are stored in the alt attribute of each cover image
titleElements <- remDr$findElements(using = "xpath", "//article//img")
titles <- sapply(titleElements, function(x) x$getElementAttribute("alt")[[1]])

# prices and stock availability are plain text elements
pricesElements <- remDr$findElements(using = "xpath", "//*[@class='price_color']")
prices <- sapply(pricesElements, function(x) x$getElementText()[[1]])

stockElements <- remDr$findElements(using = "xpath", "//*[@class='instock availability']")
stocks <- sapply(stockElements, function(x) x$getElementText()[[1]])
```
### Creating a data frame
```R
df <- data.frame(titles, prices, stocks)
```

#### Save CSV
```R
write.csv(df, "books.csv")
```

If you wish to find out more about web scraping with R, see our [blog post](https://oxy.yt/1r8m).