# Web Scraping With R

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)


- [Installing requirements](#installing-requirements)
- [Web scraping with rvest](#web-scraping-with-rvest)
- [Web scraping with RSelenium](#web-scraping-with-rselenium)

This tutorial covers the basics of web scraping with R. We’ll start by scraping static pages, then move on to techniques for scraping dynamic websites that use JavaScript to render their content.

For a detailed explanation, see [this blog post](https://oxy.yt/1r8m).

## Installing requirements

For macOS, run the following:

```shell
brew install r
brew install --cask r-studio
```

For Windows, run the following:

```batch
choco install r.project
choco install r.studio
```

### Installing required libraries

```R
install.packages("rvest")
install.packages("dplyr")
```
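Once installed, load both packages; `dplyr` provides the `%>%` pipe used throughout this tutorial:

```R
library(rvest)
library(dplyr)
```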

## Web scraping with rvest

```R
library(rvest)
# Fetch and parse the target page; later steps reuse this object
link <- "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"
page <- read_html(link)
```

### Parsing HTML Content

```R
page %>% html_elements(css = "")    # select elements matching a CSS selector
page %>% html_elements(xpath = "")  # or matching an XPath expression
```
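For example, a quick way to confirm a selector works is to pull the page’s main heading (the `h1#firstHeading` id is specific to Wikipedia pages):

```R
page %>% html_element("h1#firstHeading") %>% html_text2()
```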

![](https://oxylabs.io/blog/images/2021/12/wiki_markup.png)

For the page shown above, use the following:

```R
htmlElement <- page %>% html_element("table.sortable")
```

### Saving data to a data frame

```R
df <- html_table(htmlElement, header = FALSE)  # convert the table node to a data frame
names(df) <- df[2, ]  # the second row holds the column names
df <- df[-1:-2, ]     # drop the two header rows
```
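A quick inspection verifies that the column names and rows were parsed as expected:

```R
head(df)  # first rows of the ISO codes table
```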

### Exporting data frame to a CSV file

```R
write.csv(df, "iso_codes.csv")
```
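Note that `write.csv()` adds row numbers as an extra first column by default; pass `row.names = FALSE` to omit them:

```R
write.csv(df, "iso_codes.csv", row.names = FALSE)
```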

### Downloading Images

```R
page <- read_html(url)  # url is the address of the page containing the image
image_element <- page %>% html_element(".thumbborder")
image_url <- image_element %>% html_attr("src")
# src may be protocol-relative ("//upload.wikimedia.org/..."); add a scheme if so
if (startsWith(image_url, "//")) image_url <- paste0("https:", image_url)
download.file(image_url, destfile = "paris.jpg")
```

### Scraping dynamic pages with rvest

Find the API endpoint that the page calls and request it directly:
```R
library(httr)  # provides GET() and timeout()

# api_url is the JSON endpoint the page fetches in the background
page <- read_html(GET(api_url, timeout(10)))
jsontext <- page %>% html_element("p") %>% html_text()
```
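From here you would typically parse the JSON rather than treat it as plain text. A minimal sketch using the jsonlite package (assuming `jsontext` from above holds valid JSON):

```R
library(jsonlite)

# Convert the JSON string into native R objects (lists / data frames)
data <- fromJSON(jsontext)
```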
For a complete example, see [dynamic_rvest.R](src/dynamic_rvest.R).

## Web scraping with RSelenium

```R
install.packages("RSelenium")
library(RSelenium)
```

### Starting Selenium

#### Method 1

```R
# Start a Selenium server and browser directly from R
rD <- rsDriver(browser = "chrome", port = 9515L, verbose = FALSE)
remDr <- rD[["client"]]
```

#### Method 2

```shell
docker run -d -p 4445:4444 selenium/standalone-firefox
```

```R
# Connect to the Selenium instance running in the Docker container
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()
```
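A quick sanity check that the session is live (`getStatus()` is part of the `remoteDriver` API):

```R
remDr$getStatus()  # returns a list describing the Selenium server
```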

### Working with elements in Selenium

```R
remDr$navigate("https://books.toscrape.com/catalogue/category/books/science-fiction_16")
```

![](https://oxylabs.io/blog/images/2021/12/book_title.png)

```R
# Book titles are stored in the alt attribute of each cover image
titleElements <- remDr$findElements(using = "xpath", "//article//img")
titles <- sapply(titleElements, function(x) {x$getElementAttribute("alt")[[1]]})

pricesElements <- remDr$findElements(using = "xpath", "//*[@class='price_color']")
prices <- sapply(pricesElements, function(x) {x$getElementText()[[1]]})

stockElements <- remDr$findElements(using = "xpath", "//*[@class='instock availability']")
stocks <- sapply(stockElements, function(x) {x$getElementText()[[1]]})
```

### Creating a data frame

```R
df <- data.frame(titles, prices, stocks)
```

### Saving the data frame to a CSV file

```R
write.csv(df, "books.csv")
```
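When you are done, close the browser session and, if you used Method 1, stop the Selenium server as well:

```R
remDr$close()     # end the browser session
rD$server$stop()  # Method 1 only: shut down the server started by rsDriver()
```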

If you wish to find out more about web scraping with R, see our [blog post](https://oxy.yt/1r8m).