Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/feddelegrand7/ralger
ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2.
dataextraction r rstats webcrawling webscraper-website webscraping
Last synced: 6 days ago
- Host: GitHub
- URL: https://github.com/feddelegrand7/ralger
- Owner: feddelegrand7
- License: other
- Created: 2020-02-18T15:21:00.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-07-16T09:17:18.000Z (7 months ago)
- Last Synced: 2025-01-28T16:12:59.667Z (14 days ago)
- Topics: dataextraction, r, rstats, webcrawling, webscraper-website, webscraping
- Language: R
- Homepage:
- Size: 1.02 MB
- Stars: 156
- Watchers: 6
- Forks: 14
- Open Issues: 3
- Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- jimsghstars - feddelegrand7/ralger - ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2. (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# ralger

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/ralger)](https://cran.r-project.org/package=ralger)
[![CRAN_time_from_release](https://www.r-pkg.org/badges/ago/ralger)](https://cran.r-project.org/package=ralger)
[![CRAN_latest_release_date](https://www.r-pkg.org/badges/last-release/ralger)](https://cran.r-project.org/package=ralger)
[![metacran downloads](https://cranlogs.r-pkg.org/badges/ralger)](https://cran.r-project.org/package=ralger)
[![metacran downloads](https://cranlogs.r-pkg.org/badges/grand-total/ralger)](https://cran.r-project.org/package=ralger)
[![R badge](https://img.shields.io/badge/Build%20with-♥%20and%20R-blue)](https://github.com/feddelegrand7/ralger)
[![R badge](https://img.shields.io/badge/-Sponsor-brightgreen)](https://www.buymeacoffee.com/Fodil)
[![R build status](https://github.com/feddelegrand7/ralger/workflows/R-CMD-check/badge.svg)](https://github.com/feddelegrand7/ralger/actions)
[![Codecov test coverage](https://codecov.io/gh/feddelegrand7/ralger/branch/master/graph/badge.svg)](https://codecov.io/gh/feddelegrand7/ralger?branch=master)

The goal of **ralger** is to facilitate web scraping in R. For a quick video tutorial, I gave a talk at useR! 2020, which you can find [here](https://www.youtube.com/watch?v=OHi6E8jegQg).
## Installation
You can install the `ralger` package from [CRAN](https://cran.r-project.org/) with:
```{r eval=FALSE}
install.packages("ralger")
```
or you can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("feddelegrand7/ralger")
```
## `scrap()`

This is an example which shows how to extract [top ranked universities' names](http://www.shanghairanking.com/rankings/arwu/2021) according to the ShanghaiRanking Consultancy:
```{r example}
library(ralger)

my_link <- "http://www.shanghairanking.com/rankings/arwu/2021"
my_node <- "a span" # the CSS selector; I recommend SelectorGadget if you're not familiar with CSS selectors
clean <- TRUE # Should the function clean the extracted vector? Default is FALSE
best_uni <- scrap(link = my_link, node = my_node, clean = clean)
head(best_uni, 10)
```
Thanks to the [robotstxt](https://github.com/ropensci/robotstxt) package, you can set `askRobot = TRUE` to ask the `robots.txt` file whether it's permitted to scrape a specific web page.
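For instance, a minimal sketch reusing `my_link` and `my_node` from the example above (not evaluated here):

```{r, eval=FALSE}
# Same call as above, but the robots.txt file is consulted first
best_uni <- scrap(link = my_link, node = my_node, askRobot = TRUE)
```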
If you want to scrape multiple list pages, just use `scrap()` in conjunction with `paste0()`.
Suppose that you want to scrape all `RStudio::conf 2021` speakers:

```{r}
base_link <- "https://global.rstudio.com/student/catalog/list?category_ids=1796-speakers&page="
links <- paste0(base_link, 1:3) # the speakers are listed from page 1 to 3
node <- ".pr-1"
head(scrap(links, node), 10) # printing the first 10 speakers
```
## `attribute_scrap()`
If you need to scrape some elements' attributes, you can use the `attribute_scrap()` function as in the following example:
```{r}
# Getting all classes' names from the anchor elements
# from the ropensci website

attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
```

Another example: let's say we want to get all the JavaScript dependencies within the same web page:
```{r}
js_depend <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "script",
  attr = "src"
)

js_depend
```
## `table_scrap()`
If you want to extract an __HTML Table__, you can use the `table_scrap()` function. Take a look at this [webpage](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW) which lists the highest gross revenues in the cinema industry. You can extract the HTML table as follows:
```{r}
data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
```
__When you deal with a web page that contains many HTML tables, you can use the `choose` argument to target a specific table.__
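For instance, a minimal sketch, assuming the page holds several HTML tables (`choose = 2` would target the second one):

```{r, eval=FALSE}
# Hypothetical example: extract the second table found on the page
second_table <- table_scrap(
  link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW",
  choose = 2
)
```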
## `tidy_scrap()`
Sometimes you'll find useful information on the internet that you want to extract in a tabular manner, but it isn't provided as an HTML table. In this context, you can use the `tidy_scrap()` function, which returns a tidy data frame according to the arguments that you provide. The function takes five arguments:
- **link**: the link of the website you're interested in;
- **nodes**: a vector of CSS elements that you want to extract. These elements will form the columns of your data frame;
- **colnames**: this argument represents the vector of names you want to assign to your columns. Note that you should respect the same order as within the **nodes** vector;
- **clean**: if `TRUE`, the function will clean the tibble's columns;
- **askRobot**: ask the `robots.txt` file if it's permitted to scrape the web page.

### Example
We'll work on the famous [IMDb website](https://www.imdb.com/). Let's say we need a data frame composed of:
- The title of the 50 best ranked movies of all time
- Their release year
- Their rating

We will need to use the `tidy_scrap()` function as follows:
```{r example3, message=FALSE, warning=FALSE}
my_link <- "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating"
my_nodes <- c(
".lister-item-header a", # The title
".text-muted.unbold", # The year of release
".ratings-imdb-rating strong" # The rating)
)names <- c("title", "year", "rating") # respect the nodes order
tidy_scrap(link = my_link, nodes = my_nodes, colnames = names)
```
Note that all columns will be of *character* class; you'll have to convert them according to your needs.
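As a minimal sketch, assuming the result above is assigned to `movies`, the conversion could look like this (`readr` is used only for illustration and is not a ralger requirement):

```{r, eval=FALSE}
movies <- tidy_scrap(link = my_link, nodes = my_nodes, colnames = names)
movies$year <- readr::parse_number(movies$year) # drops the parentheses around the year
movies$rating <- as.numeric(movies$rating)
```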
## `titles_scrap()`
Using `titles_scrap()`, one can efficiently scrape titles which correspond to the _h1, h2 & h3_ HTML tags.
### Example
If we go to the [New York Times](https://www.nytimes.com/), we can easily extract the titles displayed within a specific web page:
```{r example4}
titles_scrap(link = "https://www.nytimes.com/")
```
Further, it's possible to filter the results using the `contain` argument:
```{r}
titles_scrap(link = "https://www.nytimes.com/", contain = "TrUMp", case_sensitive = FALSE)
```
## `paragraphs_scrap()`
In the same way, we can use the `paragraphs_scrap()` function to extract paragraphs. This function relies on the `p` HTML tag.
Let's get some paragraphs from the lovely [ropensci.org](https://ropensci.org/) website:
```{r}
paragraphs_scrap(link = "https://ropensci.org/")
```
If needed, it's possible to collapse the paragraphs into one bag of words:
```{r}
paragraphs_scrap(link = "https://ropensci.org/", collapse = TRUE)
```
## `weblink_scrap()`
`weblink_scrap()` is used to scrape the web links available within a web page. It is useful in some cases, for example, for getting a list of the available PDFs:
```{r}
weblink_scrap(
  link = "https://www.worldbank.org/en/access-to-information/reports/",
  contain = "PDF",
  case_sensitive = FALSE
)
```
## `images_scrap()` and `images_preview()`
`images_preview()` allows you to scrape the URLs of the images available within a web page so that you can choose which image __extension__ (see below) you want to focus on.
Let's say we want to list all the images from the official [RStudio](https://rstudio.com/) website:
```{r}
images_preview(link = "https://rstudio.com/")
```
`images_scrap()`, on the other hand, downloads the images. It takes the following arguments:
+ **link**: The URL of the web page;
+ **imgpath**: The destination folder of your images. It defaults to `getwd()`
+ **extn**: the extension of the image: jpg, png, jpeg ... among others;
+ **askRobot**: ask the robots.txt file if it's permitted to scrape the web page.
In the following example, we extract all the `png` images from [RStudio](https://rstudio.com/):
```{r, eval=FALSE}
# Suppose we're in a project which has a folder called my_images:
images_scrap(
  link = "https://rstudio.com/",
  imgpath = here::here("my_images"),
  extn = "png" # without the .
)
```
# Accessibility-related functions
## `images_noalt_scrap()`
`images_noalt_scrap()` can be used to get the images within a specific web page that don't have an `alt` attribute, which can be a problem for people using a screen reader:
```{r}
images_noalt_scrap(link = "https://www.r-consortium.org/")
```
If no images without an `alt` attribute are found, the function returns `NULL` and displays a message:

```{r}
# WebAIM is the reference website for web accessibility
images_noalt_scrap(link = "https://webaim.org/techniques/forms/controls")
```

## Code of Conduct
Please note that the ralger project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.