{"id":13857421,"url":"https://github.com/feddelegrand7/ralger","last_synced_at":"2025-04-06T08:15:41.333Z","repository":{"id":36684085,"uuid":"241394878","full_name":"feddelegrand7/ralger","owner":"feddelegrand7","description":"ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2. ","archived":false,"fork":false,"pushed_at":"2024-07-16T09:17:18.000Z","size":1068,"stargazers_count":155,"open_issues_count":3,"forks_count":14,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-30T07:10:06.510Z","etag":null,"topics":["dataextraction","r","rstats","webcrawling","webscraper-website","webscraping"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/feddelegrand7.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"custom":["https://www.buymeacoffee.com/Fodil"]}},"created_at":"2020-02-18T15:21:00.000Z","updated_at":"2025-02-28T13:45:18.000Z","dependencies_parsed_at":"2024-02-09T01:47:42.950Z","dependency_job_id":"7ca5fc80-a00f-47b7-b05a-cadaff58c5ad","html_url":"https://github.com/feddelegrand7/ralger","commit_stats":{"total_commits":252,"total_committers":6,"mean_commits":42.0,"dds":0.0357142857142857,"last_synced_commit":"57ebc6b07511675c23d91007e701a9722aeb86d4"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feddelegrand7%2Fralger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/feddelegrand7%2Fralger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feddelegrand7%2Fralger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feddelegrand7%2Fralger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/feddelegrand7","download_url":"https://codeload.github.com/feddelegrand7/ralger/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247451667,"owners_count":20940944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataextraction","r","rstats","webcrawling","webscraper-website","webscraping"],"created_at":"2024-08-05T03:01:36.534Z","updated_at":"2025-04-06T08:15:41.312Z","avatar_url":"https://github.com/feddelegrand7.png","language":"R","funding_links":["https://www.buymeacoffee.com/Fodil"],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n# ralger \u003ca\u003e\u003cimg src='man/figures/logo.png' align=\"right\" height=\"200\" /\u003e\u003c/a\u003e\n\n\n\n\u003c!-- badges: start --\u003e\n[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/ralger)](https://cran.r-project.org/package=ralger)\n[![CRAN_time_from_release](https://www.r-pkg.org/badges/ago/ralger)](https://cran.r-project.org/package=ralger)\n[![CRAN_latest_release_date](https://www.r-pkg.org/badges/last-release/ralger)](https://cran.r-project.org/package=ralger)\n[![metacran downloads](https://cranlogs.r-pkg.org/badges/ralger)](https://cran.r-project.org/package=ralger)\n[![metacran downloads](https://cranlogs.r-pkg.org/badges/grand-total/ralger)](https://cran.r-project.org/package=ralger)\n\u003c!-- [![license](https://img.shields.io/github/license/mashape/apistatus.svg)](https://choosealicense.com/licenses/mit/) --\u003e\n[![R badge](https://img.shields.io/badge/Build%20with-♥%20and%20R-blue)](https://github.com/feddelegrand7/ralger)\n[![R badge](https://img.shields.io/badge/-Sponsor-brightgreen)](https://www.buymeacoffee.com/Fodil)\n[![R build status](https://github.com/feddelegrand7/ralger/workflows/R-CMD-check/badge.svg)](https://github.com/feddelegrand7/ralger/actions)\n[![Codecov test coverage](https://codecov.io/gh/feddelegrand7/ralger/branch/master/graph/badge.svg)](https://codecov.io/gh/feddelegrand7/ralger?branch=master)\n\u003c!-- badges: end --\u003e\n\n\n\nThe goal of **ralger** is to facilitate web scraping in R. 
For a quick video tutorial, watch the talk I gave at useR2020 [here](https://www.youtube.com/watch?v=OHi6E8jegQg).\n\n## Installation\n\nYou can install the `ralger` package from [CRAN](https://cran.r-project.org/) with:\n\n```{r eval=FALSE}\ninstall.packages(\"ralger\")\n\n```\n\nor you can install the development version from [GitHub](https://github.com/) with:\n\n``` r\n# install.packages(\"devtools\")\ndevtools::install_github(\"feddelegrand7/ralger\")\n```\n## `scrap()`\n\nThis is an example which shows how to extract [top ranked universities' names](http://www.shanghairanking.com/rankings/arwu/2021) according to the ShanghaiRanking Consultancy:\n\n\n```{r example}\nlibrary(ralger)\n\nmy_link \u003c- \"http://www.shanghairanking.com/rankings/arwu/2021\"\n\nmy_node \u003c- \"a span\" # The CSS selector; I recommend SelectorGadget if you're not familiar with CSS selectors\n\nclean \u003c- TRUE # Should the function clean the extracted vector or not? Default is FALSE\n\nbest_uni \u003c- scrap(link = my_link, node = my_node, clean = clean)\n\nhead(best_uni, 10)\n\n```\n\nThanks to the [robotstxt](https://github.com/ropensci/robotstxt) package, you can set `askRobot = TRUE` to ask the `robots.txt` file if it's permitted to scrape a specific web page.\n\nIf you want to scrape multiple list pages, just use `scrap()` in conjunction with `paste0()`.\nSuppose that you want to scrape all `RStudio::conf 2021` speakers:\n\n```{r}\n\nbase_link \u003c- \"https://global.rstudio.com/student/catalog/list?category_ids=1796-speakers\u0026page=\"\n\nlinks \u003c- paste0(base_link, 1:3) # the speakers are listed from page 1 to 3\n\nnode \u003c- \".pr-1\"\n\n\nhead(scrap(links, node), 10) # printing the first 10 speakers\n\n```\n\n## `attribute_scrap()`\n\nIf you need to scrape some elements' attributes, you can use the `attribute_scrap()` function as in the following example:\n\n\n```{r}\n# Getting all class names from the anchor elements\n# from the ropensci 
website\n\nattributes \u003c- attribute_scrap(link = \"https://ropensci.org/\",\n                node = \"a\", # the a tag\n                attr = \"class\" # getting the class attribute\n                )\n\nhead(attributes, 10) # NA values are a tags without a class attribute\n```\n\nAs another example, let's say we want to get all JavaScript dependencies within the same web page:\n\n```{r}\n\njs_depend \u003c- attribute_scrap(link = \"https://ropensci.org/\",\n                             node = \"script\",\n                             attr = \"src\")\n\njs_depend\n\n```\n\n## `table_scrap()`\n\nIf you want to extract an __HTML Table__, you can use the `table_scrap()` function. Take a look at this [webpage](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW), which lists the highest gross revenues in the cinema industry. You can extract the HTML table as follows:\n\n```{r}\n\n\ndata \u003c- table_scrap(link = \"https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW\")\n\nhead(data)\n\n\n```\n\n__When you deal with a web page that contains many HTML tables, you can use the `choose` argument to target a specific table.__\n\n\n## `tidy_scrap()`\n\nSometimes you'll find useful information on the internet that you want to extract in tabular form, but this information is not provided as an HTML table. In this context, you can use the `tidy_scrap()` function, which returns a tidy data frame according to the arguments that you provide. The function takes five arguments:\n\n- **link**: the link of the website you're interested in;\n- **nodes**: a vector of CSS selectors for the elements you want to extract. These elements will form the columns of your data frame;\n- **colnames**: this argument represents the vector of names you want to assign to your columns. 
Note that you should respect the same order as within the **nodes** vector;\n- **clean**: if `TRUE`, the function will clean the tibble's columns;\n- **askRobot**: ask the robots.txt file if it's permitted to scrape the web page.\n\n### Example\n\nWe'll work on the famous [IMDb website](https://www.imdb.com/). Let's say we need a data frame composed of:\n\n- The title of the 50 best ranked movies of all time\n- Their release year\n- Their rating\n\nWe will need to use the `tidy_scrap()` function as follows:\n\n```{r example3, message=FALSE, warning=FALSE}\n\nmy_link \u003c- \"https://www.imdb.com/search/title/?groups=top_250\u0026sort=user_rating\"\n\nmy_nodes \u003c- c(\n  \".lister-item-header a\", # The title\n  \".text-muted.unbold\", # The year of release\n  \".ratings-imdb-rating strong\" # The rating\n  )\n\nnames \u003c- c(\"title\", \"year\", \"rating\") # respect the nodes order\n\n\ntidy_scrap(link = my_link, nodes = my_nodes, colnames = names)\n\n\n```\n\nNote that all columns will be of *character* class. You'll have to convert them according to your needs.\n\n\n\n## `titles_scrap()`\n\nUsing `titles_scrap()`, one can efficiently scrape titles which correspond to the _h1, h2 \u0026 h3_ HTML tags.\n\n\n\n### Example\n\nIf we go to the [New York Times](https://www.nytimes.com/), we can easily extract the titles displayed within a specific web page:\n\n\n```{r example4}\n\n\ntitles_scrap(link = \"https://www.nytimes.com/\")\n\n\n\n```\n\nFurther, it's possible to filter the results using the `contain` argument:\n\n\n```{r}\n\ntitles_scrap(link = \"https://www.nytimes.com/\", contain = \"TrUMp\", case_sensitive = FALSE)\n\n\n\n```\n\n\n## `paragraphs_scrap()`\n\n\nIn the same way, we can use the `paragraphs_scrap()` function to extract paragraphs. 
This function relies on the `p` HTML tag.\n\nLet's get some paragraphs from the lovely [ropensci.org](https://ropensci.org/) website:\n\n\n```{r}\n\nparagraphs_scrap(link = \"https://ropensci.org/\")\n\n```\n\nIf needed, it's possible to collapse the paragraphs into one bag of words:\n\n\n```{r}\n\nparagraphs_scrap(link = \"https://ropensci.org/\", collapse = TRUE)\n\n```\n\n\n## `weblink_scrap()`\n\n`weblink_scrap()` is used to scrape the web links available within a web page. This is useful in some cases, for example, to get a list of the available PDFs:\n\n\n```{r}\n\nweblink_scrap(link = \"https://www.worldbank.org/en/access-to-information/reports/\",\n              contain = \"PDF\",\n              case_sensitive = FALSE)\n\n\n```\n\n## `images_scrap()` and `images_preview()`\n\n`images_preview()` allows you to scrape the URLs of the images available within a web page so that you can choose which image __extension__ (see below) you want to focus on.\n\nLet's say we want to list all the images from the official [RStudio](https://rstudio.com/) website:\n\n\n```{r}\n\nimages_preview(link = \"https://rstudio.com/\")\n\n```\n\n`images_scrap()`, on the other hand, downloads the images. It takes the following arguments:\n\n+ **link**: the URL of the web page;\n\n+ **imgpath**: the destination folder of your images. It defaults to `getwd()`;\n\n+ **extn**: the extension of the image: jpg, png, jpeg ... 
among others;\n\n+ **askRobot**: ask the robots.txt file if it's permitted to scrape the web page.\n\n\nIn the following example, we extract all the `png` images from [RStudio](https://rstudio.com/):\n\n\n```{r, eval=FALSE}\n\n# Suppose we're in a project which has a folder called my_images:\n\nimages_scrap(link = \"https://rstudio.com/\",\n             imgpath = here::here(\"my_images\"),\n             extn = \"png\") # without the .\n\n```\n\n\n# Accessibility related functions\n\n\n## `images_noalt_scrap()`\n\n\n`images_noalt_scrap()` can be used to get the images within a specific web page that don't have an `alt` attribute, which can be annoying for people using a screen reader:\n\n\n```{r}\n\nimages_noalt_scrap(link = \"https://www.r-consortium.org/\")\n\n```\nIf no images without `alt` attributes are found, the function returns `NULL` and displays an informative message:\n\n\n```{r}\n# WebAIM is the reference website for web accessibility\n\nimages_noalt_scrap(link = \"https://webaim.org/techniques/forms/controls\")\n```\n\n\n\n## Code of Conduct\n\nPlease note that the ralger project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeddelegrand7%2Fralger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffeddelegrand7%2Fralger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeddelegrand7%2Fralger/lists"}