{"id":13400637,"url":"https://github.com/dmi3kno/polite","last_synced_at":"2025-10-22T06:14:47.437Z","repository":{"id":34621740,"uuid":"141985119","full_name":"dmi3kno/polite","owner":"dmi3kno","description":"Be nice on the web","archived":false,"fork":false,"pushed_at":"2023-09-05T22:14:20.000Z","size":1889,"stargazers_count":327,"open_issues_count":6,"forks_count":13,"subscribers_count":6,"default_branch":"develop","last_synced_at":"2025-03-06T02:05:25.920Z","etag":null,"topics":["crawler","memoise","r","r-package","rate-limiter","robotstxt","rstats","rvest","scraper","webscraping"],"latest_commit_sha":null,"homepage":"https://dmi3kno.github.io/polite/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmi3kno.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-07-23T08:30:36.000Z","updated_at":"2025-02-06T13:24:53.000Z","dependencies_parsed_at":"2024-01-18T11:03:43.885Z","dependency_job_id":"fdf42864-d15a-49d7-9f55-3d29be424073","html_url":"https://github.com/dmi3kno/polite","commit_stats":{"total_commits":66,"total_committers":6,"mean_commits":11.0,"dds":"0.28787878787878785","last_synced_commit":"70b2799619d33505cf38bc375646bdcfb89cd7d8"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmi3kno%2Fpolite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmi3kno%2Fpolite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmi3kno%2Fpolite/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/dmi3kno%2Fpolite/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmi3kno","download_url":"https://codeload.github.com/dmi3kno/polite/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243537788,"owners_count":20307098,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","memoise","r","r-package","rate-limiter","robotstxt","rstats","rvest","scraper","webscraping"],"created_at":"2024-07-30T19:00:54.145Z","updated_at":"2025-10-22T06:14:42.394Z","avatar_url":"https://github.com/dmi3kno.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n# polite \u003cimg src=\"man/figures/logo.png\" align=\"right\" /\u003e\n\u003c!-- badges: start --\u003e\n[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/dmi3kno/polite?branch=master\u0026svg=true)](https://ci.appveyor.com/project/dmi3kno/polite)\n[![Codecov test coverage](https://codecov.io/gh/dmi3kno/polite/branch/master/graph/badge.svg)](https://app.codecov.io/gh/dmi3kno/polite?branch=master)\n[![CRAN status](https://www.r-pkg.org/badges/version/polite)](https://CRAN.R-project.org/package=polite)\n[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html#maturing)\n[![R-CMD-check](https://github.com/dmi3kno/polite/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/dmi3kno/polite/actions/workflows/R-CMD-check.yaml)\n\u003c!-- badges: end --\u003e\n\n\nThe goal of `polite` is to promote responsible web etiquette. \n\n\u003e __\"bow and scrape\" (verb):__ \n\u003e\n\u003e 1) To make a deep bow with the right leg drawn back (thus scraping the floor), left hand pressed across the abdomen, right arm held aside.\n\u003e\n\u003e 2) _(idiomatic, by extension)_ To behave in a servile, obsequious, or excessively polite manner. [1]                   \n\u003e                                             Source: _Wiktionary, The free dictionary_\n\u003e\n\nThe package's two main functions `bow` and `scrape` define and realize a web harvesting session. `bow` is used to introduce the client to the host and ask for permission to scrape (by inquiring against the host's `robots.txt` file), while `scrape` is the main function for retrieving data from the remote server. Once the connection is established, there's no need to `bow` again. 
Rather, to adjust the scraping URL, the user can simply `nod` to the new path, which updates the session's URL, making sure that the new location can be negotiated against `robots.txt`.\n\nThe three pillars of a `polite session` are **seeking permission, taking slowly and never asking twice**.\n\nThe package builds on awesome toolkits for defining and managing HTTP sessions (`httr` and `rvest`), declaring the user agent string and investigating site policies (`robotstxt`), and utilizing rate-limiting and response caching (`ratelimitr` and `memoise`).\n\n## Installation\n\nYou can install `polite` from [CRAN](https://cran.r-project.org/) with:\n\n```{r, eval=FALSE}\ninstall.packages(\"polite\")\n```\n\nThe development version of the package can be installed from [GitHub](https://github.com/dmi3kno/polite) with:\n\n```{r, eval=FALSE}\ninstall.packages(\"remotes\")\nremotes::install_github(\"dmi3kno/polite\")\n```\n\n\n## Basic Example\n\n\nThis is a basic example which shows how to retrieve the list of semi-soft cheeses from www.cheese.com. Here, we authenticate a session and then scrape the page with specified parameters. Behind the scenes, `polite` retrieves `robots.txt`, checks the URL and user agent string against it, caches the call to `robots.txt` and to the web page, and enforces rate limiting.\n\n```{r example, message=FALSE, warning=FALSE}\nlibrary(polite)\nlibrary(rvest)\n\nsession \u003c- bow(\"https://www.cheese.com/by_type\", force = TRUE)\nresult \u003c- scrape(session, query=list(t=\"semi-soft\", per_page=100)) %\u003e%\n  html_node(\"#main-body\") %\u003e% \n  html_nodes(\"h3\") %\u003e% \n  html_text()\nhead(result)\n```\n\n## Extended Example\n\nYou can build your own functions that incorporate `bow`, `scrape` (and, if required, `nod`). Here we will extend our inquiry into cheeses and download all cheese names and the URLs of their information pages. 
Let's retrieve the number of pages per letter in the alphabetical list, keeping the number of results per page to 100 to minimize the number of web requests.\n\n```{r, warning=FALSE, message=FALSE, error=FALSE}\nlibrary(polite)\nlibrary(rvest)\nlibrary(purrr)\nlibrary(dplyr)\n\nsession \u003c- bow(\"https://www.cheese.com/alphabetical\")\n\n# this is only to illustrate the example.\nletters \u003c- letters[1:3] # delete this line to scrape all letters\n\nresponses \u003c- map(letters, ~scrape(session, query = list(per_page=100,i=.x)) )\nresults \u003c- map(responses, ~html_nodes(.x, \"#id_page li\") %\u003e% \n                           html_text(trim = TRUE) %\u003e% \n                           as.numeric() %\u003e%\n                           tail(1) ) %\u003e% \n           map(~pluck(.x, 1, .default=1))\npages_df \u003c- tibble(letter = rep.int(letters, times=unlist(results)),\n                   pages = unlist(map(results, ~seq.int(from=1, to=.x))))\npages_df\n```\n\nNow that we know how many pages to retrieve for each letter, let's iterate over the letter pages and retrieve cheese names and the underlying links to cheese details. We will need to write a helper function. Our session is still valid and we don't need to `nod` again, because we will not be modifying the page URL, only its parameters (note that no `url` argument is passed to the `scrape` function).\n\n```{r}\nget_cheese_page \u003c- function(letter, pages){\n  lnks \u003c- scrape(session, query=list(per_page=100,i=letter,page=pages)) %\u003e% \n    html_nodes(\"h3 a\")\n  tibble(name=lnks %\u003e% html_text(),\n         link=lnks %\u003e% html_attr(\"href\"))\n}\n\ndf \u003c- pages_df %\u003e% pmap_df(get_cheese_page)\ndf\n```\n\n## Another example\n\nBob Rudis is one of the vocal proponents of online etiquette in the R community. If you have never seen his robots.txt file, you should definitely [check it out](https://rud.is/robots.txt)! Let's look at his [blog](https://rud.is/b/). 
We don't know how many pages the gallery will return, so we keep going until there is no “Older posts” button left. Note that I first `bow` to the host and then simply `nod` to the current scraping page inside the `while` loop.\n\n```{r, eval=FALSE}\n    library(polite)\n    library(rvest)\n    \n    hrbrmstr_posts \u003c- data.frame()\n    url \u003c- \"https://rud.is/b/\"\n    session \u003c- bow(url)\n    \n    while(!is.na(url)){\n      # make it verbose\n      message(\"Scraping \", url)\n      # nod and scrape\n      current_page \u003c- nod(session, url) %\u003e% \n        scrape(verbose=TRUE)\n      # extract post titles\n      hrbrmstr_posts \u003c- current_page %\u003e% \n        html_nodes(\".entry-title a\") %\u003e% \n        polite::html_attrs_dfr() %\u003e% \n        rbind(hrbrmstr_posts)\n      # see if there's \"Older posts\" button\n      url \u003c- current_page %\u003e% \n        html_node(\".nav-previous a\") %\u003e% \n        html_attr(\"href\")\n    } # end while loop\n    \n    tibble::as_tibble(hrbrmstr_posts)\n    #\u003e # A tibble: 578 x 3\n```\n\nWe organize the data into a tidy format and append it to our initially empty data frame. At the end we will discover that Bob has written over 570 blog articles, which I highly recommend checking out.\n\n## Polite for package developers\n\nIf you are developing a package which accesses the web, `polite` can be used either as a *template* or as a *backend* for your polite web session.\n\n### Polite template\n\nJust before its ascension to CRAN, the package acquired new functionality for helping package developers get started on creating polite web tools for their users. Any modern package developer is probably familiar with the excellent [`usethis` package](https://github.com/r-lib/usethis) by the RStudio team. `usethis` is a collection of scripts for automating the package development workflow. 
Many `usethis` functions that automate repetitive tasks start with the prefix `use_`, indicating that what follows will be adopted and \"used\" by the package the user develops. For details about the `use_` family of functions, see the [package documentation](https://usethis.r-lib.org/reference/index.html). \n\n`{polite}` has one usethis-like function called `polite::use_manners()`. \n\n```{r, eval=FALSE}\npolite::use_manners()\n```\n\nWhen called within the analysis (or package) directory, it creates a new file called `R/polite-scrape.R` (creating the `R` directory if necessary) and populates it with template functions for creating a polite web-scraping session. The functions provided by `polite::use_manners()` are drop-in replacements for two of the most popular tools in the web-accessing R ecosystem: `read_html()` and `download.file()`. The only difference is that these functions have the `polite_` prefix. In all other respects they should have the look and feel of the originals, i.e. in most cases you should be able to simply replace calls to `read_html()` with `polite_read_html()` and `download.file()` with `polite_download_file()` and your code should work (provided you scrape from a `url`, which is the first required argument in both functions).\n\n### Polite backend\n\nA recent addition to the polite package is a [`purrr`-like](https://purrr.tidyverse.org/reference/index.html#section-adverbs) adverb `politely()`, which can make any web-accessing function \"polite\" by wrapping it with code which delivers on the four pillars of a polite session: \n\n\u003e **Introduce Yourself, Seek Permission, Take Slowly and Never Ask Twice**.\n\nAdverbs can be useful when a user (package developer) wants to \"delegate\" polite session handling to an external package, without modifying the existing code. 
The only thing the user needs to do is wrap the existing verb with `politely()` and use the new function instead of the original.\n\nLet's say you wanted to use `httr::GET` to access a certain API, such as `musicbrainz`, and extract certain data from a deeply nested list returned by the server. Your originally developed code looks like this:\n\n```{r}\nlibrary(magrittr)\nlibrary(httr)\nlibrary(xml2)\nlibrary(purrr)\n\nbeatles_res \u003c- GET(\"https://musicbrainz.org/ws/2/artist/\", \n                   query=list(query=\"Beatles\", limit=10),\n                   httr::accept(\"application/json\")) \nif(!is.null(beatles_res)) beatles_lst \u003c- httr::content(beatles_res, type = \"application/json\")\n\nstr(beatles_lst, max.level = 2)\n```\n\nThis code does not comply with `polite` principles. It does not provide a human-readable user-agent string, and it does not consult `robots.txt` about permissions. It is possible to run this code in a loop and (accidentally) overwhelm the server with requests. It does not cache the results, so if this code is re-run, the data will be queried again.\n\nYou could write your own infrastructure for handling the user agent, robots.txt, rate limiting and memoisation, or you could simply use the adverb `politely()`, which does all of these things for you.\n\n### Querying colormind.io with polite backend\n\nHere's an example of using the colormind.io API. 
We will need a couple of service functions to convert colors between HEX and RGB and to prepare the JSON [required by the service](http://colormind.io/api-access/).\n\n```{r}\nrgba2hex \u003c- function(r,g,b,a) {grDevices::rgb(r, g, b, a, maxColorValue = 255)}\n\nhex2rgba \u003c- function(x, alpha=TRUE){t(grDevices::col2rgb(x, alpha = alpha))}\n\nprepare_colormind_query \u003c- function(x, model){\n  lst \u003c- list(model=model)\n\n  if(!is.null(x)){\n    x \u003c- utils::head(c(x, rep(NA_character_, times=4)), 5) # pad it with NAs\n    x_mat \u003c- hex2rgba(x)\n    x_lst \u003c- lapply(seq_len(nrow(x_mat)), function(i) if(x_mat[i,4]==0) \"N\" else x_mat[i,1:3])\n    lst \u003c- c(list(input=x_lst), lst)\n  }\n  jsonlite::toJSON(lst, auto_unbox = TRUE)\n}\n```\n\nNow all we have to do is \"wrap\" the existing function in the `politely` adverb, then call the new function instead of the original. You don't need to change anything other than the function name.\n\n```{r}\npolite_GET \u003c- politely(httr::GET, verbose=TRUE) \n\n#res \u003c- httr::GET(\"http://colormind.io/list\") # was\nres \u003c- polite_GET(\"http://colormind.io/list\") # now\njsonlite::fromJSON(httr::content(res, as = \"text\"))$result\n```\n\nThe backend functionality of `polite` can be used for *any* function as long as it has a `url` argument (or its first argument is a URL). 
Here's an example of a polite POST created with the adverb `politely`.\n\n```{r}\npolite_POST \u003c- politely(POST, verbose=TRUE) \n\nclue_colors \u003c- c(NA, \"lightseagreen\", NA, \"coral\", NA)\n\nreq \u003c- prepare_colormind_query(clue_colors, \"default\")\n\n#res \u003c- httr::POST(url='http://colormind.io/api/', body = req) #was\nres \u003c- polite_POST(url='http://colormind.io/api/', body = req) #now\nres_json \u003c- httr::content(res, as = \"text\")\nres_mcol \u003c- jsonlite::fromJSON(res_json)$result\ncolrs \u003c- rgba2hex(res_mcol)\nscales::show_col(colrs, ncol = 5)\n```\n\n### Querying musicbrainz API with polite backend\n\nThe [Musicbrainz API](https://musicbrainz.org/doc/MusicBrainz_API) allows querying data on artists, releases, labels and all things music. The API endpoint, unfortunately, is disallowed in `robots.txt`, but it is completely legal to access for small requests. Mass querying is easier using a data dump, which musicbrainz publishes periodically. We can create a polite GET and turn off `robots.txt` validation. 
\n\n```{r}\nlibrary(polite)\npolite_GET_nrt \u003c- politely(GET, verbose=TRUE, robots = FALSE) # turn off robotstxt checking\n\nbeatles_lst \u003c- polite_GET_nrt(\"https://musicbrainz.org/ws/2/artist/\", \n                   query=list(query=\"Beatles\", limit=10),\n                   httr::accept(\"application/json\")) %\u003e% \n  httr::content(type = \"application/json\")\nstr(beatles_lst, max.level = 2)\n```\n\nLet's parse the response:\n\n```{r}\noptions(knitr.kable.NA = '')\nbeatles_lst %\u003e%   \n  extract2(\"artists\") %\u003e% \n  {tibble::tibble(id=map_chr(.,\"id\", .default=NA_character_),\n                  match_pct=map_int(.,\"score\", .default=NA_integer_),\n                  type=map_chr(.,\"type\", .default=NA_character_),\n                  name=map_chr(., \"name\", .default=NA_character_),\n                  country=map_chr(., \"country\", .default=NA_character_),\n                  lifespan_begin=map_chr(., c(\"life-span\", \"begin\"),.default=NA_character_),\n                  lifespan_end=map_chr(., c(\"life-span\", \"end\"),.default=NA_character_)\n                  )\n    } %\u003e% knitr::kable(col.names = c(id=\"Musicbrainz ID\", match_pct=\"Match, %\", \n                                     type=\"Type\", name=\"Name of artist\",\n                                     country=\"Country\", lifespan_begin=\"Career begun\",\n                                     lifespan_end=\"Career ended\")) \n```\n\n## Learn more\n\n[Ethical webscraper manifesto](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01)\n\nPackage logo uses elements of a free image by [pngtree.com](https://pngtree.com)\n\n[1] Wiktionary (2018), The free dictionary, retrieved from 
https://en.wiktionary.org/wiki/bow_and_scrape\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmi3kno%2Fpolite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmi3kno%2Fpolite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmi3kno%2Fpolite/lists"}