Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hrbrmstr/spiderbar
Lightweight R wrapper around rep-cpp for robots.txt (Robots Exclusion Protocol) parsing and path testing in R
- Host: GitHub
- URL: https://github.com/hrbrmstr/spiderbar
- Owner: hrbrmstr
- License: other
- Created: 2017-08-14T19:01:10.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-26T17:37:09.000Z (3 months ago)
- Last Synced: 2024-10-12T21:23:31.815Z (22 days ago)
- Topics: r, r-cyber, robots-exclusion-protocol, robots-txt, rstats
- Language: C++
- Size: 88.9 KB
- Stars: 10
- Watchers: 4
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - hrbrmstr/spiderbar - Lightweight R wrapper around rep-cpp for robots.txt (Robots Exclusion Protocol) parsing and path testing in R (C++)
README
---
output:
  rmarkdown::github_document:
    df_print: kable
editor_options:
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

spiderbar is built atop these C++ libraries:

- [`rep-cpp`](https://github.com/seomoz/rep-cpp)
- [`url-cpp`](https://github.com/seomoz/url-cpp)

## What's Inside the Tin
The following functions are implemented:
```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```
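When knit, the chunk above expands into the full function roster. As a quick orientation for anyone reading the raw Rmd, here is a minimal sketch of the four core calls, parsing an inline robots.txt string (exact signatures are in the package docs):

```r
library(spiderbar)

# parse a robots.txt payload (character vector or connection) into a robxp object
rt <- robxp("User-agent: *\nDisallow: /private/")

can_fetch(rt, "/private/page.html", "*") # FALSE: matches the Disallow rule
can_fetch(rt, "/public/page.html", "*")  # TRUE: no rule blocks it
crawl_delays(rt)                         # per-agent Crawl-delay values
sitemaps(rt)                             # any Sitemap entries declared in the file
```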
## Installation

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```
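The chunk above emits the canonical install block when knit. For readers of the raw Rmd, the usual patterns look like this (a sketch; it assumes the package is available on CRAN and that you have the remotes package for the GitHub route):

```r
# released version, assuming availability on CRAN
install.packages("spiderbar")

# development version from GitHub
remotes::install_github("hrbrmstr/spiderbar")
```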
## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}
library(spiderbar)
library(robotstxt)

# current version
packageVersion("spiderbar")

# use helpers from the robotstxt package
rt <- robxp(get_robotstxt("https://cdc.gov"))
print(rt)
# or read the robots.txt directly from a connection
rt <- robxp(url("https://cdc.gov/robots.txt"))
can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
can_fetch(rt, "/_borders", "*")
gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))
can_fetch(gh_rt, "/humans.txt", "*") # TRUE
can_fetch(gh_rt, "/login", "*") # FALSE
can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
# paths can be vectorized
can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))
crawl_delays(gh_rt)
imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))
crawl_delays(imdb_rt)
sitemaps(imdb_rt)
```
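Putting the pieces together, here is a sketch of a polite fetch loop that honors both path rules and declared crawl delays. The message() calls stand in for a real HTTP client, and the fallback assumes crawl_delays() reports a non-positive value when an agent declares no delay:

```r
library(spiderbar)
library(robotstxt)

host  <- "https://cdc.gov"
paths <- c("/asthma/asthma_stats/default.htm", "/_borders")

rt     <- robxp(get_robotstxt(host))
delays <- crawl_delays(rt)

# fall back to a 1s pause when the wildcard agent declares no usable delay
delay <- delays$crawl_delay[delays$agent == "*"]
if (length(delay) == 0 || is.na(delay) || delay <= 0) delay <- 1

for (p in paths) {
  if (can_fetch(rt, p, "*")) {
    message("fetching ", p)  # swap in e.g. httr::GET(paste0(host, p))
    Sys.sleep(delay)         # honor the site's declared crawl delay
  } else {
    message("skipping ", p, " (disallowed by robots.txt)")
  }
}
```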
## spiderbar Metrics

```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.