---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# urlparse

[![CRAN status](https://www.r-pkg.org/badges/version/urlparse)](https://CRAN.R-project.org/package=urlparse)
[![R-CMD-check](https://github.com/DyfanJones/urlparse/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/DyfanJones/urlparse/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/DyfanJones/urlparse/graph/badge.svg)](https://app.codecov.io/gh/DyfanJones/urlparse)
[![urlparse status badge](https://dyfanjones.r-universe.dev/urlparse/badges/version)](https://dyfanjones.r-universe.dev/urlparse)

Fast and simple url parser for R. Initially developed for the `paws.common` package.

```{r}
urlparse::url_parse("https://user:[email protected]:8000/path?query=1#fragment")
```

## Installation

You can install the development version of urlparse like so:

``` r
remotes::install_github("dyfanjones/urlparse")
```

r-universe installation:

```r
install.packages("urlparse", repos = c("https://dyfanjones.r-universe.dev", "https://cloud.r-project.org"))
```

## Example

A basic example: load the package, then percent-encode and decode a string:

```{r example}
library(urlparse)
```

```{r encode}
url_encoder("foo = bar + 5")

url_decoder(url_encoder("foo = bar + 5"))
```

Similar to Python's `urllib.parse.quote`, `urlparse::url_encoder` supports the `safe` parameter, which lists additional ASCII characters that should not be encoded.

```{python python_encode_safe}
from urllib.parse import quote
quote("foo = bar + 5", safe = "+")
```
```{r r_encode_safe}
url_encoder("foo = bar + 5", safe = "+")
```
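
As a further illustration of `safe`, the sketch below keeps the `/` separators of a path intact while the other characters are escaped. The `path` value is a made-up input, and passing `"/"` as the safe character is an assumption that mirrors the `"+"` example above; compare the two outputs to see which characters `safe` protects.

```r
library(urlparse)

# Hypothetical input: a path whose segments contain spaces.
path <- "reports/2024 Q1/summary file.txt"

# Encode without any extra safe characters.
url_encoder(path)

# Ask the encoder to leave the "/" separators untouched.
url_encoder(path, safe = "/")

# Round trip: decoding the default encoding should give back the input.
identical(url_decoder(url_encoder(path)), path)
```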

Modify a URL by piping the `set_*` functions, or by calling the standalone `url_modify` function.

```{r url_modify}

url <- "http://example.com"
set_scheme(url, "https") |>
  set_port(1234L) |>
  set_path("foo/bar") |>
  set_query("baz") |>
  set_fragment("quux")

url_modify(url, scheme = "https", port = 1234, path = "foo/bar", query = "baz", fragment = "quux")
```

Note: `url_modify` is faster than piping the `set_*` functions, because each `set_*` call has to parse the url again before it can modify it.

```{r url_mod_bench}
url <- "http://example.com"
bench::mark(
  piping = {
    set_scheme(url, "https") |>
      set_port(1234L) |>
      set_path("foo/bar") |>
      set_query("baz") |>
      set_fragment("quux")
  },
  single_function = url_modify(url, scheme = "https", port = 1234, path = "foo/bar", query = "baz", fragment = "quux")
)
```
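
Before swapping a `set_*` pipeline for a single `url_modify` call, it may be worth checking that both routes build the same url. A minimal sketch using only the functions shown above; the variable names `piped` and `single` are illustrative:

```r
library(urlparse)

url <- "http://example.com"

# Route 1: one parse-and-rebuild step per set_* call.
piped <- set_scheme(url, "https") |>
  set_port(1234L) |>
  set_path("foo/bar") |>
  set_query("baz") |>
  set_fragment("quux")

# Route 2: all components changed in a single call.
single <- url_modify(
  url,
  scheme = "https", port = 1234L, path = "foo/bar",
  query = "baz", fragment = "quux"
)

# Print both results and compare them.
piped
single
identical(piped, single)
```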

## Benchmark:

```{r, echo = FALSE}
show_relative <- function(bm) {
  summary_cols <- c("min", "median", "itr/sec", "mem_alloc", "gc/sec")
  bm[summary_cols] <- lapply(bm[summary_cols], function(x) as.numeric(x / min(x)))
  return(bm)
}
```

### Parsing URL:

```{r benchmark}
url <- "https://user:[email protected]:8000/path?query=1#fragment"
(bm <- bench::mark(
  urlparse = urlparse::url_parse(url),
  httr2 = httr2::url_parse(url),
  curl = curl::curl_parse_url(url),
  urltools = urltools::url_parse(url),
  check = FALSE
))

show_relative(bm)

ggplot2::autoplot(bm)
```

Since `urlparse v0.1.999+` you can use the vectorised url parser `url_parse_v2`:

```{r benchmark_vectorise}
urls <- c(
"https://www.example.com",
"https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519",
"https://user_1:[email protected]:8080/dir/../api?q=1#frag",
"https://user:[email protected]",
"https://www.example.com:8080/search%3D1%2B3",
"https://www.google.co.jp/search?q=\u30c9\u30a4\u30c4",
"https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry",
"https://user:[email protected]:8080",
"https://user:[email protected]",
"https://[email protected]:8080",
"https://[email protected]"
)
(bm <- bench::mark(
  urlparse = lapply(urls, urlparse::url_parse),
  urlparse_v2 = urlparse::url_parse_v2(urls),
  httr2 = lapply(urls, httr2::url_parse),
  curl = lapply(urls, curl::curl_parse_url),
  urltools = urltools::url_parse(urls),
  check = FALSE
))

show_relative(bm)

ggplot2::autoplot(bm)
```

Note: `url_parse_v2` returns the parsed urls as a `data.frame`, which is similar to the behaviour of `urltools` and `adaR`:

```{r url_parse_v2}
urlparse::url_parse_v2(urls)
```
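
Since the result is a `data.frame`, the usual base R tools apply to it. The sketch below uses two made-up urls and inspects the result with `str()` rather than assuming particular column names; one row per input url is the expectation here, not something the text above states.

```r
library(urlparse)

# Hypothetical inputs for illustration.
urls <- c(
  "https://www.example.com/search?q=r",
  "http://example.org:8080/docs#intro"
)

parsed <- url_parse_v2(urls)

# Show the actual column names and types.
str(parsed)

# Check whether there is one row per input url.
nrow(parsed) == length(urls)
```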

### Encoding URL:

Note: `urltools` encodes special characters using lower-case hex digits, e.g. "?" -> "%3f" instead of "%3F".
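
Because of this case difference, two encoders can escape the same characters and still return strings that are not identical. A minimal base R sketch for comparing such outputs; `normalise_escapes` is a hypothetical helper, not part of `urlparse` or `urltools`, that upper-cases the hex digits of each percent-escape:

```r
# Hypothetical helper: upper-case the hex digits of every percent-escape.
normalise_escapes <- function(x) {
  # With perl = TRUE, "\\U" upper-cases the text of the captured escape.
  gsub("(%[0-9a-fA-F]{2})", "\\U\\1", x, perl = TRUE)
}

# Characters outside the escapes are left as they are.
normalise_escapes("a%3fb%2fc")

# Lower-case and upper-case escapes compare equal after normalising.
identical(normalise_escapes("%3f"), normalise_escapes("%3F"))
```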

```{r benchmark_encode_small}
string <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~`!@#$%^&*()=+[{]}\\|;:'\",<>/? "
(bm <- bench::mark(
  urlparse = urlparse::url_encoder(string),
  curl = curl::curl_escape(string),
  urltools = urltools::url_encode(string),
  base = URLencode(string, reserved = TRUE),
  check = FALSE
))

show_relative(bm)

ggplot2::autoplot(bm)
```

```{r benchmark_encode_large}
string <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~`!@#$%^&*()=+[{]}\\|;:'\",<>/? "
url <- paste0(sample(strsplit(string, "")[[1]], 1e4, replace = TRUE), collapse = "")
(bm <- bench::mark(
  urlparse = urlparse::url_encoder(url),
  curl = curl::curl_escape(url),
  urltools = urltools::url_encode(url),
  base = URLencode(url, reserved = TRUE, repeated = TRUE),
  check = FALSE,
  filter_gc = FALSE
))

show_relative(bm)

ggplot2::autoplot(bm)
```