https://github.com/dyfanjones/urlparse
Fast and simple url parser for R
- Host: GitHub
- URL: https://github.com/dyfanjones/urlparse
- Owner: DyfanJones
- License: other
- Created: 2025-01-06T17:08:13.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-06T14:05:42.000Z (3 months ago)
- Last Synced: 2025-02-06T14:12:11.537Z (3 months ago)
- Topics: cpp, r, url, url-parser, urlparser
- Language: C++
- Homepage: https://dyfanjones.r-universe.dev/urlparse
- Size: 728 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# urlparse

Fast and simple URL parser for R. Initially developed for the `paws.common` package.

```{r}
urlparse::url_parse("https://user:[email protected]:8000/path?query=1#fragment")
```

## Installation

You can install the development version of urlparse like so:
``` r
remotes::install_github("dyfanjones/urlparse")
```

r-universe installation:

```r
install.packages("urlparse", repos = c("https://dyfanjones.r-universe.dev", "https://cloud.r-project.org"))
```

## Example

The following examples show how to encode, decode, and modify URLs. Start by loading the package:

```{r example}
library(urlparse)
```

```{r encode}
url_encoder("foo = bar + 5")

url_decoder(url_encoder("foo = bar + 5"))
```

Similar to Python's `urllib.parse.quote`, `urlparse::url_encoder` supports the `safe` parameter, which lists additional ASCII characters that should not be encoded.
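
To make the idea of a `safe` argument concrete, here is a minimal base R sketch of percent-encoding. The helper `encode_with_safe` is hypothetical and written only for illustration; it is not how `urlparse` implements `url_encoder`, and it assumes the RFC 3986 unreserved set is always left untouched.

```r
# Hypothetical helper for illustration only (not part of urlparse).
# Percent-encodes every byte except RFC 3986 unreserved characters
# and any extra characters listed in `safe`.
encode_with_safe <- function(x, safe = "") {
  unreserved <- c(letters, LETTERS, 0:9, "-", ".", "_", "~")
  # Byte values that are passed through unchanged
  keep_bytes <- as.integer(charToRaw(paste0(c(unreserved, safe), collapse = "")))

  bytes <- as.integer(charToRaw(x))
  out <- sprintf("%%%02X", bytes)  # default: encode as upper-case %XX
  kept <- bytes %in% keep_bytes
  out[kept] <- rawToChar(as.raw(bytes[kept]), multiple = TRUE)
  paste(out, collapse = "")
}

encode_with_safe("foo = bar + 5")
encode_with_safe("foo = bar + 5", safe = "+")
```

With `safe = "+"` the plus sign is left as-is, while the spaces and `=` are still encoded.
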
```{python python_encode_safe}
from urllib.parse import quote
quote("foo = bar + 5", safe = "+")
```
```{r r_encode_safe}
url_encoder("foo = bar + 5", safe = "+")
```

Modify a URL by piping it through the `set_*` functions or by using the standalone `url_modify` function.

```{r url_modify}
url <- "http://example.com"
set_scheme(url, "https") |>
  set_port(1234L) |>
  set_path("foo/bar") |>
  set_query("baz") |>
  set_fragment("quux")

url_modify(url, scheme = "https", port = 1234, path = "foo/bar", query = "baz", fragment = "quux")
```

Note: it is faster to use `url_modify` than to pipe the `set_*` functions, because `urlparse` has to re-parse the URL inside each `set_*` call.

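
The cost of repeated parsing can be sketched with a toy model in base R. Everything below (`toy_parse`, `toy_build`, `toy_set`, `toy_modify`, and the `n_parses` counter) is hypothetical and exists only to count parse calls; it is not `urlparse`'s implementation, and it ignores userinfo, IPv6 hosts, and encoding. Unlike the real functions, the toy builder expects the path to start with `/`.

```r
# Hypothetical toy helpers for illustration only (not part of urlparse).
n_parses <- 0L

toy_parse <- function(url) {
  n_parses <<- n_parses + 1L  # count every parse
  # Component regex adapted from RFC 3986, Appendix B
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  authority <- strsplit(m[5], ":", fixed = TRUE)[[1]]
  list(
    scheme = m[3],
    host = if (length(authority) > 0) authority[1] else "",
    port = if (length(authority) > 1) authority[2] else "",
    path = m[6], query = m[8], fragment = m[10]
  )
}

toy_build <- function(p) {
  paste0(
    p$scheme, "://", p$host,
    if (nzchar(p$port)) paste0(":", p$port) else "",
    p$path,
    if (nzchar(p$query)) paste0("?", p$query) else "",
    if (nzchar(p$fragment)) paste0("#", p$fragment) else ""
  )
}

# One parse per changed field
toy_set <- function(url, field, value) {
  p <- toy_parse(url)
  p[[field]] <- as.character(value)
  toy_build(p)
}

# One parse for all changed fields
toy_modify <- function(url, ...) {
  p <- toy_parse(url)
  changes <- list(...)
  p[names(changes)] <- lapply(changes, as.character)
  toy_build(p)
}

url <- "http://example.com"

n_parses <- 0L
piped <- url |>
  toy_set("scheme", "https") |>
  toy_set("port", 1234L) |>
  toy_set("path", "/foo/bar") |>
  toy_set("query", "baz") |>
  toy_set("fragment", "quux")
parses_piped <- n_parses

n_parses <- 0L
single <- toy_modify(url, scheme = "https", port = 1234L, path = "/foo/bar",
                     query = "baz", fragment = "quux")
parses_single <- n_parses

c(piped = parses_piped, single = parses_single)
```

Both routes produce the same URL in this toy model, but the piped version parses five times and the single call parses once.
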
```{r url_mod_bench}
url <- "http://example.com"
bench::mark(
  piping = {
    set_scheme(url, "https") |>
      set_port(1234L) |>
      set_path("foo/bar") |>
      set_query("baz") |>
      set_fragment("quux")
  },
  single_function = url_modify(url, scheme = "https", port = 1234, path = "foo/bar", query = "baz", fragment = "quux")
)
```

## Benchmark:

```{r, echo = FALSE}
show_relative <- function(bm) {
summary_cols <- c("min", "median", "itr/sec", "mem_alloc", "gc/sec")
bm[summary_cols] <- lapply(bm[summary_cols], function(x) as.numeric(x / min(x)))
return(bm)
}
```

### Parsing URL:

```{r benchmark}
url <- "https://user:[email protected]:8000/path?query=1#fragment"
(bm <- bench::mark(
  urlparse = urlparse::url_parse(url),
  httr2 = httr2::url_parse(url),
  curl = curl::curl_parse_url(url),
  urltools = urltools::url_parse(url),
  check = FALSE
))

show_relative(bm)
ggplot2::autoplot(bm)
```

Since `urlparse v0.1.999+` you can use the vectorised URL parser `url_parse_v2`:

```{r benchmark_vectorise}
urls <- c(
"https://www.example.com",
"https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519",
"https://user_1:[email protected]:8080/dir/../api?q=1#frag",
"https://user:[email protected]",
"https://www.example.com:8080/search%3D1%2B3",
"https://www.google.co.jp/search?q=\u30c9\u30a4\u30c4",
"https://www.example.com:8080?var1=foo&var2=ba%20r&var3=baz+larry",
"https://user:[email protected]:8080",
"https://user:[email protected]",
"https://[email protected]:8080",
"https://[email protected]"
)
(bm <- bench::mark(
  urlparse = lapply(urls, urlparse::url_parse),
  urlparse_v2 = urlparse::url_parse_v2(urls),
  httr2 = lapply(urls, httr2::url_parse),
  curl = lapply(urls, curl::curl_parse_url),
  urltools = urltools::url_parse(urls),
  check = FALSE
))

show_relative(bm)
ggplot2::autoplot(bm)
```

Note: `url_parse_v2` returns the parsed URLs as a `data.frame`, which is similar to the behaviour of `urltools` and `adaR`:

```{r url_parse_v2}
urlparse::url_parse_v2(urls)
```

### Encoding URL:

Note: `urltools` encodes special characters as lower-case hex, e.g. "?" -> "%3f" instead of "%3F".
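
Percent-encoded hex digits are case-insensitive, so both forms decode to the same character even though the encoded strings are not identical. A small base R sketch, where `normalise_hex` is a hypothetical helper (not part of any of the benchmarked packages) that upper-cases the escapes so outputs can be compared:

```r
# Hypothetical helper for illustration only.
# Upper-cases every %xx escape so differently-cased encodings compare equal.
normalise_hex <- function(x) {
  gsub("(%[0-9A-Fa-f]{2})", "\\U\\1", x, perl = TRUE)
}

lower <- "a%3fb%2f"
upper <- "a%3Fb%2F"

identical(lower, upper)                  # the raw strings differ
identical(normalise_hex(lower), upper)   # equal after normalisation
identical(utils::URLdecode(lower), utils::URLdecode(upper))  # same decoded text
```
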
```{r benchmark_encode_small}
string <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~`!@#$%^&*()=+[{]}\\|;:'\",<>/? "
(bm <- bench::mark(
  urlparse = urlparse::url_encoder(string),
  curl = curl::curl_escape(string),
  urltools = urltools::url_encode(string),
  base = URLencode(string, reserved = TRUE),
  check = FALSE
))

show_relative(bm)
ggplot2::autoplot(bm)
```

```{r benchmark_encode_large}
string <- "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~`!@#$%^&*()=+[{]}\\|;:'\",<>/? "
url <- paste0(sample(strsplit(string, "")[[1]], 1e4, replace = TRUE), collapse = "")
(bm <- bench::mark(
  urlparse = urlparse::url_encoder(url),
  curl = curl::curl_escape(url),
  urltools = urltools::url_encode(url),
  base = URLencode(url, reserved = TRUE, repeated = TRUE),
  check = FALSE,
  filter_gc = FALSE
))

show_relative(bm)
ggplot2::autoplot(bm)
```