{"id":14990035,"url":"https://github.com/hrbrmstr/curlparse","last_synced_at":"2025-04-12T02:03:57.481Z","repository":{"id":141238082,"uuid":"148158483","full_name":"hrbrmstr/curlparse","owner":"hrbrmstr","description":"📃Parse 'URLs' with 'libcurl'","archived":true,"fork":false,"pushed_at":"2024-10-26T09:38:43.000Z","size":467,"stargazers_count":16,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-12T02:03:49.277Z","etag":null,"topics":["libcurl","r","r-cyber","rstats","url-parse"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hrbrmstr.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-10T13:19:31.000Z","updated_at":"2025-03-22T11:03:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"8d1128f0-c005-4055-a3db-5b8e4dd1a5d4","html_url":"https://github.com/hrbrmstr/curlparse","commit_stats":{"total_commits":13,"total_committers":1,"mean_commits":13.0,"dds":0.0,"last_synced_commit":"1abd213df6fc4ac003546a4733e029e87675e594"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fcurlparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fcurlparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fcurlparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hrbrmstr%2Fcurlparse/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hrbrmstr","download_url":"https://codeload.github.com/hrbrmstr/curlparse/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248505863,"owners_count":21115354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["libcurl","r","r-cyber","rstats","url-parse"],"created_at":"2024-09-24T14:19:21.956Z","updated_at":"2025-04-12T02:03:57.449Z","avatar_url":"https://github.com/hrbrmstr.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: rmarkdown::github_document\neditor_options: \n  chunk_output_type: console\n---\n```{r pkg-knitr-opts, include=FALSE}\nhrbrpkghelpr::global_opts()\n```\n\n```{r badges, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::stinking_badges()\n```\n# curlparse\n\nParse 'URLs' with 'libcurl'\n\n## Description\n\nTools are provided to parse URLs using the modern 'libcurl' built-in parser.\n\n## NOTE\n\nYou _need_ to have `libcurl` \u003e= 7.62.0 for this to work since that's when it began to expose the URL parsing API. \n\nmacOS users can do:\n\n```\n$ brew install curl\n```\n\n(provided you're using [Homebrew](https://brew.sh/)).\n\nWindows users are able to just install this package since it uses the same, clever \"anticonf\" that Jeroen uses in the [{curl} package](https://github.com/jeroen/curl). 
\n\nThe state of the availability of `libcurl` v7.62.0 across Linux distributions is sketchy at best (as an example, Ubuntu bionic and cosmic are not even remotely at the current version). If your distribution does not have \u003e= 7.62.0 available you will need to [compile and install it manually](https://curl.haxx.se/download.html) ensuring the library and headers are available to R to build the package.\n\n## What's Inside The Tin\n\nThe following functions are implemented:\n\n```{r ingredients, results='asis', echo=FALSE, cache=FALSE}\nhrbrpkghelpr::describe_ingredients()\n```\n\n## Installation\n\n```{r install-ex, results='asis', echo = FALSE}\nhrbrpkghelpr::install_block()\n```\n\n## Usage\n\n```{r message=FALSE, warning=FALSE, error=FALSE}\nlibrary(curlparse)\n\n# current version\npackageVersion(\"curlparse\")\n\n```\n\n### Process Some URLs\n\n```{r libs}\nlibrary(urltools)\nlibrary(rvest)\nlibrary(curlparse)\nlibrary(tidyverse)\n\n```\n```{r cache=TRUE}\nread_html(\"https://www.r-bloggers.com/blogs-list/\") %\u003e% \n  html_nodes(xpath=\".//li[contains(., 'Contributing Blogs')]/ul/li/a[contains(@href, 'http')]\") %\u003e% \n  html_attr(\"href\") -\u003e blog_urls\n\n```\n```{r}\n(parsed \u003c- parse_curl(blog_urls))\n\ncount(parsed, scheme, sort=TRUE)\n\nfilter(parsed, !is.na(query))\n```\n\n### Benchmark\n\n`curlparse` includes a `url_parse()` function to make it easier to use this package for current users of `urltools::url_parse()` since it provides the same API and same results back (including it being a regular data frame and not a `tbl`). \n\nSpoiler alert: `urltools::url_parse()` is faster by ~100µs (per-100 URLs) for \"good\" URLs (if there's a mix of gnarly/bad URLs and valid ones they get closer to being on-par). The aim was not to try to beat it, though. 
\n\n\u003ePer the [blog post introducing this new set of API calls](https://daniel.haxx.se/blog/2018/09/09/libcurl-gets-a-url-api/):\n\u003e\n\u003eApplications that pass in URLs to libcurl would of course still very often need to parse URLs, create URLs or otherwise handle them, but libcurl has not been helping with that.\n\u003e\n\u003eAt the same time, the under-specification of URLs has led to a situation where there's really no stable document anywhere describing how URLs are supposed to work and basically every implementer is left to handle the WHATWG URL spec, RFC 3986 and the world in between all by themselves. Understanding how their URL parsing libraries, libcurl, other tools and their favorite browsers differ is complicated.\n\u003e\n\u003eBy offering applications access to libcurl's own URL parser, we hope to tighten a problematic vulnerable area for applications where the URL parser library would believe one thing and libcurl another. This could and has sometimes lead to security problems. (See for example Exploiting URL Parser in Trending Programming Languages! by Orange Tsai)\n\nSo, using this library adds consistency with how `libcurl` sees and handles URLs.\n\n```{r}\nlibrary(microbenchmark)\n\nset.seed(0)\ntest_urls \u003c- sample(blog_urls, 100) # pick 100 URLs at random\n\nmicrobenchmark(\n  curlparse = curlparse::url_parse(test_urls),\n  urltools = urltools::url_parse(test_urls), # we loaded urltools before curlparse at the top so namespace loading wasn't a factor for the benchmarks\n  times = 500\n) -\u003e mb\n\nmb\n\nautoplot(mb)\n```\n\nThe individual handlers are a bit more on-par but mostly still slower (except for `fragment()`). 
Note that `urltools` has no equivalent function to just extract query strings so that's not in the test.\n\n```{r fig.width=6, fig.height=6}\nbind_rows(\n  microbenchmark(curlparse = curlparse::scheme(blog_urls), urltools = urltools::scheme(blog_urls)) %\u003e%\n    mutate(test = \"scheme\"),\n  microbenchmark(curlparse = curlparse::domain(blog_urls), urltools = urltools::domain(blog_urls)) %\u003e%\n    mutate(test = \"domain\"),\n  microbenchmark(curlparse = curlparse::port(blog_urls), urltools = urltools::port(blog_urls)) %\u003e%\n    mutate(test = \"port\"),\n  microbenchmark(curlparse = curlparse::path(blog_urls), urltools = urltools::path(blog_urls)) %\u003e%\n    mutate(test = \"path\"),\n  microbenchmark(curlparse = curlparse::fragment(blog_urls), urltools = urltools::fragment(blog_urls)) %\u003e%\n    mutate(test = \"fragment\")\n) %\u003e% \n  mutate(test = factor(test, levels=c(\"scheme\", \"domain\", \"port\", \"path\", \"fragment\"))) %\u003e% \n  mutate(time = time / 1000000) %\u003e% \n  ggplot(aes(expr, time)) +\n  geom_violin(aes(fill=expr), show.legend = FALSE) +\n  scale_y_continuous(name = \"milliseconds\", expand = c(0,0), limits=c(0, NA)) +\n  hrbrthemes::scale_fill_ft() +\n  facet_wrap(~test, ncol = 1) +\n  coord_flip() +\n  labs(x=NULL) +\n  hrbrthemes::theme_ft_rc(grid=\"XY\", strip_text_face = \"bold\") +\n  theme(panel.spacing.y=unit(0, \"lines\"))\n```\n\n```{r echo=FALSE}\nunloadNamespace(\"urltools\")\n```\n\n### Stress Test\n\n```{r}\nc(\n  \"\", \"foo\", \"foo;params?query#fragment\", \"http://foo.com/path\", \"http://foo.com\",\n  \"//foo.com/path\", \"//user:pass@foo.com/\", \"http://user:pass@foo.com/\", \n  \"file:///tmp/junk.txt\", \"imap://mail.python.org/mbox1\",\n  \"mms://wms.sys.hinet.net/cts/Drama/09006251100.asf\", \"nfs://server/path/to/file.txt\",\n  \"svn+ssh://svn.zope.org/repos/main/ZConfig/trunk/\",\n  \"git+ssh://git@github.com/user/project.git\", \"HTTP://WWW.PYTHON.ORG/doc/#frag\",\n  
\"http://www.python.org:080/\", \"http://www.python.org:/\", \"javascript:console.log('hello')\",\n  \"javascript:console.log('hello');console.log('world')\", \"http://example.com/?\", \n  \"http://example.com/;\", \"tel:0108202201\", \"unknown:0108202201\",\n  \"http://user@example.com:8080/path;param?query#fragment\", \n  \"http://www.python.org:65536/\", \"http://www.python.org:-20/\",\n  \"http://www.python.org:8589934592/\", \"http://www.python.org:80hello/\", \n  \"http://:::cnn.com/\", \"http://./\", \"http://foo..com/\", \"http://foo../\"\n) -\u003e ugly_urls\n\n(u_parsed \u003c- parse_curl(ugly_urls))\n\nfilter(u_parsed, !is.na(scheme))\n\nfilter(u_parsed, !is.na(user))\n\nfilter(u_parsed, !is.na(password))\n\nfilter(u_parsed, !is.na(host))\n\nfilter(u_parsed, !is.na(path))\n\nfilter(u_parsed, !is.na(query))\n\nfilter(u_parsed, !is.na(fragment))\n```\n\nMake sure the vector extractors work the same as the data frame converter:\n\n```{r}\nall(\n  c(\n    identical(u_parsed$scheme, scheme(ugly_urls)),\n    identical(u_parsed$user, user(ugly_urls)),\n    identical(u_parsed$password, password(ugly_urls)),\n    identical(u_parsed$host, host(ugly_urls)),\n    identical(u_parsed$path, path(ugly_urls)),\n    identical(u_parsed$query, query(ugly_urls)),\n    identical(u_parsed$fragment, fragment(ugly_urls))\n  )\n)\n```\n\n## curlparse Metrics\n\n```{r cloc, echo=FALSE}\ncloc::cloc_pkg_md()\n```\n\n## Code of Conduct\n\nPlease note that this project is released with a Contributor Code of Conduct.\nBy participating in this project you agree to abide by its terms.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fcurlparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhrbrmstr%2Fcurlparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhrbrmstr%2Fcurlparse/lists"}