https://github.com/coolbutuseless/fugly
Extract named substrings using named capture groups in regular expressions.
- Host: GitHub
- URL: https://github.com/coolbutuseless/fugly
- Owner: coolbutuseless
- License: mit
- Created: 2021-03-19T12:43:21.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-04-28T13:42:42.000Z (over 3 years ago)
- Last Synced: 2024-10-12T21:24:57.350Z (about 1 month ago)
- Language: R
- Size: 324 KB
- Stars: 34
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - coolbutuseless/fugly - Extract named substrings using named capture groups in regular expressions. (R)
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(fugly)
```

```{r echo = FALSE, eval = FALSE}
# Quick logo generation. Borrowed heavily from Nick Tierney's Syn logo process
library(magick)
library(showtext)
font_add_google("Allura", "gf")

# pkgdown::build_site(override = list(destination = "../coolbutuseless.github.io/package/minipdf"))
```

```{r echo = FALSE, eval = FALSE}
img <- image_read("man/figures/white.png") #%>%
  # image_transparent(color = "#f9fafb", fuzz = 10) %>%
  # image_trim() %>%
  # image_threshold()

hexSticker::sticker(
  subplot  = img,
  s_x      = 0.92,
  s_y      = 1.2,
  s_width  = 1.5,
  s_height = 0.95,
  package  = "fugly",
  p_x      = 1,
  p_y      = 1.15,
  p_color  = "#223344",
  p_family = "gf",
  p_size   = 23,
  h_size   = 1.2,
  h_fill   = "#ffffff",
  h_color  = "#223344",
  filename = "man/figures/logo.png"
)

image_read("man/figures/logo.png")
```

# fugly

![](https://img.shields.io/badge/cool-useless-green.svg)

This package provides a single function, `str_capture()`, for using named capture
groups to extract values from strings. A key requirement for readability is that
the names of the capture groups are specified inline as part of the regex,
and not in an external vector or as separate names.

`fugly::str_capture()` is implemented as a wrapper around
[stringr](https://cran.r-project.org/package=stringr). This is because `stringr`
itself does not yet support named capture groups (see the issues for
[stringr](https://github.com/tidyverse/stringr/issues/71) and
[stringi](https://github.com/gagolews/stringi/issues/153)).

`fugly::str_capture()` is very similar to functions in a number of existing packages. See
the table below for a comparison.

| Method                      | Speed    | Inline capture group naming | Robust |
|-----------------------------|----------|-----------------------------|--------|
| `fugly::str_capture`        | Fast     | Yes                         | No     |
| `rr4r::rr4r_extract_groups` | Fast     | Yes                         | Yes    |
| `nc::capture_first_vec`     | Fast     | No                          | Yes    |
| `tidyr::extract`            | Fast     | No                          | Yes    |
| `utils::strcapture`         | Middling | No                          | Yes    |
| `unglue::unglue`            | Slow     | Yes                         | Yes    |
| `ore::ore_search`           | Slow     | Yes                         | Yes    |

### What do I mean when I say `fugly::str_capture()` is unsafe/dodgy/non-robust?

- It doesn't adhere to standard regular expression syntax for named capture groups as used in perl, python etc.
- It doesn't really adhere to `glue` syntax (although it looks similar at a surface level).
- If you specify delimiters which appear in your string input, then you're going to have a bad time.
- It's generally only been tested on data which is:
    - highly structured
    - only ASCII
    - non-pathological

### What's in the box?

- `fugly::str_capture(string, pattern, delim)`
    - capture named groups with regular expressions
    - returns a data.frame with all columns containing character strings
    - can mix-and-match with non-capturing regular expressions
    - if no regular expression is specified for a named group then `.*?` is used
    - does not do any type guessing/conversion

## Installation

You can install from [GitHub](https://github.com/coolbutuseless/fugly) with:

``` r
# install.packages('remotes')
remotes::install_github('coolbutuseless/fugly')
```

## Example 1

In the following example:
- Input consists of multiple strings
- capture groups are delimited by `{}` by default.
- the regex for the capture group for `name` is unspecified, so `.*?` will be used
- the regex for the capture group for `age` is `\d+` i.e. match must consist of 1-or-more digits

```{r example}
library(fugly)

string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)

str_capture(string, pattern = "Name:{name} Age:{age=\\d+}")
```

## Example 2

A more complicated example:
- Note the mixture of capture groups and a bare `.*?` in the pattern. The bare `.*?` takes part in the match but is not returned in the result.

```{r}
string <- c(
'{"type":"Feature","properties":{"hash":"1348778913c0224a","number":"27","street":"BANAMBILA STREET","unit":"","city":"ARANDA","district":"","region":"ACT","postcode":"2614","id":"GAACT714851647"},"geometry":{"type":"Point","coordinates":[149.0826143,-35.2545558]}}',
'{"type":"Feature","properties":{"hash":"dc776871c868bc7e","number":"139","street":"BOUVERIE STREET","unit":"UNIT 711","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC423944917"},"geometry":{"type":"Point","coordinates":[144.9617149,-37.8032551]}}',
'{"type":"Feature","properties":{"hash":"8197f34a40ccad47","number":"6","street":"MOGRIDGE STREET","unit":"","city":"WARWICK","district":"","region":"QLD","postcode":"4370","id":"GAQLD155949502"},"geometry":{"type":"Point","coordinates":[152.0230999,-28.2230133]}}',
'{"type":"Feature","properties":{"hash":"18edc96308fc1a8e","number":"22","street":"ORR STREET","unit":"UNIT 507","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC424282716"},"geometry":{"type":"Point","coordinates":[144.9653484,-37.8063371]}}'
)

str_capture(string, pattern = '"number":"{number}","street":"{street}".*?"coordinates":\\[{coords}\\]')
```
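
To make the pattern syntax used in the two examples more concrete, here is a minimal base R sketch of one way a `{name=regex}` pattern could be translated into an ordinary regex with capture groups. It is an illustration only and is not how `fugly` is implemented (the package wraps `stringr`). The helper name `capture_sketch()` is hypothetical, and the sketch assumes `{}` delimiters, at least one placeholder in the pattern, and no `}` inside a group's regex.

```r
# Hypothetical helper, NOT part of fugly: an illustrative sketch only.
capture_sketch <- function(string, pattern) {
  # Locate every {name} or {name=regex} placeholder in the pattern
  token_re <- "\\{([A-Za-z_.][A-Za-z0-9_.]*)(=[^}]*)?\\}"
  m        <- gregexpr(token_re, pattern)
  tokens   <- regmatches(pattern, m)[[1]]

  # Strip the braces, then split each placeholder into name and regex.
  # A placeholder without "=regex" falls back to the lazy wildcard .*?
  inner  <- substring(tokens, 2, nchar(tokens) - 1)
  has_rx <- grepl("=", inner, fixed = TRUE)
  nms    <- sub("=.*$", "", inner)
  rx     <- ifelse(has_rx, sub("^[^=]*=", "", inner), ".*?")

  # Swap each placeholder for an ordinary (unnamed) capture group
  plain <- pattern
  regmatches(plain, m) <- list(paste0("(", rx, ")"))

  # regexec() gives the full match followed by one entry per group;
  # strings with no match become a row of NA
  hits <- regmatches(string, regexec(plain, string, perl = TRUE))
  rows <- lapply(hits, function(x) {
    if (length(x) == 0) rep(NA_character_, length(nms)) else x[-1]
  })

  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- nms
  out
}

string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)
res <- capture_sketch(string, "Name:{name} Age:{age=\\d+}")
```

Like `str_capture()`, the sketch returns character columns only. A group regex containing `}` (for example `\d{2}`) would be mis-parsed, which is the same kind of delimiter fragility listed in the robustness caveats.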
## Simple Benchmark
I acknowledge that this isn't the greatest benchmark, but it is relevant to my current use-case.
- [nc](https://github.com/tdhock/nc) with the PCRE regex engine is the fastest named capture I could find in R.
    - However, I'm not a huge fan of its syntax.
- For large inputs (1000+ input strings), `fugly` is significantly faster than `unglue`, `utils::strcapture` and `ore`.
- The Rust regex engine [rr4r](https://github.com/yutannihilation/rr4r) is slightly faster than `fugly`.
- `unglue` is the slowest of the methods.
- `ore` lies somewhere between `unglue` and `utils::strcapture`.
- As pointed out by [Michael Barrowman](https://twitter.com/MyKo101AB), `tidyr::extract()` will also do named capture into a data.frame.
    - Similar to `utils::strcapture()`, the names are not specified inline with the regex, but are listed separately.

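
For contrast with inline naming, the following small base R sketch shows what "names listed separately" looks like with `utils::strcapture()`. The input strings are borrowed from Example 1. The groups in the regex are anonymous, and the column names (and types) come from the `proto` data.frame.

```r
string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)

# The regex has two unnamed groups; `proto` supplies the names in order
res <- utils::strcapture(
  pattern = "Name:(.*?) Age:(\\d+)",
  x       = string,
  proto   = data.frame(name = character(), age = character()),
  perl    = TRUE
)
```

The mapping between a group and its name is purely positional here, so reordering the groups in the regex means reordering `proto` as well.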
```{r warning=FALSE, message=FALSE}
# remotes::install_github("jonclayden/ore")
# remotes::install_github("yutannihilation/rr4r")
# remotes::install_github('qinwf/re2r')
library(ore)
library(rr4r)
library(unglue)
library(ggplot2)
library(tidyr)

# meaningless strings for benchmarking
N <- 1000
string <- paste0("Information name:greg age:", seq(N))

res <- bench::mark(
  `fugly::str_capture()` = fugly::str_capture(string, "name:{name} age:{age=\\d+}"),
  `unglue::unglue()` = unglue::unglue_data(string, "Information name:{name} age:{age=\\d+}"),
  `utils::strcapture()` = utils::strcapture("Information name:(.*?) age:(\\d+)", string,
                                            proto = data.frame(name=character(), age=character())),
  `ore::ore_search()` = do.call(rbind.data.frame, lapply(ore_search(ore('name:(?<name>.*?) age:(?<age>\\d+)', encoding='utf8'), string, all=TRUE), function(x) {x$groups$matches})),
  `rr4r::rr4r_extract_groups()` = rr4r::rr4r_extract_groups(string, "name:(?P<name>.*?) age:(?P<age>\\d+)"),
  `nc::capture_first_vec() PCRE` = nc::capture_first_vec(string, "Information name:", name=".*?", " age:", age="\\d+", engine = 'PCRE'),
  `tidyr::extract()` = tidyr::extract(data.frame(x = string), x, into = c('name', 'age'), regex = 'name:(.*?) age:(\\d+)'),
  check = FALSE
)
```

```{r echo=FALSE}
plot(res) +
  theme_bw() +
  theme(legend.position = 'bottom')
```

## Related Software

- [stringr](https://cran.r-project.org/package=stringr)
- `utils::strcapture()`
- [unglue::unglue()](https://cran.r-project.org/web/packages/unglue/index.html)
- [ore](https://github.com/jonclayden/ore), [ore on CRAN](https://cran.r-project.org/package=ore)
- [namedCapture](https://cran.r-project.org/web/packages/namedCapture/index.html) Note: I couldn't get this to work sanely.
- [rr4r](https://github.com/yutannihilation/rr4r) Rust regex engine
- [nc](https://github.com/tdhock/nc)

## Acknowledgements

- R Core for developing and maintaining the language.
- CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository