https://github.com/coolbutuseless/fugly

Extract named substrings using named capture groups in regular expressions.
https://github.com/coolbutuseless/fugly
Last synced: about 1 month ago
JSON representation
Extract named substrings using named capture groups in regular expressions.
Host: GitHub
URL: https://github.com/coolbutuseless/fugly
Owner: coolbutuseless
License: mit
Created: 2021-03-19T12:43:21.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2021-04-28T13:42:42.000Z (almost 4 years ago)
Last Synced: 2025-03-18T01:11:19.793Z (about 1 month ago)
Language: R
Size: 324 KB
Stars: 34
Watchers: 2
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project

jimsghstars - coolbutuseless/fugly - Extract named substrings using named capture groups in regular expressions. (R)
README

        ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = FALSE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

library(fugly)

```

```{r echo = FALSE, eval = FALSE}

# Quick logo generation. Borrowed heavily from Nick Tierney's Syn logo process

library(magick)

library(showtext)

font_add_google("Allura", "gf")

# pkgdown::build_site(override = list(destination = "../coolbutuseless.github.io/package/minipdf"))

```

```{r echo = FALSE, eval = FALSE}

img <- image_read("man/figures/white.png") #%>%

  # image_transparent(color = "#f9fafb", fuzz = 10) %>%

  # image_trim() %>%

  # image_threshold()

hexSticker::sticker(subplot  = img,

                    s_x      = 0.92,

                    s_y      = 1.2,

                    s_width  = 1.5,

                    s_height = 0.95,

                    package  = "fugly",

                    p_x      = 1,

                    p_y      = 1.15,

                    p_color  = "#223344",

                    p_family = "gf",

                    p_size   = 23,

                    h_size   = 1.2,

                    h_fill   = "#ffffff",

                    h_color  = "#223344",

                    filename = "man/figures/logo.png")

image_read("man/figures/logo.png")

```

# fugly 

![](https://img.shields.io/badge/cool-useless-green.svg)

This package provides a single function (`str_capture`) for using named capture 

groups to extract values from strings. A key requirement for readability is that 

the names of the capture groups are specified inline as part of the regex, 

and not in an external vector or as separate names.

`fugly::str_capture()` is implemented as a wrapper around 

[stringr](https://cran.r-project.org/package=stringr). This is because `stringr` 

itself does not yet do named capture groups (See issues for 

[stringr](https://github.com/tidyverse/stringr/issues/71) and 

[stringi](https://github.com/gagolews/stringi/issues/153)).

`fugly::str_capture()` is very similar to a number of existing packages. See

table below for a comparison.

| Method                      | Speed    | Inline capture group naming | robust |

|-----------------------------|----------|-----------------------------|--------|

| `fugly::str_capture`        | Fast     | Yes                         | No     |

| `rr4r::rr4r_extract_groups` | Fast     | Yes                         | Yes    |

| `nc::capture_first_vec`     | Fast     | No                          | Yes    |

| `tidy::extract`             | Fast     | No                          | Yes    |

| `utils::strcapture`         | Middling | No                          | Yes    |

| `unglue::unglue`            | Slow     | Yes                         | Yes    |

| `ore::ore_search`           | Slow     | Yes                         | Yes    |

### What do I mean when I say `fugly::str_capture()` is unsafe/dodgy/non-robust?

-   It doesn't adhere to standard regular expression syntax for named capture groups as used in perl, python etc.

-   It doesn't really adhere to `glue` syntax (although it looks similar at a surface level).

-   If you specify delimiters which appear in your string input, then you're going to have a bad time.

-   It's generally only been tested on data which is:

    -   highly structured

    -   only ASCII

    -   non-pathological

### What's in the box?

-   `fugly::str_capture(string, pattern, delim)`

    -   capture named groups with regular expressions

    -   returns a data.frame with all columns containing character strings

    -   can mix-and-match with non-capturing regular expressions

    -   if no regular expression specified for a named group then `.*?` is used.

    -   does not do any type guessing/conversion.

## Installation

You can install from [GitHub](https://github.com/coolbutuseless/fugly) with:

``` r

# install.package('remotes')

remotes::install_github('coolbutuseless/fugly')

```

## Example 1

In the following example:

-   Input consists of multiple strings

-   capture groups are delimited by `{}` by default.

-   the regex for the capture group for `name` is unspecified, so `.*?` will be used

-   the regex for the capture group for `age` is `\d+` i.e. match must consist of 1-or-more digits

```{r example}

library(fugly)

string <- c(

  "information: Name:greg Age:27 ",

  "information: Name:mary Age:34 "

)

str_capture(string, pattern = "Name:{name} Age:{age=\\d+}")

```

## Example 2

A more complicated example:

-   Note the mixture of capturing groups and a bare `.*?` in the pattern which is not returned as a result

```{r}

string <- c(

'{"type":"Feature","properties":{"hash":"1348778913c0224a","number":"27","street":"BANAMBILA STREET","unit":"","city":"ARANDA","district":"","region":"ACT","postcode":"2614","id":"GAACT714851647"},"geometry":{"type":"Point","coordinates":[149.0826143,-35.2545558]}}',

'{"type":"Feature","properties":{"hash":"dc776871c868bc7e","number":"139","street":"BOUVERIE STREET","unit":"UNIT 711","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC423944917"},"geometry":{"type":"Point","coordinates":[144.9617149,-37.8032551]}}',

'{"type":"Feature","properties":{"hash":"8197f34a40ccad47","number":"6","street":"MOGRIDGE STREET","unit":"","city":"WARWICK","district":"","region":"QLD","postcode":"4370","id":"GAQLD155949502"},"geometry":{"type":"Point","coordinates":[152.0230999,-28.2230133]}}',

'{"type":"Feature","properties":{"hash":"18edc96308fc1a8e","number":"22","street":"ORR STREET","unit":"UNIT 507","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC424282716"},"geometry":{"type":"Point","coordinates":[144.9653484,-37.8063371]}}'

)

str_capture(string, pattern = '"number":"{number}","street":"{street}".*?"coordinates":\\[{coords}\\]')

```

## Simple Benchmark

I acknowledge that this isn't the greatest benchmark, but it is relevant to my current use-case.

-   [nc](https://github.com/tdhock/nc) with the PCRE regex engine is the fastest named capture I could find in R.

    -   However - I'm not a huge fan of its syntax

-   For large inputs (1000+ input strings), `fugly` is significantly faster than `unglue`, `utils::strcapture` and \``ore`

-   The rust regex engine [rr4r](https://github.com/yutannihilation/rr4r) is slightly faster than `fugly`

-   `unglue` is the slowest of the methods.

-   `ore` lies somewhere between `unglue` and `utils::strcapture`

-   As pointed out by [Michael Barrowman](https://twitter.com/MyKo101AB), `tidyr::extract()` will also do named capture into a data.frame.

    -   Similar to `utils::strcapture()`, the names are not specified inline with the regex, but are listed separately.

```{r warning=FALSE, message=FALSE}

# remotes::install_github("jonclayden/ore")

# remotes::install_github("yutannihilation/rr4r")

# remotes::install_github('qinwf/re2r') 

library(ore)

library(rr4r)

library(unglue)

library(ggplot2)

library(tidyr)

# meaningless strings for benchmarking

N <- 1000

string <- paste0("Information name:greg age:", seq(N))

res <- bench::mark(

  `fugly::str_capture()` = fugly::str_capture(string, "name:{name} age:{age=\\d+}"),

  `unglue::unglue()` = unglue::unglue_data(string, "Information name:{name} age:{age=\\d+}"),

  `utils::strcapture()` = utils::strcapture("Information name:(.*?) age:(\\d+)", string, 

                    proto = data.frame(name=character(), age=character())),

  `ore::ore_search()` = do.call(rbind.data.frame, lapply(ore_search(ore('name:(?.*?) age:(?\\d+)', encoding='utf8'), string, all=TRUE), function(x) {x$groups$matches})),

   `rr4r::rr4r_extract_groups()` = rr4r::rr4r_extract_groups(string, "name:(?P.*?) age:(?P\\d+)"),

  `nc::capture_first_vec() PCRE` = nc::capture_first_vec(string, "Information name:", name=".*?", " age:", age="\\d+", engine = 'PCRE'),

  `tidyr::extract()` = tidyr::extract(data.frame(x = string), x, into = c('name', 'age'), regex = 'name:(.*?) age:(\\d+)'),

  check = FALSE

)

```

```{r echo=FALSE}

plot(res) + 

  theme_bw() + 

  theme(legend.position = 'bottom')

```

## Related Software

-   [stringr](https://cran.r-project.org/package=stringr)

-   `utils::strcapture()`

-   [unglue::unglue()](%5Bunglue%5D(https://cran.r-project.org/web/packages/unglue/index.html))

-   [ore](https://github.com/jonclayden/ore), [ore on CRAN](https://cran.r-project.org/package=ore)

-   [namedCapture](https://cran.r-project.org/web/packages/namedCapture/index.html) Note: I couldn't get this to work sanely.

-   [rr4f](https://github.com/yutannihilation/rr4r) rust regex engine

-   [nc](https://github.com/tdhock/nc)

## Acknowledgements

-   R Core for developing and maintaining the language.

-   CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/coolbutuseless/fugly

Awesome Lists containing this project

README