An open API service indexing awesome lists of open source software.

https://github.com/coolbutuseless/oomph

Faster subsetting of vectors and lists by name
https://github.com/coolbutuseless/oomph

package r rpackage

Last synced: 9 months ago
JSON representation

Faster subsetting of vectors and lists by name

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = FALSE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)

library(oomph)
```

# oomph

![](https://img.shields.io/badge/cool-useless-green.svg)
[![R-CMD-check](https://github.com/coolbutuseless/oomph/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/coolbutuseless/oomph/actions/workflows/R-CMD-check.yaml)

`oomph` is a package for fast string matching within a static set of strings.
This is useful for fast named look-up in fixed lists and vectors.

Internally this uses a hash-map in C to map strings to integers. In R,
this appears as a minimal perfect hash where each string maps to its index,
and unknown strings return `NA`

The hashed look-up can be more than **1000x** faster than R's standard look-up method (depending on
number of elements in original object and the number of elements to extract).

## What's in the box

* `mph <- mph_init(s, size_factor)` initialise a hash with the given set of strings
* Using a larger `size_factor` (than the default of `1`) decreases the number
of hash collisions, and can make other operations faster at the cost or
more memory being allocated.
* `mph_match(s, mph)` find the indices of the strings `s` (equivalent to R's `match()`)

## Installation

You can install from [GitHub](https://github.com/coolbutuseless/oomph) with:

``` r
# install.package('remotes')
remotes::install_github('coolbutuseless/oomph')
```

## Setup test data

```{r}
library(oomph)
N <- 500000
set.seed(1)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 500k random names
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nms <- vapply(seq(N), \(i) paste(sample(c(letters, LETTERS, 0:9), 10, T), collapse = ""), character(1))
head(nms)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# A big named vector and named list (each with 500k elements)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
big_vector <- seq(N)
big_list <- as.list(seq(N))

names(big_vector) <- nms
names(big_list ) <- nms

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Probe sets to use for testing
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
t0 <- sample(nms, 1, replace = TRUE)
t1 <- sample(nms, 10, replace = TRUE)
t2 <- sample(nms, 100, replace = TRUE)
t3 <- sample(nms, 1000, replace = TRUE)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# By default, the number of hash buckets is the same as the number of
# strings. To reduce the possibility of hash collisions (and possibly make look-ups
# faster), the number of hash buckets can be changed using the 'size_factor'
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mph <- mph_init(nms) # Allocate exactly length(nms) buckets
```

## Compare `match()` with `mph_match()`

```{r match}
bench::mark(
match(t0, nms),
mph_match(t0, mph)
)[, 1:5] |> knitr::kable()

bench::mark(
match(t1, nms),
mph_match(t1, mph)
)[, 1:5] |> knitr::kable()

bench::mark(
match(t2, nms),
mph_match(t2, mph)
)[, 1:5] |> knitr::kable()

bench::mark(
match(t3, nms),
mph_match(t3, mph)
)[, 1:5] |> knitr::kable()
```

## Vector subsetting - Extract 100 elements of a `vector` by name

```{r vector-subset}
bench::mark(
big_vector[t2],
big_vector[mph_match(t2, mph)]
)[, 1:5] |> knitr::kable()
```

## List subsetting - Extract 100 elements of a `list` by name

Also compare to using hashed named lookup in a standard R environment

```{r list-subset}
ee <- as.environment(big_list)

bench::mark(
`Standard R` = big_list[t2],
`R hashed environment` = mget(t2, ee),
`[] and mph indexing` = big_list[mph_match(t2, mph)]
)[, 1:5] |> knitr::kable()
```

## Time taken to build the hash

```{r}
set.seed(1)
chrs <- c(letters, LETTERS, 0:9)
N <- 1000
nms1k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))

N <- 10000
nms10k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))

N <- 100000
nms100k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))

bench::mark(
mph_init(nms1k),
mph_init(nms10k),
mph_init(nms100k),
check = FALSE
)[, 1:5] |> knitr::kable()

```

## Billion Row Challenge indexing

The following example is a part of the [billion row challenge](https://github.com/jrosell/1br).

In this example, we are attempting to keep a streaming tally of the 3-letter codes
which are seen.

```{r warning=FALSE}
library(oomph)
library(insitu)

nms <- expand.grid(LETTERS, LETTERS, LETTERS) |>
apply(1, paste0, collapse = "")

counts <- numeric(length(nms))
names(counts) <- nms
mph <- mph_init(nms)

set.seed(1)
random_nms <- sample(nms, 1000)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# updating in bulk
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bench::mark(
baseR = {i <- match(random_nms, nms); counts[i] <- counts[i] + 1},
oomph = {i <- mph_match(random_nms, mph); counts[i] <- counts[i] + 1},
`oomph + insitu` = {br_add(counts, 1, idx = mph_match(random_nms, mph))},
check = FALSE
)[, 1:5] |> knitr::kable()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Updating within a for loop
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bench::mark(
baseR = {
for (nm in random_nms) {
i <- match(nm, nms)
counts[i] <- counts[i] + 1
}
},
oomph = {
for (nm in random_nms) {
i <- mph_match(nm, mph)
counts[i] <- counts[i] + 1
}
},
`oomph + insitu` = {
for (nm in random_nms) {
br_add(counts, 1, idx = mph_match(nm, mph))
}
},
check = FALSE
)[, 1:5] |> knitr::kable()
```