https://github.com/coolbutuseless/oomph
Faster subsetting of vectors and lists by name
https://github.com/coolbutuseless/oomph
package r rpackage
Last synced: 9 months ago
JSON representation
Faster subsetting of vectors and lists by name
- Host: GitHub
- URL: https://github.com/coolbutuseless/oomph
- Owner: coolbutuseless
- License: other
- Created: 2024-12-19T05:40:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-10T06:42:44.000Z (11 months ago)
- Last Synced: 2025-07-10T10:10:09.318Z (11 months ago)
- Topics: package, r, rpackage
- Language: C
- Homepage:
- Size: 259 KB
- Stars: 22
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
Awesome Lists containing this project
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = FALSE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
library(oomph)
```
# oomph

[](https://github.com/coolbutuseless/oomph/actions/workflows/R-CMD-check.yaml)
`oomph` is a package for fast string matching within a static set of strings.
This is useful for fast named look-up in fixed lists and vectors.
Internally this uses a hash-map in C to map strings to integers. In R,
this appears as a minimal perfect hash where each string maps to its index,
and unknown strings return `NA`
The hashed look-up can be more than **1000x** faster than R's standard look-up method (depending on
number of elements in original object and the number of elements to extract).
## What's in the box
* `mph <- mph_init(s, size_factor)` initialise a hash with the given set of strings
* Using a larger `size_factor` (than the default of `1`) decreases the number
of hash collisions, and can make other operations faster at the cost or
more memory being allocated.
* `mph_match(s, mph)` find the indices of the strings `s` (equivalent to R's `match()`)
## Installation
You can install from [GitHub](https://github.com/coolbutuseless/oomph) with:
``` r
# install.package('remotes')
remotes::install_github('coolbutuseless/oomph')
```
## Setup test data
```{r}
library(oomph)
N <- 500000
set.seed(1)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 500k random names
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nms <- vapply(seq(N), \(i) paste(sample(c(letters, LETTERS, 0:9), 10, T), collapse = ""), character(1))
head(nms)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# A big named vector and named list (each with 500k elements)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
big_vector <- seq(N)
big_list <- as.list(seq(N))
names(big_vector) <- nms
names(big_list ) <- nms
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Probe sets to use for testing
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
t0 <- sample(nms, 1, replace = TRUE)
t1 <- sample(nms, 10, replace = TRUE)
t2 <- sample(nms, 100, replace = TRUE)
t3 <- sample(nms, 1000, replace = TRUE)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# By default, the number of hash buckets is the same as the number of
# strings. To reduce the possibility of hash collisions (and possibly make look-ups
# faster), the number of hash buckets can be changed using the 'size_factor'
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mph <- mph_init(nms) # Allocate exactly length(nms) buckets
```
## Compare `match()` with `mph_match()`
```{r match}
bench::mark(
match(t0, nms),
mph_match(t0, mph)
)[, 1:5] |> knitr::kable()
bench::mark(
match(t1, nms),
mph_match(t1, mph)
)[, 1:5] |> knitr::kable()
bench::mark(
match(t2, nms),
mph_match(t2, mph)
)[, 1:5] |> knitr::kable()
bench::mark(
match(t3, nms),
mph_match(t3, mph)
)[, 1:5] |> knitr::kable()
```
## Vector subsetting - Extract 100 elements of a `vector` by name
```{r vector-subset}
bench::mark(
big_vector[t2],
big_vector[mph_match(t2, mph)]
)[, 1:5] |> knitr::kable()
```
## List subsetting - Extract 100 elements of a `list` by name
Also compare to using hashed named lookup in a standard R environment
```{r list-subset}
ee <- as.environment(big_list)
bench::mark(
`Standard R` = big_list[t2],
`R hashed environment` = mget(t2, ee),
`[] and mph indexing` = big_list[mph_match(t2, mph)]
)[, 1:5] |> knitr::kable()
```
## Time taken to build the hash
```{r}
set.seed(1)
chrs <- c(letters, LETTERS, 0:9)
N <- 1000
nms1k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))
N <- 10000
nms10k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))
N <- 100000
nms100k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))
bench::mark(
mph_init(nms1k),
mph_init(nms10k),
mph_init(nms100k),
check = FALSE
)[, 1:5] |> knitr::kable()
```
## Billion Row Challenge indexing
The following example is a part of the [billion row challenge](https://github.com/jrosell/1br).
In this example, we are attempting to keep a streaming tally of the 3-letter codes
which are seen.
```{r warning=FALSE}
library(oomph)
library(insitu)
nms <- expand.grid(LETTERS, LETTERS, LETTERS) |>
apply(1, paste0, collapse = "")
counts <- numeric(length(nms))
names(counts) <- nms
mph <- mph_init(nms)
set.seed(1)
random_nms <- sample(nms, 1000)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# updating in bulk
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bench::mark(
baseR = {i <- match(random_nms, nms); counts[i] <- counts[i] + 1},
oomph = {i <- mph_match(random_nms, mph); counts[i] <- counts[i] + 1},
`oomph + insitu` = {br_add(counts, 1, idx = mph_match(random_nms, mph))},
check = FALSE
)[, 1:5] |> knitr::kable()
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Updating within a for loop
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bench::mark(
baseR = {
for (nm in random_nms) {
i <- match(nm, nms)
counts[i] <- counts[i] + 1
}
},
oomph = {
for (nm in random_nms) {
i <- mph_match(nm, mph)
counts[i] <- counts[i] + 1
}
},
`oomph + insitu` = {
for (nm in random_nms) {
br_add(counts, 1, idx = mph_match(nm, mph))
}
},
check = FALSE
)[, 1:5] |> knitr::kable()
```