https://github.com/coolbutuseless/oomph

Faster subsetting of vectors and lists by name
https://github.com/coolbutuseless/oomph

package r rpackage

Last synced: 9 months ago
JSON representation

Faster subsetting of vectors and lists by name

Host: GitHub
URL: https://github.com/coolbutuseless/oomph
Owner: coolbutuseless
License: other
Created: 2024-12-19T05:40:26.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-07-10T06:42:44.000Z (11 months ago)
Last Synced: 2025-07-10T10:10:09.318Z (11 months ago)
Topics: package, r, rpackage
Language: C
Homepage:
Size: 259 KB
Stars: 22
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE

Awesome Lists containing this project

README

          ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = FALSE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

library(oomph)

```

# oomph

![](https://img.shields.io/badge/cool-useless-green.svg)

[![R-CMD-check](https://github.com/coolbutuseless/oomph/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/coolbutuseless/oomph/actions/workflows/R-CMD-check.yaml)

`oomph` is a package for fast string matching within a static set of strings.

This is useful for fast named look-up in fixed lists and vectors.

Internally this uses a hash-map in C to map strings to integers.  In R, 

this appears as a minimal perfect hash where each string maps to its index, 

and unknown strings return `NA`

The hashed look-up can be more than **1000x** faster than R's standard look-up method (depending on 

number of elements in original object and the number of elements to extract).

## What's in the box

* `mph <- mph_init(s, size_factor)` initialise a hash with the given set of strings

    * Using a larger `size_factor` (than the default of `1`) decreases the number 

      of hash collisions, and can make other operations faster at the cost or

      more memory being allocated.

* `mph_match(s, mph)` find the indices of the strings `s` (equivalent to R's `match()`)

## Installation

You can install from [GitHub](https://github.com/coolbutuseless/oomph) with:

``` r

# install.package('remotes')

remotes::install_github('coolbutuseless/oomph')

```

## Setup test data

```{r}

library(oomph)

N <- 500000

set.seed(1)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# 500k random names

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

nms <- vapply(seq(N), \(i) paste(sample(c(letters, LETTERS, 0:9), 10, T), collapse = ""), character(1))

head(nms)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# A big named vector and named list (each with 500k elements)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

big_vector <- seq(N)

big_list   <- as.list(seq(N))

names(big_vector) <- nms

names(big_list  ) <- nms

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Probe sets to use for testing

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

t0  <- sample(nms,    1, replace = TRUE)

t1  <- sample(nms,   10, replace = TRUE)

t2  <- sample(nms,  100, replace = TRUE)

t3  <- sample(nms, 1000, replace = TRUE)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# By default, the number of hash buckets is the same as the number of 

# strings.  To reduce the possibility of hash collisions (and possibly make look-ups

# faster), the number of hash buckets can be changed using the 'size_factor'

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mph <- mph_init(nms) # Allocate exactly length(nms) buckets

```

## Compare `match()` with `mph_match()`

```{r match}

bench::mark(

  match(t0, nms),

  mph_match(t0, mph)

)[, 1:5] |> knitr::kable()

bench::mark(

  match(t1, nms),

  mph_match(t1, mph)

)[, 1:5] |> knitr::kable()

bench::mark(

  match(t2, nms),

  mph_match(t2, mph)

)[, 1:5] |> knitr::kable()

bench::mark(

  match(t3, nms),

  mph_match(t3, mph)

)[, 1:5] |> knitr::kable()

```

## Vector subsetting - Extract 100 elements of a `vector` by name

```{r vector-subset}

bench::mark(

  big_vector[t2],

  big_vector[mph_match(t2, mph)]

)[, 1:5] |> knitr::kable()

```

## List subsetting - Extract 100 elements of a `list` by name

Also compare to using hashed named lookup in a standard R environment

```{r list-subset}

ee <- as.environment(big_list)

bench::mark(

  `Standard R`           = big_list[t2],

  `R hashed environment` = mget(t2, ee),

  `[] and mph indexing`  = big_list[mph_match(t2, mph)]

)[, 1:5] |> knitr::kable()

```

## Time taken to build the hash

```{r}

set.seed(1)

chrs <- c(letters, LETTERS, 0:9)

N <- 1000

nms1k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))

N <- 10000

nms10k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))

N <- 100000

nms100k <- vapply(seq(N), \(i) paste(sample(chrs, 10, T), collapse = ""), character(1))

bench::mark(

  mph_init(nms1k),

  mph_init(nms10k),

  mph_init(nms100k),

  check = FALSE

)[, 1:5] |> knitr::kable()

```

## Billion Row Challenge indexing

The following example is a part of the [billion row challenge](https://github.com/jrosell/1br).

In this example, we are attempting to keep a streaming tally of the 3-letter codes

which are seen.

```{r warning=FALSE}

library(oomph)

library(insitu)

nms <- expand.grid(LETTERS, LETTERS, LETTERS) |> 

  apply(1, paste0, collapse = "")

counts <- numeric(length(nms))

names(counts) <- nms

mph <- mph_init(nms)

set.seed(1)

random_nms <- sample(nms, 1000)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# updating in bulk

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

bench::mark(

  baseR            = {i <- match(random_nms, nms); counts[i] <- counts[i] + 1},

  oomph            = {i <- mph_match(random_nms, mph); counts[i] <- counts[i] + 1},

  `oomph + insitu` = {br_add(counts, 1, idx =  mph_match(random_nms, mph))},

  check = FALSE

)[, 1:5] |> knitr::kable()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Updating within a for loop

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

bench::mark(

  baseR = {

    for (nm in random_nms) {

      i <- match(nm, nms)

      counts[i] <- counts[i] + 1

    }

  },

  oomph = {

    for (nm in random_nms) {

      i <- mph_match(nm, mph)

      counts[i] <- counts[i] + 1

    }

  },

  `oomph + insitu` = {

    for (nm in random_nms) {

      br_add(counts, 1, idx = mph_match(nm, mph))

    }

  },

  check = FALSE

)[, 1:5] |> knitr::kable()

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/coolbutuseless/oomph

Awesome Lists containing this project

README