Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/mkearney/funique

⌚️ A faster unique() function
https://github.com/mkearney/funique

data-frame data-wrangling date-time duplicates mkearney-r-package posix posixct r r-package rstats unique

Last synced: 3 months ago
JSON representation

⌚️ A faster unique() function

Host: GitHub
URL: https://github.com/mkearney/funique
Owner: mkearney
License: other
Created: 2018-05-15T19:52:13.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2018-12-28T18:37:34.000Z (about 6 years ago)
Last Synced: 2024-08-06T03:04:52.119Z (7 months ago)
Topics: data-frame, data-wrangling, date-time, duplicates, mkearney-r-package, posix, posixct, r, r-package, rstats, unique
Language: R
Homepage: https://funique.mikewk.com
Size: 7.15 MB
Stars: 19
Watchers: 4
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

README

        ---

output: github_document

---

```{r setup, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-"

)

drop_hl <- function(x, n = 1) {

  x <- tibble::as_tibble(x, validate = FALSE)

  x <- dplyr::arrange(x, expr, time)

  tot <- sum(x$expr == x$expr[1])

  g <- length(unique(x$expr))

  s <- (1 + n):(tot - n)

  s <- unlist(Map("+", list(s), c(0, cumsum(rep(tot, g - 1)))))

  structure(x[s, ], class = c("hlmb", "tbl_df", "tbl", "data.frame"))

}

plot.hlmb <- function(x) {

  x$time <- x$time / 1000

  min <- 0

  max <- max(x$time, na.rm = TRUE) * 1.05

  x$expr <- as.character(x$expr)

  ggplot2::ggplot(x, ggplot2::aes(x = expr, y = time, fill = expr)) +

    ggplot2::geom_boxplot(outlier.shape = NA, alpha = .6) +

    ggplot2::geom_jitter(shape = 21, size = ggplot2::rel(3), alpha = .6) +

    ggplot2::theme_minimal(base_size = 11, base_family = "Roboto Condensed") +

    ggplot2::theme(legend.position = "none",

      text = ggplot2::element_text(colour = "#444444"),

      axis.title = ggplot2::element_text(size = ggplot2::rel(1.0),

        hjust = 0.95, face = "italic", colour = "black"),

      axis.text.x = ggplot2::element_text(size = ggplot2::rel(0.9),

        colour = "black"),

      axis.text.y = ggplot2::element_text(size = ggplot2::rel(1.1),

        colour = "black", angle = 90, hjust = .5),

      plot.title = ggplot2::element_text(size = ggplot2::rel(1.4),

        colour = "black", face = "bold"),

      plot.subtitle = ggplot2::element_text(size = ggplot2::rel(1.1),

        colour = "black"),

      plot.caption = ggplot2::element_text(hjust = 0, size = ggplot2::rel(.95)),

      panel.grid.minor.x = ggplot2::element_blank(),

      panel.grid.major.x = ggplot2::element_line(linetype = "dashed"),

      panel.grid.major.y = ggplot2::element_line(linetype = "dashed"),

      axis.line.x = ggplot2::element_line(colour = "#44444422")) +

    ggplot2::labs(y = "Time (microseconds)", x = "Expression",

      title = "Benchmarking expression evaluation times",

      subtitle = "Boxplots overlayed with jittered replication times",

      caption = "Estimates from the {microbenchmark} pkg") +

    ggplot2::scale_y_continuous(limits = c(min, max)) +

    ggplot2::coord_flip() + 

    ggplot2::scale_fill_manual(values = c("greenyellow", "gray"))

}

library(funique)

```

# funique 

[![Travis build status](https://travis-ci.org/mkearney/funique.svg?branch=master)](https://travis-ci.org/mkearney/funique)

[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

> ⌚️ A faster `unique()` function

## Installation

You can install the released version of funique from Github with:

```{r, eval = FALSE}

## install remotes pkg if not already

if (!requireNamespace("remotes", quietly = TRUE)) {

  install.packages("remotes")

}

## install funique from github

remotes::install_github("mkearney/funique")

```

## Usage

There's one function `funique()`, which is the same as `base::unique()` only optimized to be faster when data contain date-time variables.

## Speed test: `funique()` vs. `base::unique()`

The code below creates a data frame with several duplicate rows and then compares performance (in time) of `funique()` versus `base::unique()`.

```{r ex1, fig.keep = "none", eval = FALSE}

## set seed

set.seed(20180812)

## generate data

d <- data.frame(

  x = rnorm(1000),

  y = seq.POSIXt(as.POSIXct("2018-01-01"),

    as.POSIXct("2018-12-31"), length.out = 10))

## create data frame with duplicate rows

d <- d[c(1:1000, sample(1:1000, 500, replace = TRUE)), ]

row.names(d) <- NULL

## check the output against base::unique

identical(unique(d), funique(d))

## bench mark

(m <- microbenchmark::microbenchmark(unique(d), funique(d), 

  times = 200, unit = "relative"))

## plot

plot(drop_hl(m, n = 4)) + 

  ggplot2::ggsave("man/figures/r1.png", width = 8, height = 4.5, units = "in")

```

 

Here's another test this time using duplicate-infested Twitter data.

```{r ex2, fig.keep = "none", eval = FALSE}

## search for data on 100 tweets

rt <- rtweet::search_tweets("lang:en", verbose = FALSE)

## create duplicates

rt2 <- rt[sample(1:nrow(rt), 1000, replace = TRUE), ]

## benchmarks

(mb <- microbenchmark::microbenchmark(

  unique(rt2), funique(rt2), unit = "relative"))

## make sure the output is the same

identical(unique(rt2), funique(rt2))

## plot

plot(drop_hl(mb, n = 4)) + 

  ggplot2::ggsave("man/figures/r2.png", width = 8, height = 4.5, units = "in")

```