Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/uwdata/fast-kde-benchmarks

Research archive of methods and benchmarks for fast, approximate Gaussian kernel density estimation.
https://github.com/uwdata/fast-kde-benchmarks

Last synced: 7 days ago
JSON representation

Research archive of methods and benchmarks for fast, approximate Gaussian kernel density estimation.

Awesome Lists containing this project

README

        

# Fast & Accurate Gaussian Kernel Density Estimation

Methods and benchmarks for fast, approximate Gaussian kernel density estimation. This repo is for archiving research results. _Want to use fast KDE in your own projects?_ **See [github.com/uwdata/fast-kde](https://github.com/uwdata/fast-kde).**

This repo contains code (`src/`) and benchmark (`benchmarks/`) scripts for the IEEE VIS 2021 Short Paper [Fast & Accurate Gaussian Kernel Density Estimation](https://idl.cs.washington.edu/papers/fast-kde). All code is written as ESM modules. Running benchmarks requires Node.js version 13.2 or higher.

## Citation

If you use or build on this repo in academic work, please use this citation:

```
@inproceedings{2021-Heer-FastKDE,
title = {Fast \& Accurate Gaussian Kernel Density Estimation},
author = {Jeffrey Heer},
booktitle = {IEEE VIS Short Papers},
year = {2021},
url = {http://idl.cs.washington.edu/papers/fast-kde},
}
```

## Build Instructions

To build a bundle (ESM module or minified UMD):

1. Run `yarn` to install dependencies.
2. Run `yarn build` to build the bundles.

Compiled bundles will be written to the `dist` directory.

## Benchmark Instructions

1. Run `yarn` to install dependencies.
2. Run `yarn benchmarks` to run benchmarks.
3. Wait until the benchmarks complete...

Benchmark result datasets will be written to the `results` directory as JSON files.

## API Documentation

### Density Estimate Helpers

Convenience methods for performing 1D or 2D density estimation.


#
kde.density1d()

Creates a new 1D density estimate helper with default settings.

*Example*

```js
// perform 1D estimation with bandwidth = 1 over domain [0, 10]
// use default grid size (512) and method (kdeDeriche1d)
// returns an array of [ { x, value }, ... ] points
kde.density1d()
.bandwidth(1)
.extent([0, 10])
.points([1, 2, 5, 5, 6, 9])
```

#
density1d(data)

Computes a 1D Gaussian kernel density estimate using the current settings for the given *data* and returns a Float64Array of gridded density values. To instead produce an array of objects containing coordinate values and density estimates, use [density1d.points()](#density1d_points).

#
density1d.points(data)

Computes a 1D Gaussian kernel density estimate using the current settings for the given *data* and returns an array of objects containing the grid coordinate value (`x`) and density value (`value`).

#
density1d.x([x])

Get or set the *x* coordinate getter for the input data. Defaults to the identity function, which assumes a flat array of values.

#
density1d.grid([grid])

Get or set the *grid* function for binning the input data into a one-dimensional grid, such as [grid1d_linear](#grid1d_linear) (the default) or [grid1d_simple](#grid1d_simple).

#
density1d.size([size])

Get or set the *size* (default 512) of the binned data grid.

#
density1d.bandwidth([bandwidth])

Get or set the *bandwidth* (standard deviation) of the Gaussian kernel. If set to `null` or zero (the default), the [normal reference density](#nrd) heuristic will be used to select a bandwidth automatically.

#
density1d.extent([extent])

Get or set the *extent* ([x0, x1]) of the data domain over which to perform density estimation. Any data points outside this extent will be ignored. The value defaults to [0, 1]. Note that the estimation helper does _not_ automatically select a domain based on the input data.

#
density1d.method([method])

Get or set the density estimation *method* to use. One of [kdeDeriche1d](#kdeDeriche1d) (the default), [kdeBox1d](#kdeBox1d), [kdeExtBox1d](#kdeExtBox1d), or [kdeCDF1d](#kdeCDF1d).


#
kde.density2d()

Creates a new 2D density estimate helper with default settings.

*Example*

```js
// perform 2D estimation with bandwidth = 1 over domain [[0, 10], [0, 10]]
// use default grid size ([256, 256]) and method (kdeDeriche2d)
// returns an array of [ { x, y, value }, ... ] points
kde.density2d()
.bandwidth([1, 1])
.extent([[0, 10], [0, 10]])
.points([[1, 1], [1, 2], [5, 4], [5, 3], [6, 2], [8, 7]])
```

#
density2d(data)

Computes a 2D Gaussian kernel density estimate for the given *data* using the current settings and returns a Float64Array of gridded density values. To instead produce an array of objects containing coordinate values and density estimates, use [density2d.points()](#density2d_points).

#
density2d.points(data)

Computes a 2D Gaussian kernel density estimate for the given *data* using the current settings and returns an array of objects containing the grid coordinate (`x`, `y`) and density (`value`) values.

#
density2d.x([x])

Get or set the *x* coordinate getter for the input data. Defaults to index 0 (i.e., the first element if individual data points are provided as arrays).

#
density2d.y([y])

Get or set the *y* coordinate getter for the input data. Defaults to index 1 (i.e., the second element if individual data points are provided as arrays).

#
density2d.grid([grid])

Get or set the *grid* function for binning the input data into a two dimensional grid, such as [grid2d_linear](#grid2d_linear) (the default) or [grid2d_simple](#grid2d_simple).

#
density2d.size([size])

Get or set the *size* (default [256, 256]) of the binned data grid.

#
density2d.bandwidth([bandwidth])

Get or set the *bandwidth*s (standard deviations) of the Gaussian kernels as a two element array. If an entry is set to `null` or zero (the defaults), the [normal reference density](#nrd) heuristic will be used to select a bandwidth automatically for that dimension.

#
density2d.extent([extent])

Get or set the *extent*s ([[x0, x1], [y0, y1]]) of the data domain over which to perform density estimation. Any data points outside this extent will be ignored. The value defaults to [[0, 1], [0, 1]]. Note that the estimation helper does _not_ automatically select domains based on the input data.

#
density2d.method([method])

Get or set the density estimation *method* to use. One of [kdeDeriche2d](#kdeDeriche2d) (the default), [kdeBox2d](#kdeBox2d), [kdeExtBox2d](#kdeExtBox2d), or [kdeCDF2d](#kdeCDF2d).

### 1D Density Estimation Methods


#
kde.kdeCDF1d(data, extent, size, bandwidth)

Computes a 1D Gaussian kernel density estimate for an array of numeric *data* values via direct calculation of the Gaussian cumulative distribution function. This method can be very slow in practice and is provided only for comparison and testing purposes; see [kdeDeriche1d](#kdeDeriche1d) instead. The *extent* parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The *size* parameter indicates the number of bins (grid steps) to use. The *bandwidth* parameter indicates the width (standard deviation) of the kernel function. Returns a Float64Array of gridded density estimates.


#
kde.kdeBox1d(data, extent, size, bandwidth, grid1d)

Computes a 1D Gaussian kernel density estimate for an array of numeric *data* values using the standard box filtering method. This method is not recommended for normal use, see [kdeDeriche1d](#kdeDeriche1d) instead. The *extent* parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The *size* parameter indicates the number of bins (grid steps) to use. The *bandwidth* parameter indicates the width (standard deviation) of the kernel function. The *grid1d* parameter indicates the binning method to use, such as [grid1d_linear](#grid1d_linear) or [grid1d_simple](#grid1d_simple). Returns a Float64Array of gridded density estimates.


#
kde.kdeExtBox1d(data, extent, size, bandwidth, grid1d)

Computes a 1D Gaussian kernel density estimate for an array of numeric *data* values using the extended box filtering method, which smooths away the quantization error of standard box filtering. This method is not recommended for normal use, see [kdeDeriche1d](#kdeDeriche1d) instead. The *extent* parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The *size* parameter indicates the number of bins (grid steps) to use. The *bandwidth* parameter indicates the width (standard deviation) of the kernel function. The *grid1d* parameter indicates the binning method to use, such as [grid1d_linear](#grid1d_linear) or [grid1d_simple](#grid1d_simple). Returns a Float64Array of gridded density estimates.


#
kde.kdeDeriche1D(data, extent, size, bandwidth, grid1d)

Computes a 1D Gaussian kernel density estimate for an array of numeric *data* values using Deriche's recursive filter approximation. This method is recommended for normal use. The *extent* parameter is an [x0, x1] array indicating the data domain over which to compute the estimates. Any data points lying outside this domain will be ignored. The *size* parameter indicates the number of bins (grid steps) to use. The *bandwidth* parameter indicates the width (standard deviation) of the kernel function. The *grid1d* parameter indicates the binning method to use, such as [grid1d_linear](#grid1d_linear) or [grid1d_simple](#grid1d_simple). Returns a Float64Array of gridded density estimates.

### 2D Density Estimation Methods


#
kde.kdeCDF2d(data, extent, size, bandwidth)

Computes a 2D Gaussian kernel density estimate for two arrays of numeric *data* values via direct calculation of the Gaussian cumulative distribution function. This method can be very slow in practice and is provided only for comparison and testing purposes; see [kdeDeriche2d](#kdeDeriche2d) instead. The input *data* should be an array with two entries: a numeric x-value array and a numeric y-value array. The *extent* parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The *size* parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The *bandwidth* parameter should be a [bandwidthX, bandwidthY] array. Returns a Float64Array of gridded density estimates.


#
kde.kdeBox2d(data, extent, size, bandwidth, grid2d)

Computes a 2D Gaussian kernel density estimate for two arrays of numeric *data* values using the standard box filtering method. This method is not recommended for normal use, see [kdeDeriche2d](#kdeDeriche2d) instead. The input *data* should be an array with two entries: a numeric x-value array and a numeric y-value array. The *extent* parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The *size* parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The *bandwidth* parameter should be a [bandwidthX, bandwidthY] array. The *grid2d* parameter indicates the binning method to use, such as [grid2d_linear](#grid2d_linear) or [grid2d_simple](#grid2d_simple). Returns a Float64Array of gridded density estimates.


#
kde.kdeExtBox2d(data, extent, size, bandwidth, grid2d)

Computes a 2D Gaussian kernel density estimate for two arrays of numeric *data* values using the extended box filtering method, which smooths away the quantization error of standard box filtering. This method is not recommended for normal use, see [kdeDeriche2d](#kdeDeriche2d) instead. The input *data* should be an array with two entries: a numeric x-value array and a numeric y-value array. The *extent* parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The *size* parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The *bandwidth* parameter should be a [bandwidthX, bandwidthY] array. The *grid2d* parameter indicates the binning method to use, such as [grid2d_linear](#grid2d_linear) or [grid2d_simple](#grid2d_simple). Returns a Float64Array of gridded density estimates.


#
kde.kdeDeriche1D(data, extent, size, bandwidth, grid2d)

Computes a 2D Gaussian kernel density estimate for two arrays of numeric *data* values using Deriche's recursive filter approximation. This method is recommended for normal use. The input *data* should be an array with two entries: a numeric x-value array and a numeric y-value array. The *extent* parameter should be an [[x0, x1], [y0, y1]] array of data domain extents. The *size* parameter should be a [sizeX, sizeY] array indicating the grid dimensions. The *bandwidth* parameters should be a [bandwidthX, bandwidthY] array. The *grid2d* parameter indicates the binning method to use, such as [grid2d_linear](#grid2d_linear) or [grid2d_simple](#grid2d_simple). Returns a Float64Array of gridded density estimates.

### Utility Methods

Utility methods for grid construction and bandwidth suggestion.


#
kde.grid1d_linear(data, size, init, scale[, offset])

Bins a set of numeric *data* values into a 1D output grid of size *size*, using linear interpolation of point mass across adjacent bins. The *init* value indicates the minimum grid value for the data domain. The *scale* parameter is a numeric value for scaling data values to a [0, 1] domain. The optional *offset* parameter (default 0) indicates the number of bins by which to offset the start of the data domain, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.


#
kde.grid1d_linear(data, size, init, scale[, offset])

Bins a set of numeric *data* values into a 1D output grid of size *size*, where the full weight of a point is assigned to the nearest enclosing bin. The *init* value indicates the minimum grid value for the data domain. The *scale* parameter is a numeric value for scaling data values to a [0, 1] domain. The optional *offset* parameter (default 0) indicates the number of bins by which to offset the start of the data domain, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.


#
kde.grid2d_linear(data, size, init, scale, offset)

Bins a set of numeric *data* values into a 2D output grid of size *size*, using linear interpolation of point mass across adjacent bins. The input *data* should be an array with two entries: a numeric x-value array and a numeric y-value array. The *size* parameter sets the grid dimensions and should be an [sizeX, sizeY] array containing two integers. The *init* parameter ([initX, initY]) indicates the minimum grid values for the data domains. The *scale* parameter ([scaleX, scaleY]) provides numeric values for scaling data values to a [0, 1] domain. The *offset* parameter ([offsetX, offsetY]) indicates the number of bins by which to offset the start of the data domains, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.


#
kde.grid2d_simple(data, size, init, scale, offset)

Bins a set of numeric *data* values into a 2D output grid of size *size*, where the full weight of a point is assigned to the nearest enclosing bin. The input *data* should be an array with two entries: a numeric x-value array and a numeric y-value array. The *size* parameter sets the grid dimensions and should be an [sizeX, sizeY] array containing two integers. The *init* parameter ([initX, initY]) indicates the minimum grid values for the data domains. The *scale* parameter ([scaleX, scaleY]) provides numeric values for scaling data values to a [0, 1] domain. The *offset* parameter ([offsetX, offsetY]) indicates the number of bins by which to offset the start of the data domains, in the case of padded grids with extra spacing. Returns a Float64Array of gridded values.


#
kde.nrd(data)

Calculates a suggested bandwidth for an array of numeric *data* values, using Scott's normal reference distribution (NRD) heuristic.