https://github.com/peterthehan/group-similar

Group similar items together.

Host: GitHub
URL: https://github.com/peterthehan/group-similar
Owner: peterthehan
License: mit
Created: 2021-09-25T10:04:31.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-10-19T11:17:38.000Z (over 1 year ago)
Last Synced: 2025-02-02T05:04:41.832Z (5 months ago)
Topics: algorithm, cluster, compare, comparison, disjoint-set, edit-distance, fuzzy, group, group-similar, grouping, levenshtein, match, matching, merge-find, similar, similarity, string, union-find
Language: TypeScript
Homepage: https://www.npmjs.com/package/group-similar
Size: 1.34 MB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

# Group Similar

[![Discord](https://discord.com/api/guilds/258167954913361930/embed.png)](https://discord.gg/WjEFnzC) [![Twitter Follow](https://img.shields.io/twitter/follow/peterthehan.svg?style=social)](https://twitter.com/peterthehan)

Group similar items together.

Runtime complexity is `O(N^2 * (M + α(N)))`, where `N` is the number of elements in `items`, `M` is the runtime complexity of the `similarityFunction`, and `α(N)` is the [inverse Ackermann function](https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Time_complexity) (amortized constant time for all practical purposes).

Space complexity is `O(N)`.
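
To see where these bounds come from, here is a minimal sketch of pairwise grouping on top of a disjoint-set (union-find) structure. It is an illustration under assumptions, not the library's source: `groupBySimilarity`, `find`, and `union` are hypothetical names, and the group order it produces is simply first-seen order.

```ts
// Hypothetical sketch; not the package's actual implementation.
function groupBySimilarity<T, K>(
  items: T[],
  mapper: (t: T) => K,
  similarityFunction: (a: K, b: K) => number,
  similarityThreshold: number
): T[][] {
  const keys = items.map(mapper);
  // parent[i] points towards the representative of i's set; size[] is used for
  // union by size. Both arrays are O(N).
  const parent = items.map((_, i) => i);
  const size = items.map(() => 1);

  // Find with path halving.
  const find = (i: number): number => {
    let x = i;
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  };

  // Union by size: attach the smaller tree under the larger one.
  const union = (a: number, b: number): void => {
    let ra = find(a);
    let rb = find(b);
    if (ra === rb) return;
    if (size[ra] < size[rb]) [ra, rb] = [rb, ra];
    parent[rb] = ra;
    size[ra] += size[rb];
  };

  // N(N-1)/2 comparisons, each costing one similarity call plus one union.
  for (let i = 0; i < keys.length; i++) {
    for (let j = i + 1; j < keys.length; j++) {
      if (similarityFunction(keys[i], keys[j]) >= similarityThreshold) {
        union(i, j);
      }
    }
  }

  // Collect items by representative, in first-seen order.
  const groups = new Map<number, T[]>();
  items.forEach((item, i) => {
    const root = find(i);
    const group = groups.get(root);
    if (group) group.push(item);
    else groups.set(root, [item]);
  });
  return Array.from(groups.values());
}

// 1~2 and 2~3 meet the threshold, 1~3 does not, yet all three share a group.
const closeness = (a: number, b: number): number => 1 / (1 + Math.abs(a - b));
console.log(groupBySimilarity([1, 2, 3, 10], (i) => i, closeness, 0.5));
// [ [ 1, 2, 3 ], [ 10 ] ]
```

In this sketch the nested loop accounts for the `N^2 * M` term, each `union` contributes the `α(N)` term, and the two index arrays account for the `O(N)` space. Note that merging sets this way makes grouping transitive: two items can land in the same group through a chain of similar neighbours even when they are not directly similar to each other.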

## Getting started

```
npm i group-similar
```

## Examples

Group similar strings

```ts
import { groupSimilar } from "group-similar";
import { distance } from "fastest-levenshtein";

// Normalized Levenshtein similarity in [0, 1]; two empty strings count as identical.
function levenshteinSimilarityFunction(a: string, b: string): number {
  return a.length === 0 && b.length === 0
    ? 1
    : 1 - distance(a, b) / Math.max(a.length, b.length);
}

groupSimilar({
  items: ["cat", "bat", "kitten", "dog", "sitting"],
  mapper: (i) => i, // items are already strings, so each maps to itself
  similarityFunction: levenshteinSimilarityFunction,
  similarityThreshold: 0.5,
});
// [ [ 'cat', 'bat' ], [ 'kitten', 'sitting' ], [ 'dog' ] ]
```
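
To make the grouping in the strings example concrete, here is a small self-contained check of the similarity values involved. `editDistance` is a hypothetical stand-in for `distance` from `fastest-levenshtein`, written out so the snippet runs without dependencies; `similarity` repeats the formula used in the example.

```ts
// Hypothetical stand-in for `distance` from "fastest-levenshtein":
// classic two-row dynamic-programming Levenshtein distance.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Same normalization as levenshteinSimilarityFunction in the example.
function similarity(a: string, b: string): number {
  return a.length === 0 && b.length === 0
    ? 1
    : 1 - editDistance(a, b) / Math.max(a.length, b.length);
}

console.log(similarity("cat", "bat")); // 1 - 1/3, about 0.667: meets the 0.5 threshold
console.log(similarity("kitten", "sitting")); // 1 - 3/7, about 0.571: meets the 0.5 threshold
console.log(similarity("cat", "dog")); // 1 - 3/3 = 0: below the 0.5 threshold
```

Raising `similarityThreshold` to, say, `0.6` would keep `cat` and `bat` together under this formula but would no longer pair `kitten` with `sitting`.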

Group similar numbers

```ts
import { groupSimilar } from "group-similar";

// Returns 1 when both numbers have the same parity, otherwise 0
// (the inputs here are non-negative integers).
function evenOddSimilarityFunction(a: number, b: number): number {
  return Number(a % 2 === b % 2);
}

groupSimilar({
  items: [1, 5, 10, 0, 2, 123],
  mapper: (i) => i,
  similarityFunction: evenOddSimilarityFunction,
  similarityThreshold: 1,
});
// [ [ 1, 5, 123 ], [ 10, 0, 2 ] ]
```

Group similar objects

```ts
import { groupSimilar } from "group-similar";
import { distance } from "fastest-levenshtein";

// Extracts the nested string that similarity is measured on.
function nestedMapper(object: { a: { b: { value: string } } }): string {
  return object.a.b.value;
}

// Normalized Levenshtein similarity in [0, 1]; two empty strings count as identical.
function levenshteinSimilarityFunction(a: string, b: string): number {
  return a.length === 0 && b.length === 0
    ? 1
    : 1 - distance(a, b) / Math.max(a.length, b.length);
}

groupSimilar({
  items: [
    { a: { b: { value: "sitting" } } },
    { a: { b: { value: "dog" } } },
    { a: { b: { value: "kitten" } } },
    { a: { b: { value: "bat" } } },
    { a: { b: { value: "cat" } } },
  ],
  mapper: nestedMapper,
  similarityFunction: levenshteinSimilarityFunction,
  similarityThreshold: 0.5,
});
// [
//   [{ a: { b: { value: "sitting" } } }, { a: { b: { value: "kitten" } } }],
//   [{ a: { b: { value: "dog" } } }],
//   [{ a: { b: { value: "bat" } } }, { a: { b: { value: "cat" } } }],
// ]
```

## Syntax

```ts
groupSimilar(options);
groupSimilar({ items, mapper, similarityFunction, similarityThreshold });
```

### Parameters

| Parameter           | Type     | Required | Default | Description                       |
| ------------------- | -------- | -------- | ------- | --------------------------------- |
| [options](#options) | `Object` | Yes      | _none_  | Arguments to pass to the function |

#### Options

| Property            | Type                     | Required | Default | Description                                                                                 |
| ------------------- | ------------------------ | -------- | ------- | ------------------------------------------------------------------------------------------- |
| items               | `T[]`                    | Yes      | _none_  | Array of items to group                                                                     |
| mapper              | `(t: T) => K`            | Yes      | _none_  | Function to apply to each element in `items` prior to measuring similarity                  |
| similarityFunction  | `(a: K, b: K) => number` | Yes      | _none_  | Function to measure similarity between mapped items                                         |
| similarityThreshold | `number`                 | Yes      | _none_  | Minimum similarity; items whose similarity value is greater than or equal to it are grouped |
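
Read together, the rows of the options table describe a generic options object. A sketch of that shape in TypeScript follows; the interface name `GroupSimilarOptions` and the `Pet` type are hypothetical, and the package's own exported type names may differ.

```ts
// Hypothetical name; mirrors the options table rather than the package's own typings.
interface GroupSimilarOptions<T, K> {
  items: T[];
  mapper: (t: T) => K;
  similarityFunction: (a: K, b: K) => number;
  similarityThreshold: number;
}

// Hypothetical item type used only for this illustration.
type Pet = { name: string };

const options: GroupSimilarOptions<Pet, string> = {
  items: [{ name: "Cat" }, { name: "Bat" }],
  mapper: (pet) => pet.name.toLowerCase(),
  similarityFunction: (a, b) => (a === b ? 1 : 0),
  similarityThreshold: 1,
};

// The similarity function receives mapped values (K), not the raw items (T).
const keys = options.items.map(options.mapper);
console.log(keys); // [ 'cat', 'bat' ]
```

Keeping `T` and `K` separate is what lets the items be arbitrary objects while the similarity function stays a plain string or number comparison.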

### Return value

The **return value** is a new nested array of type `T[][]` containing elements of `items` grouped by similarity. If there are no elements in `items`, an empty array will be returned.

## Benchmark

Benchmark results in `ops/sec` (higher is better), where `N` is the number of items being grouped.

| Library                                                        | N=16  | N=32  | N=64 | N=128 | N=256 | N=512 | N=1024 | N=2048 |
| -------------------------------------------------------------- | ----- | ----- | ---- | ----- | ----- | ----- | ------ | ------ |
| [group-similar](https://www.npmjs.com/package/group-similar)   | 86867 | 17538 | 6067 | 1594  | 444   | 171   | 75     | 27     |
| [set-clustering](https://www.npmjs.com/package/set-clustering) | 28506 | 6258  | 1831 | 455   | 121   | 30    | 6      | 1      |

Benchmark configuration details can be found [here](./test/benchmark.ts).
