https://github.com/peterthehan/group-similar
Group similar items together.
- Host: GitHub
- URL: https://github.com/peterthehan/group-similar
- Owner: peterthehan
- License: mit
- Created: 2021-09-25T10:04:31.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-10-19T11:17:38.000Z (over 1 year ago)
- Last Synced: 2024-09-12T22:23:50.103Z (4 months ago)
- Topics: algorithm, cluster, compare, comparison, disjoint-set, edit-distance, fuzzy, group, group-similar, grouping, levenshtein, match, matching, merge-find, similar, similarity, string, union-find
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/group-similar
- Size: 1.34 MB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# Group Similar
[![Discord](https://discord.com/api/guilds/258167954913361930/embed.png)](https://discord.gg/WjEFnzC) [![Twitter Follow](https://img.shields.io/twitter/follow/peterthehan.svg?style=social)](https://twitter.com/peterthehan)
Group similar items together.
Runtime complexity is `O(N^2 * (M + α(N)))`, where `N` is the number of elements in `items`, `M` is the runtime complexity of the `similarityFunction`, and `α(N)` is the [inverse Ackermann function](https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Time_complexity) (amortized constant time for all practical purposes).
Space complexity is `O(N)`.
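The bounds above are what a pairwise comparison loop over a disjoint-set (union-find) structure would give. The following is a minimal sketch of that idea, not the package's actual source: `groupSimilarSketch` and `GroupOptions` are hypothetical names, and the ordering of the returned groups is a property of this sketch only.

```ts
// Hypothetical sketch of pairwise grouping with union-find.
// This is NOT the source of the group-similar package.
interface GroupOptions<T, K> {
  items: T[];
  mapper: (t: T) => K;
  similarityFunction: (a: K, b: K) => number;
  similarityThreshold: number;
}

function groupSimilarSketch<T, K>(options: GroupOptions<T, K>): T[][] {
  const { items, mapper, similarityFunction, similarityThreshold } = options;
  const mapped = items.map(mapper);

  // Disjoint-set forest: O(N) extra space.
  const parent = items.map((_, i) => i);
  const size = items.map(() => 1);

  // Find the root of i, halving the path along the way.
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  };

  // Union by size; together with path halving this gives the α(N) bound.
  const union = (i: number, j: number): void => {
    let rootI = find(i);
    let rootJ = find(j);
    if (rootI === rootJ) return;
    if (size[rootI] < size[rootJ]) [rootI, rootJ] = [rootJ, rootI];
    parent[rootJ] = rootI;
    size[rootI] += size[rootJ];
  };

  // N * (N - 1) / 2 pairs, each costing one similarity call (M) plus a union.
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (similarityFunction(mapped[i], mapped[j]) >= similarityThreshold) {
        union(i, j);
      }
    }
  }

  // Collect the original items by the root of their set.
  const groups = new Map<number, T[]>();
  items.forEach((item, i) => {
    const root = find(i);
    const group = groups.get(root);
    if (group) group.push(item);
    else groups.set(root, [item]);
  });
  return Array.from(groups.values());
}

// Usage: group numbers by parity.
const parityGroups = groupSimilarSketch({
  items: [1, 5, 10, 0, 2, 123],
  mapper: (i: number) => i,
  similarityFunction: (a: number, b: number) => Number(a % 2 === b % 2),
  similarityThreshold: 1,
});
// parityGroups is [[1, 5, 123], [10, 0, 2]]
```

Because sets are merged rather than compared group by group, two items can end up together through a chain of similar neighbours even when their own pairwise score is below the threshold.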
## Getting started
```
npm i group-similar
```

## Examples
Group similar strings
```ts
import { groupSimilar } from "group-similar";
import { distance } from "fastest-levenshtein";

function levenshteinSimilarityFunction(a: string, b: string): number {
  return a.length === 0 && b.length === 0
    ? 1
    : 1 - distance(a, b) / Math.max(a.length, b.length);
}

groupSimilar({
  items: ["cat", "bat", "kitten", "dog", "sitting"],
  mapper: (i) => i,
  similarityFunction: levenshteinSimilarityFunction,
  similarityThreshold: 0.5,
});

// [ [ 'cat', 'bat' ], [ 'kitten', 'sitting' ], [ 'dog' ] ]
```

Group similar numbers
```ts
import { groupSimilar } from "group-similar";

function evenOddSimilarityFunction(a: number, b: number): number {
  return Number(a % 2 === b % 2);
}

groupSimilar({
  items: [1, 5, 10, 0, 2, 123],
  mapper: (i) => i,
  similarityFunction: evenOddSimilarityFunction,
  similarityThreshold: 1,
});

// [ [ 1, 5, 123 ], [ 10, 0, 2 ] ]
```

Group similar objects
```ts
import { groupSimilar } from "group-similar";
import { distance } from "fastest-levenshtein";

function nestedMapper(object: { a: { b: { value: string } } }): string {
  return object.a.b.value;
}

function levenshteinSimilarityFunction(a: string, b: string): number {
  return a.length === 0 && b.length === 0
    ? 1
    : 1 - distance(a, b) / Math.max(a.length, b.length);
}

groupSimilar({
  items: [
    { a: { b: { value: "sitting" } } },
    { a: { b: { value: "dog" } } },
    { a: { b: { value: "kitten" } } },
    { a: { b: { value: "bat" } } },
    { a: { b: { value: "cat" } } },
  ],
  mapper: nestedMapper,
  similarityFunction: levenshteinSimilarityFunction,
  similarityThreshold: 0.5,
});

// [
//   [{ a: { b: { value: "sitting" } } }, { a: { b: { value: "kitten" } } }],
//   [{ a: { b: { value: "dog" } } }],
//   [{ a: { b: { value: "bat" } } }, { a: { b: { value: "cat" } } }],
// ]
```

## Syntax
```ts
groupSimilar(options);
groupSimilar({ items, mapper, similarityFunction, similarityThreshold });
```

### Parameters
| Parameter | Type | Required | Default | Description |
| ------------------- | -------- | -------- | ------- | --------------------------------- |
| [options](#options) | `Object` | Yes | _none_ | Arguments to pass to the function |

#### Options
| Property | Type | Required | Default | Description |
| ------------------- | ------------------------ | -------- | ------- | --------------------------------------------------------------------------------------------------- |
| items | `T[]` | Yes | _none_ | Array of items to group |
| mapper | `(t: T) => K` | Yes | _none_ | Function to apply to each element in items prior to measuring similarity |
| similarityFunction | `(a: K, b: K) => number` | Yes | _none_ | Function to measure similarity between mapped items |
| similarityThreshold | `number` | Yes | _none_ | Threshold at which items whose similarity value is greater than or equal to it are grouped together |

### Return value
The **return value** is a new nested array of type `T[][]` containing elements of `items` grouped by similarity. If there are no elements in `items`, an empty array will be returned.
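Which items land in the same inner array depends on how each pairwise score compares with `similarityThreshold`. As a rough, dependency-free illustration, the sketch below computes those scores for a few pairs from the string example. The helpers `levenshteinDistance` and `normalizedSimilarity` are hypothetical stand-ins for `fastest-levenshtein`'s `distance` and the similarity function used in the examples; they are assumptions made for this sketch, not part of the package.

```ts
// Plain dynamic-programming Levenshtein distance (two-row variant).
// Hypothetical stand-in for `distance` from fastest-levenshtein.
function levenshteinDistance(a: string, b: string): number {
  let previous: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Same formula as the similarity function in the examples:
// 1 means identical, 0 means nothing in common.
function normalizedSimilarity(a: string, b: string): number {
  return a.length === 0 && b.length === 0
    ? 1
    : 1 - levenshteinDistance(a, b) / Math.max(a.length, b.length);
}

const threshold = 0.5;
const pairs: [string, string][] = [
  ["cat", "bat"],
  ["kitten", "sitting"],
  ["cat", "dog"],
];

// A pair is eligible for grouping when its score is >= the threshold.
const scores = pairs.map(([a, b]) => {
  const score = normalizedSimilarity(a, b);
  return { a, b, score, meetsThreshold: score >= threshold };
});

for (const { a, b, score, meetsThreshold } of scores) {
  console.log(`${a} / ${b}: ${score.toFixed(3)} (${meetsThreshold})`);
}
```

With a threshold of `0.5`, "cat"/"bat" (one edit over three characters) and "kitten"/"sitting" (three edits over seven characters) meet the threshold, while "cat"/"dog" scores `0` and does not.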
## Benchmark
Benchmark results in `ops/sec` (higher is better), where `N` is the number of items being grouped.
| Library | N=16 | N=32 | N=64 | N=128 | N=256 | N=512 | N=1024 | N=2048 |
| -------------------------------------------------------------- | ----- | ----- | ---- | ----- | ----- | ----- | ------ | ------ |
| [group-similar](https://www.npmjs.com/package/group-similar) | 86867 | 17538 | 6067 | 1594 | 444 | 171 | 75 | 27 |
| [set-clustering](https://www.npmjs.com/package/set-clustering) | 28506 | 6258 | 1831 | 455 | 121 | 30 | 6 | 1 |

Benchmark configuration details can be found [here](./test/benchmark.ts).