Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/m-elbably/symspell-ex

Distributed spelling correction & fuzzy search based on symmetric delete spelling correction algorithm (SymSpell)
https://github.com/m-elbably/symspell-ex

algorithm nlp spelling-correction

Last synced: about 2 months ago
JSON representation

Distributed spelling correction & fuzzy search based on symmetric delete spelling correction algorithm (SymSpell)

Awesome Lists containing this project

README

        

# **SymSpellEx**

Spelling correction & Fuzzy search based on *Symmetric Delete Spelling Correction algorithm* (SymSpell)

> Work in progress, need more optimizations

[![Node Version](https://img.shields.io/node/v/symspell-ex.svg)](https://nodejs.org)
[![npm version](https://img.shields.io/npm/v/symspell-ex/latest.svg)](https://www.npmjs.com/package/symspell-ex)
[![Build Status](https://github.com/m-elbably/symspell-ex/workflows/symspell-ex/badge.svg)](https://github.com/m-elbably/symspell-ex/workflows/symspell-ex/badge.svg)
[![Coverage Status](https://coveralls.io/repos/github/m-elbably/symspell-ex/badge.svg?branch=main&t=pKbxsM)](https://coveralls.io/github/m-elbably/symspell-ex?branch=main)
[![License](https://img.shields.io/github/license/m-elbably/symspell-ex.svg)](https://raw.githubusercontent.com/m-elbably/symspell-ex/master/LICENSE)

## Installation
npm install symspell-ex --save

> Changes v1.1.1
> - Tokenization support
> - Term frequency should be provided for training and terms should be unique
> - Correct function return different object (`Correction` object)
> - Hash table implemented in redis store instead of normal list structure
> - Enhanced testing code and coverage
> - Fixed bugs in lookup

## Features
- ***Very fast***
- ***Word suggestions***
- ***Word correction***
- ***Multiple languages supported*** - _The algorithm, and the implementation are language independent_
- ***Extendable*** - _Edit distance and data stores can be implemented to extend library functionalities_

## Usage

### Training

For single term training you can use `add` function:
```typescript
import {SymSpellEx, MemoryStore} from 'symspell-ex';

const LANGUAGE = 'en';
// Create SymSpellEx instnce and inject store new store instance
symSpellEx = new SymSpellEx(new MemoryStore());
await symSpellEx.initialize();
// Train data
await symSpellEx.add("argument", LANGUAGE);
await symSpellEx.add("computer", LANGUAGE);
```

For multiple terms (Array) you can use `train` function:
```typescript
const terms = ['argument', 'computer'];
await symSpellEx.train(terms, 1, LANGUAGE);
```

### Searching
`search` function can be used to get multiple suggestions if available up to the `maxSuggestions` value

Arguments:
- `input` _String_ (Wrong/Invalid word we need to correct)
- `language` _String_ (Language to be used in search)
- `maxDistance` _Number_, _optional_, default = `2` (Maximum distance for suggestions)
- `maxSuggestions` _Number_, _optional_, default = `5` (Maximum suggestions number to return)

Return: `Array` Array of suggestions

Example
```typescript
await symSpellEx.search('argoments', 'en');
```
Example `Suggestion Object`:
```json
{
"term": "argoments",
"suggestion": "arguments",
"distance": 2,
"frequency": 155
}
```

### Correction
`correct` function can be used to get the best suggestion for input word or sentence in terms of edit distance and frequency

Arguments:
- `input` _String_ (Wrong/Invalid word we need to correct)
- `language` _String_ (Language to be used in search)
- `maxDistance` _Number_, _optional_, default = `2` (Maximum distance for suggestions)

Return: `Correction` object which contains original `input` and corrected `output` string, with array of suggestions

Example
```typescript
await symSpellEx.correct('Special relatvity was orignally proposed by Albert Einstein', 'en');
```

Returns this `Correction` object:
> This output is totally depending on the quality of the training data that was push into the store
```json
{
"suggestions": [],
"input": "Special relatvity was orignally proposed by Albert Einstein",
"output": "Special relativity was originally proposed by Albert Einstein"
}
```

## Computational Complexity
The algorithm has constant time O(1) time, independent of the dictionary size, but depend on the average term length and maximum edit distance,
_Hash Table_ is used to store all search entries which has an average search time complexity of O(1).

## Why the algorithm is fast?

### Pre-calculation
in training phase all possible spelling error variants as generated (deletes only) and stored in hash table

> This makes the algorithm very fast, but it also required a large memory footprint, and the training phase takes a considerable amount of time to build the dictionary first time.
> _(Using RedisStore makes it easy to train and build once, then search and correct from any external source)_

### Symmetric Delete Spelling Correction
It allows a tremendous reduction of the number of spelling error candidates to be pre-calculated (generated and added to hash table),
which then allows O(1) search while getting spelling suggestions.

## Library Design
![Lib Diagram](./res/images/diagram.svg "SymSpellEx")

### Tokenizer
This interface can be implemented to provide a different tokenizer for the library

Interface type
```typescript
export interface Tokenizer {
tokenize(input: string): Array;
}
```

### EditDistance
This interface can be implemented to provide more algorithms to use to calculate edit distance between two words
> **[Edit Distance](https://en.wikipedia.org/wiki/Edit_distance)**
> is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other

Interface type
```typescript
interface EditDistance {
name: String;
calculateDistance(source: string, target: string): number;
}
```

### DataStore
This interface can be implemented to provide additional method to store data other than built-in stores (Memory, Redis)

Data store should handle storage for these 2 data types:
- **Terms**: List data structure to store terms and retrieve it by index
- **Entries**: Hash Table data structure to store dictionary entries and retrieve data by term (_Key_)

> Data store should also handle storage for multiple languages and switch between them

Interface type
```typescript
export interface DataStore {
name: string;
initialize(): Promise;
isInitialized(): boolean;
setLanguage(language: string): Promise;
pushTerm(value: string): Promise;
getTermAt(index: number): Promise;
getTermsAt(indexes: Array): Promise>;
getEntry(key: string): Promise>;
getEntries(keys: Array): Promise>>;
setEntry(key: string, value: Array): Promise;
hasEntry(key: string): Promise;
maxEntryLength(): Promise;
clear(): Promise;
}
```

#### Built-in data stores
- Memory: Stores data in memory, using array structure for terms and high speed hash table [(megahash)](https://www.npmjs.com/package/megahash)
to manage dictionary entries
> May be limited by node process memory limits, which can be overridden

- Redis: Stores data into Redis database using list structure to store terms and hash to store dictionary data
> Very efficient way to train and store data, it will allow accessing by multiple processes and/or machines,
> also dumping and migrating data will be easy

TODO

- [x] Tokenization
- [x] Sentence correction
- [x] Bulk data training
- [ ] Word Segmentation
- [ ] Domain specific correction

#### References
* [Symmetric Delete Spelling Correction](https://wolfgarbe.medium.com/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f)
* [SymSpell](https://github.com/wolfgarbe/SymSpell)

#### License
MIT