https://github.com/antononcube/raku-math-distancefunctions-edit
Raku package of fast Damerau-Levenshtein distance functions based on C code via NativeCall.
- Host: GitHub
- URL: https://github.com/antononcube/raku-math-distancefunctions-edit
- Owner: antononcube
- License: artistic-2.0
- Created: 2024-08-04T15:54:53.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-09-11T16:00:08.000Z (5 months ago)
- Last Synced: 2024-12-15T14:16:20.816Z (about 2 months ago)
- Topics: damerau-levenshtein, damerau-levenshtein-distance, distance-function, edit-distance, raku, rakulang
- Language: Raku
- Homepage: https://raku.land/zef:antononcube/Math::DistanceFunctions::Edit
- Size: 50.8 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README-work.md
- License: LICENSE
# Math::DistanceFunctions::Edit
[![Actions Status](https://github.com/antononcube/Raku-Math-DistanceFunctions-Edit/actions/workflows/linux.yml/badge.svg)](https://github.com/antononcube/Raku-Math-DistanceFunctions-Edit/actions)
[![Actions Status](https://github.com/antononcube/Raku-Math-DistanceFunctions-Edit/actions/workflows/macos.yml/badge.svg)](https://github.com/antononcube/Raku-Math-DistanceFunctions-Edit/actions)
[![License: Artistic-2.0](https://img.shields.io/badge/License-Artistic%202.0-0298c3.svg)](https://opensource.org/licenses/Artistic-2.0)
Raku package of fast Damerau-Levenshtein distance functions based on C code via "NativeCall".
For a pure Raku implementation see ["Text::Levenshtein::Damerau"](https://raku.land/github:ugexe/Text::Levenshtein::Damerau), [NLp1].
-----
## Usage examples
The main function provided by this package is `edit-distance`.
Here is a comparison with `dld` from "Text::Levenshtein::Damerau"
over two string arguments:

```perl6
use Math::DistanceFunctions::Edit;
use Text::Levenshtein::Damerau;

my ($w1, $w2) = ('examples', 'samples');
say 'edit-distance : ', edit-distance($w1, $w2);
say 'dld : ', dld($w1, $w2);
```

Vectors of integers, booleans, or strings can also be used:
```perl6
edit-distance(, ):ignore-case;
```

```perl6
edit-distance([True, False, False, True], [True, False, False]);
```

**Remark:** Currently, elements of integer lists are converted to `int32`.
If larger integers are used, convert them to `Str` first.

-----
## Motivation
The motivation for making this package was the slow performance of the DSL translation functions in the package
["DSL::Translators"](https://github.com/antononcube/Raku-DSL-Translators), [AAp1].
Profiling showed that about 50% of the time is spent in the function `dld` of "Text::Levenshtein::Damerau".

That is the case because of the fuzzy matching which "DSL::Translators" does:
```perl6
use DSL::Translators;

dsl-translation('use @dfTitanic; group by sex; show couns;', to => 'Raku')
```

The slowdown caused by the "expensive" to compute results of `dld` can be addressed in two ways:

- Make certain clever checks before invoking `dld`.
- Create a new function called `edit-distance` in C and set up a "NativeCall" connection to it.

Both approaches were taken: the first in "DSL::Shared", [AAp2], the second in "Math::DistanceFunctions::Edit".
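
The README does not say which checks "DSL::Shared" performs, so the following is only a minimal C sketch of what such cheap checks could look like. The names `precheck`, `precheck_result`, and the threshold parameter `max_dist` are hypothetical. It relies on two facts: identical strings have distance 0, and the edit distance is never smaller than the difference of the string lengths (in bytes), because only insertions and deletions change the length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of the cheap checks (not part of any package API). */
enum precheck_result { PRECHECK_EQUAL, PRECHECK_TOO_FAR, PRECHECK_UNDECIDED };

/* Decide, without filling a distance matrix, whether the full
 * edit-distance computation is needed for a given threshold. */
enum precheck_result precheck(const char *a, const char *b, size_t max_dist) {
    /* Identical strings: the distance is 0. */
    if (strcmp(a, b) == 0) return PRECHECK_EQUAL;

    /* The length difference is a lower bound on the edit distance. */
    size_t la = strlen(a), lb = strlen(b);
    size_t gap = la > lb ? la - lb : lb - la;
    if (gap > max_dist) return PRECHECK_TOO_FAR;

    /* Otherwise the expensive computation cannot be avoided. */
    return PRECHECK_UNDECIDED;
}
```

With a threshold of 1, `precheck("couns", "counts", 1)` is undecided and the full distance has to be computed, while `precheck("by", "translation", 2)` rejects the pair immediately.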
-----
## Implementation
The design of the "NativeCall" hook-up is taken from ["Algorithm::KdTree"](https://raku.land/github:titsuki/Algorithm::KdTree), [ITp1].
The actual C implementation was produced through several iterations of LLM code generation.
I considered porting the Raku code of `dld` in [NLp1] to C, but since
[Damerau-Levenshtein distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance) is a
[very well known, popular topic](https://rosettacode.org/wiki/Levenshtein_distance),
LLM generations with simple prompts were used instead.
(And, yes, I read the code and tested it.)
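
For readers unfamiliar with the algorithm, here is a minimal C sketch of the restricted Damerau-Levenshtein distance (also known as optimal string alignment). It is not the package's actual C code; the function name `osa_distance` and the full-matrix layout are assumptions made for illustration. The sketch compares bytes, so a multi-byte UTF-8 character counts as several symbols.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three ints. */
static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Optimal string alignment distance between two byte strings.
 * Returns -1 if memory allocation fails. (Hypothetical name.) */
int osa_distance(const char *s, const char *t) {
    size_t n = strlen(s), m = strlen(t);
    size_t cols = m + 1;
    int *d = malloc((n + 1) * cols * sizeof *d);
    if (d == NULL) return -1;

    /* Distance from or to the empty prefix. */
    for (size_t i = 0; i <= n; i++) d[i * cols] = (int)i;
    for (size_t j = 0; j <= m; j++) d[j] = (int)j;

    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int best = min3(d[(i - 1) * cols + j] + 1,            /* deletion */
                            d[i * cols + (j - 1)] + 1,            /* insertion */
                            d[(i - 1) * cols + (j - 1)] + cost);  /* substitution */
            /* Transposition of two adjacent symbols. */
            if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]) {
                int trans = d[(i - 2) * cols + (j - 2)] + 1;
                if (trans < best) best = trans;
            }
            d[i * cols + j] = best;
        }
    }

    int result = d[n * cols + m];
    free(d);
    return result;
}
```

For the words from the usage example, `osa_distance("examples", "samples")` gives 2 (one deletion and one substitution), and `osa_distance("ab", "ba")` gives 1 (one transposition). The full matrix keeps the sketch easy to read; an implementation tuned for speed would typically keep only the last three rows.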
-----
## Profiling and performance
Since speed is the most important reason for this package, profiling was done after each refactoring step
once the initial version was complete. See the file ["faster-word-distances.raku"](./examples/faster-word-distances.raku).

- For ASCII (non-UTF-8) strings `edit-distance` is ≈70 times faster than `dld`.
- For UTF-8 strings it is ≈5 times faster.

Here is an example output of the normalized profiling times produced by the script "faster-word-distances.raku":
```
StrDistance => 1
dld => 0.847204294559419
edit-distance => 0.011560672845434399
rosetta => 2.5342606961356466
sift => 0.021171925438510746
```

**Remark:** The timing of Raku's built-in [`StrDistance`](https://docs.raku.org/type/StrDistance) is used to normalize the rest of the timings.

**Remark:** The profiling also included `sift4` from ["Text::Diff::Sift4"](https://raku.land/github:MasterDuke17/Text::Diff::Sift4), [MDp1].
(NQP-based implementation.)

-----
## References
[AAp1] Anton Antonov,
[DSL::Translators Raku package](https://github.com/antononcube/Raku-DSL-Translators),
(2020-2024),
[GitHub/antononcube](https://github.com/antononcube/).

[AAp2] Anton Antonov,
[DSL::Shared Raku package](https://github.com/antononcube/Raku-Shared),
(2020-2024),
[GitHub/antononcube](https://github.com/antononcube/).

[ITp1] Itsuki Toyota,
[Algorithm::KdTree Raku package](https://github.com/titsuki/p6-Algorithm-KdTree),
(2016-2024),
[GitHub/titsuki](https://github.com/titsuki).

[MDp1] MasterDuke17,
[Text::Diff::Sift4 Raku package](https://github.com/MasterDuke17/Text-Diff-Sift4),
(2016-2021),
[GitHub/MasterDuke17](https://github.com/MasterDuke17).

[NLp1] Nick Logan,
[Text::Levenshtein::Damerau Raku package](https://github.com/ugexe/Raku-Text--Levenshtein--Damerau),
(2016-2022),
[GitHub/ugexe](https://github.com/ugexe/).