https://github.com/lindig/strsim
strsim - compare strings for similarity
- Host: GitHub
- URL: https://github.com/lindig/strsim
- Owner: lindig
- Created: 2012-07-16T14:14:31.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2012-07-16T14:18:13.000Z (over 12 years ago)
- Last Synced: 2023-03-25T00:24:31.253Z (over 1 year ago)
- Language: OCaml
- Size: 89.8 KB
- Stars: 9
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Strsim - compare two strings and emit those that are similar
This is a small utility that reads pairs of strings from stdin, compares each
pair for similarity, and emits the pairs whose similarity exceeds a given
threshold. It was developed for detecting personal names that are likely to be
equal.

    $ echo "Donald Knuth^Donald E. Knuth" | ./strsim -t '^' -d 0.8

The invocation above would emit the line because it finds the two
strings `Donald Knuth` and `Donald E. Knuth` similar.

Several metrics for comparing strings exist, of which the edit distance is
probably the best known. This tool implements another algorithm that is
relatively robust against swapped characters and small additions and
deletions. The algorithm builds for each string the set of all pairs of
adjacent characters and compares these sets:

Let x and y be strings, xs and ys the corresponding sets of adjacent pairs
from these strings, and ss their intersection. The similarity s of x and y is
computed as

    s = (2*|ss|)/(|xs|+|ys|)

where |xs| denotes the cardinality of set xs. Example:

    x  = hello
    y  = hallo (German for hello)
    xs = {he,el,ll,lo}
    ys = {ha,al,ll,lo}
    ss = {ll,lo}

    s = (2*2)/(4+4) = 4/8 = 0.5

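
The computation above can be sketched in a few lines of OCaml. This is an illustration written from the description in this README, not the strsim source: the names `PairSet`, `pairs`, and `similarity` are made up for the example, and returning 0.0 when neither string contains a pair is an assumption of the sketch.

```ocaml
(* Sketch of the pair-set similarity described above.
   All names here are illustrative, not taken from the strsim source. *)

module PairSet = Set.Make (struct
  type t = char * char
  let compare = Stdlib.compare
end)

(* Collect the set of pairs of adjacent characters in [s]. *)
let pairs (s : string) : PairSet.t =
  let n = String.length s in
  let rec loop i acc =
    if i + 1 >= n then acc
    else loop (i + 1) (PairSet.add (s.[i], s.[i + 1]) acc)
  in
  loop 0 PairSet.empty

(* s = (2*|ss|)/(|xs|+|ys|).
   Assumption: if neither string has any pair, the result is 0.0. *)
let similarity (x : string) (y : string) : float =
  let xs = pairs x and ys = pairs y in
  let ss = PairSet.inter xs ys in
  let total = PairSet.cardinal xs + PairSet.cardinal ys in
  if total = 0 then 0.0
  else float_of_int (2 * PairSet.cardinal ss) /. float_of_int total

let () =
  (* hello / hallo share {ll, lo}: (2*2)/(4+4), prints 0.50 *)
  Printf.printf "%.2f\n" (similarity "hello" "hallo")
```

Under this sketch, `Donald Knuth` yields 11 pairs and `Donald E. Knuth` yields 14, with 11 in common, so the similarity is 22/25 = 0.88, which is above the 0.8 threshold used in the invocation shown earlier.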
## Usage and Options

    $ ./strsim -h
    usage: strsim options

    strsim reads lines from stdin, splits them in two halfs
    and emits all lines whose half exceed a given similarity
    threshold in range 0.0..1.0

    options:
    -t c        split input lines at character c; default is tab
    -d 0.8      emit lines with similarity of 0.8 or greater; default 0.9
    -h          emit this help to stderr

Strsim reads input line by line from stdin and splits each line into two
strings, which it compares. The line is split at the first tab character, or
at the character given by option `-t` if provided. The threshold that needs to
be exceeded is 0.9 by default and is controlled by option `-d`.

## Building
Strsim is implemented in Objective Caml. To build it, simply invoke Make:

    $ make

The `Makefile` relies on `ocamlbuild` for the actual build process.

## Author
Christian Lindig
## Copyright
This code is in the public domain.