https://github.com/proycon/sesdiff
Generates a shortest edit script (Myers' diff algorithm) to indicate how to get from the strings in column A to the strings in column B. Also provides the edit distance (levenshtein).
- Host: GitHub
- URL: https://github.com/proycon/sesdiff
- Owner: proycon
- License: gpl-3.0
- Created: 2020-08-14T09:45:45.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-10-15T08:58:20.000Z (12 months ago)
- Last Synced: 2025-04-23T04:09:10.480Z (6 months ago)
- Topics: diff, levenshtein, nlp, shortest-edit-script
- Language: Rust
- Homepage: https://git.sr.ht/~proycon/sesdiff
- Size: 76.2 KB
- Stars: 7
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# sesdiff: Shortest Edit Script Diff
## Description
This is a small and fast command-line tool and Rust library that reads two-column, tab-separated input from standard input and computes the shortest edit script (Myers' diff algorithm) to go from the string in column A to the string in column B. It also computes the edit distance (also known as Levenshtein distance).
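To make the notion of an edit script concrete, here is a minimal Python sketch that applies a script to the string in column A to recover the string in column B. It uses the notation described under Usage (``=[x]`` keeps ``x``, ``-[x]`` removes it, ``+[x]`` adds it). This is an independent illustration, not sesdiff's own code and not its Python binding: the name ``apply_script`` is hypothetical, and the sketch assumes that the bracketed text never contains square brackets and that the script covers the whole input string.

```python
import re

# One operation: an operator (=, - or +) followed by text in square brackets.
# Assumption: the bracketed text itself contains no square brackets.
OP = re.compile(r"([=+-])\[([^\[\]]*)\]")
SCRIPT = re.compile(r"(?:[=+-]\[[^\[\]]*\])*")


def apply_script(source: str, script: str) -> str:
    """Apply a left-to-right edit script to `source` (hypothetical helper)."""
    if not SCRIPT.fullmatch(script):
        raise ValueError(f"malformed edit script: {script!r}")
    out = []
    pos = 0  # number of characters of `source` consumed so far
    for op, text in OP.findall(script):
        if op == "+":
            out.append(text)  # added text appears only in the result
            continue
        # '=' and '-' both consume text that must be present in the source
        if not source.startswith(text, pos):
            raise ValueError(f"{op}[{text}] does not match at position {pos}")
        pos += len(text)
        if op == "=":
            out.append(text)  # unchanged text is copied to the result
    if pos != len(source):
        raise ValueError("script does not cover the whole source string")
    return "".join(out)


if __name__ == "__main__":
    for a, script in [
        ("hablaron", "=[hablar]-[on]"),
        ("pidieron", "=[p]-[i]+[e]=[di]-[eron]+[r]"),
    ]:
        print(a, script, apply_script(a, script), sep="\t")
```

The sketch works on Python strings, so it counts Unicode characters rather than bytes; whether sesdiff counts the same way is not stated here.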
There is also a [Python binding](python/) available if you want to use sesdiff
from Python. The documentation here covers the command-line version.

sesdiff was written to build lemmatisers.
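As an illustration of the lemmatisation use case, the following Python sketch applies a suffix-anchored edit script (the kind produced with ``--suffix``, see Usage) to a word form, working from the right end of the string. The same script ``-[on]`` then maps both ``hablaron`` and ``contaron`` to their lemmas. This is a sketch under assumptions rather than sesdiff's implementation or its Python binding: ``apply_suffix_script`` is a hypothetical name, the bracketed text is assumed to contain no square brackets, and ``=[#n]`` is read as "keep ``n`` characters", following the ``--abstract`` description.

```python
import re

# One operation: an operator (=, - or +) followed by text in square brackets.
# Assumption: the bracketed text itself contains no square brackets.
OP = re.compile(r"([=+-])\[([^\[\]]*)\]")
SCRIPT = re.compile(r"(?:[=+-]\[[^\[\]]*\])*")


def apply_suffix_script(word: str, script: str) -> str:
    """Apply an edit script that starts at the right end of `word`."""
    if not SCRIPT.fullmatch(script):
        raise ValueError(f"malformed edit script: {script!r}")
    rest = word  # the part of the word not yet consumed (shrinks from the right)
    tail = []    # result pieces, collected right-to-left
    for op, text in OP.findall(script):
        if op == "+":
            tail.append(text)
            continue
        if op == "=" and text.startswith("#") and text[1:].isdigit():
            # Abstract form: keep the last n characters, whatever they are.
            n = int(text[1:])
            if n > len(rest):
                raise ValueError(f"cannot keep {n} characters of {rest!r}")
            text = rest[len(rest) - n:]
        if not rest.endswith(text):
            raise ValueError(f"{op}[{text}] does not match the end of {rest!r}")
        rest = rest[: len(rest) - len(text)]
        if op == "=":
            tail.append(text)
    # Whatever was never consumed is the common prefix and stays untouched.
    return rest + "".join(reversed(tail))


if __name__ == "__main__":
    for form in ["hablaron", "contaron"]:
        print(form, apply_suffix_script(form, "-[on]"), sep="\t")
    print(apply_suffix_script("pidieron", "-[eron]+[r]=[#2]-[i]+[e]"))
```

Because the unconsumed left part of the word is passed through unchanged, one script can generalise over all word forms that share the same ending.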
## Installation
Install it using Rust's package manager:
```
cargo install sesdiff
```

No cargo/rust on your system yet? Run ``sudo apt install cargo`` on Debian/Ubuntu-based systems, ``brew install rust`` on macOS, or use [rustup](https://rustup.rs/).
This tool builds upon [Dissimilar](https://crates.io/crates/dissimilar), which provides the actual diff algorithm (it is
downloaded and compiled in automatically).

## Usage
```
$ sesdiff < input.tsv
```

Example input and output (reformatted for legibility; the first two columns correspond to the input). Output is in a four-column, tab-separated format:
```
hablaron hablar =[hablar]-[on] 2
contaron contar =[contar]-[on] 2
pidieron pedir =[p]-[i]+[e]=[di]-[eron]+[r] 6
говорим говорить =[говори]-[м]+[ть] 3
```

By default the full edit script will be provided in a simple language:
* ``=[]`` - The text between brackets is identical in strings A and B
* ``=[#n]`` - If you use the ``--abstract`` parameter, this will be used instead, where ``n`` is a number
  indicating the length of the text that is identical in strings A and B
* ``-[]`` - The text between brackets is removed to get to string B
* ``+[]`` - The text between brackets is added to get to string B

For lemmatisation purposes, it makes sense for many languages to look at
suffixes (from right to left) and strip common prefixes. Pass the ``--suffix``
option for that behaviour; the output is then:

```
$ sesdiff --suffix < input.tsv
hablaron hablar -[on] 2
contaron contar -[on] 2
pidieron pedir -[eron]+[r]=[di]-[i]+[e] 6
говорим говорить -[м]+[ть] 3
```

Note that the edit scripts in suffix mode are formulated differently than in normal mode (they start from the right
too). There is also a ``--prefix`` option that strips common suffixes.

Use the ``--abstract`` parameter to get a slightly more abstract edit script that refers to the length of unchanged parts
rather than their contents. You would then get:

```
pidieron pedir -[eron]+[r]=[#2]-[i]+[e] 6
```

Sesdiff can also apply edit scripts to your input: use the ``--apply`` flag and feed the tool tab-separated input with
a string in the first column and an edit script in the second, as in the following example ``input2.tsv``:

```
$ cat input2.tsv
pidieron -[eron]+[r]=[di]-[i]+[e]
```

Run sesdiff as follows and a third column will be added with the solution:
```
$ sesdiff --suffix --apply < input2.tsv
pidieron -[eron]+[r]=[di]-[i]+[e] pedir
```

When using ``--apply``, you can also use the extra ``--infix`` parameter to indicate that the tool should try to match
the edit script against any infix in the string, including multiple occurrences. Consider the following example, which replaces
all letters *a* with *o*:

```
$ cat input3.tsv
hahaha -[a]+[o]
$ sesdiff --infix --apply < input3.tsv
hahaha -[a]+[o] hohoho
```

In ``--apply`` mode, you can also make edit scripts applicable to multiple patterns by using the ``|`` operator. This is
only allowed for deletions (``-[]``) and equality checks (``=[]``):

```
$ cat input4.tsv
hihaho -[a|i|o]+[e]
$ sesdiff --infix --apply < input4.tsv
hihaho -[a|i|o]+[e] hehehe
```

# License
GNU General Public Licence v3