https://github.com/smimram/levenfind
Find pairs of similar files according to Levenshtein distance.
- Host: GitHub
- URL: https://github.com/smimram/levenfind
- Owner: smimram
- License: gpl-3.0
- Created: 2020-11-24T15:20:25.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-12-18T10:10:07.000Z (23 days ago)
- Last Synced: 2024-12-18T11:28:30.144Z (23 days ago)
- Topics: edit-distance, levenshtein-distance, ocaml
- Language: OCaml
- Size: 46.9 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
levenfind
=========

A tool to find pairs of similar files, typically for checking that your students
don't cheat. It works on text files, so it should be pretty much agnostic with
respect to the language used in the files (be it natural language or code). For
now, we use the [Levenshtein
distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (aka _edit
distance_) in order to compare contents.

It takes all the files in a directory (by default the current one) and shows all
pairs of files whose similarity is above a given threshold (60% by default). The
algorithm is quadratic, so don't be surprised if it takes some time on directories
with a few files, especially if some of those are big.

## Usage

```bash
levenfind directory
```

Useful options include

- `--extension` in order to specify the extension of files to consider,
- `--lines` in order to compare files line by line instead of character by
character (this is much faster, but will consider slightly different lines as
distinct),
- `--threshold` in order to specify the threshold above which similar files
should be reported.
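
To make the comparison described above more concrete, here is a minimal sketch in Python of a Levenshtein-based similarity check over pairs of texts. It is not the tool's actual OCaml implementation: the function names (`levenshtein`, `similarity`, `similar_pairs`) are hypothetical, and the normalization used to turn a distance into a similarity (one minus the distance divided by the length of the longer input) is an assumption, since the way levenfind computes its percentage is not stated here. The `by_lines` flag only mimics the idea behind `--lines`: whole lines become the units being compared.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of lines)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # drop x
                current[j - 1] + 1,          # add y
                previous[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer input."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similar_pairs(texts, threshold=0.6, by_lines=False):
    """Yield (name1, name2, similarity) for every pair above the threshold."""
    units = {
        name: (text.splitlines() if by_lines else text)
        for name, text in texts.items()
    }
    # Every pair is compared once, hence the quadratic number of comparisons.
    for (n1, u1), (n2, u2) in combinations(units.items(), 2):
        s = similarity(u1, u2)
        if s > threshold:
            yield n1, n2, s
```

For instance, `levenshtein("kitten", "sitting")` is 3. In line mode, two lines that differ by a single character count as one full substitution, which is why that mode is faster but coarser.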