https://github.com/smimram/levenfind

Find pairs of similar files according to Levenshtein distance.
edit-distance levenshtein-distance ocaml

levenfind
=========

A tool to find pairs of similar files, typically for checking that your students
don't cheat. It works on text files so it should be pretty much agnostic with
respect to the language used in the files (be it natural language or code). For
now, we use the [Levenshtein
distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (aka _edit
distance_) in order to compare contents.
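
To make the notion concrete, here is a minimal Python sketch of the edit distance together with one possible way of turning it into a similarity score. It is only an illustration, not the code used by levenfind (which is written in OCaml): the function names are hypothetical, and the normalisation by the length of the longer text is an assumption about how a percentage could be derived.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical texts, 0.0 when every
    position has to be edited."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

With this definition, `levenshtein("kitten", "sitting")` is 3, so the two words would be about 57% similar.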

It takes all the files in a directory (by default the current one) and shows all
pairs of files whose similarity is above a given threshold (60% by default). The
algorithm is quadratic, so don't be surprised if it takes some time, even on
directories with only a few files, especially if some of those are big.
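
The pairwise search can be sketched as follows in Python. This is a rough illustration rather than the actual implementation: `similar_pairs` and `pair_count` are hypothetical names, the input is assumed to be a mapping from file names to their contents, and `difflib`'s ratio is used purely as a stand-in for a similarity score (it is not the Levenshtein distance).

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_similarity(a: str, b: str) -> float:
    # Stand-in only: difflib's ratio is NOT the Levenshtein distance.
    return SequenceMatcher(None, a, b).ratio()


def similar_pairs(contents, similarity=stand_in_similarity, threshold=0.6):
    """Return (name_a, name_b, score) for every pair of files whose
    score is above `threshold`. `contents` maps file names to text."""
    found = []
    # combinations(..., 2) visits each unordered pair exactly once.
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(contents.items()), 2):
        score = similarity(text_a, text_b)
        if score > threshold:
            found.append((name_a, name_b, score))
    return found


def pair_count(n: int) -> int:
    """Number of comparisons needed for n files."""
    return n * (n - 1) // 2
```

For 100 files, `pair_count(100)` gives 4950 comparisons, which is why the running time grows quickly with the size of the directory.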

## Usage

```bash
levenfind directory
```

Useful options include:

- `--extension` to specify the extension of the files to consider,
- `--lines` to compare files line by line instead of character by character
  (this is much faster, but treats lines that differ even slightly as distinct),
- `--threshold` to specify the threshold above which similar files should be
  reported.
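
The trade-off behind `--lines` can be illustrated with a short Python sketch. It assumes that line mode simply runs the same edit distance over the list of lines instead of the string of characters; this is a guess at the idea, not a description of levenfind's code, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences
    (strings, or lists of lines)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


old = "let x = 1\nlet y = 2\nprint x\n"
new = "let x = 1\nlet y = 3\nprint x\n"

# Character mode: one edited character out of 28.
by_char = edit_distance(old, new)
# Line mode: one edited line out of 3, since the changed line counts as distinct.
by_line = edit_distance(old.splitlines(), new.splitlines())
```

Both distances are 1 here, but relative to the input size the files look far more different in line mode (1 line out of 3) than in character mode (1 character out of 28). On the other hand, the table filled in by the algorithm has one cell per pair of lines rather than per pair of characters, which is where the speed-up would come from.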