https://github.com/jopereira/dude
DuDe is a duplication detector for text files
https://github.com/jopereira/dude
Last synced: 27 minutes ago
JSON representation
DuDe is a duplication detector for text files
- Host: GitHub
- URL: https://github.com/jopereira/dude
- Owner: jopereira
- Created: 2013-04-24T10:51:41.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2013-04-24T10:52:05.000Z (almost 12 years ago)
- Last Synced: 2024-11-08T09:43:23.227Z (5 months ago)
- Language: Java
- Homepage:
- Size: 117 KB
- Stars: 5
- Watchers: 2
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- project-awesome - jopereira/dude - DuDe is a duplication detector for text files (Java)
README
DuDe, the duplication detector for text files
=============================================DuDe finds duplicate segments in text, even when it has been edited
and reformatted, namely, by inserting or removing words, or by
breaking lines.Given a set o text files, it lists the total number of characters in
shared segments for file pairs that exceed a minimum sharing
threshold.This can be presented as a graph, with one node for each file and one
edge for each pair found. The shared portions for each pair can also
be shown in detail.This works only in text files. Other file types, such as .doc or .pdf
need first to be converted to plain text.Building and running
--------------------To build use `mvn package`. To run, use `java -jar target/dude-
-jar-with-dependencies.jar`. Running it without
any arguments presents a brief help on command line syntax.How does it work
----------------DuDe is a quick hack. It uses Rabin fingerprints to chop input
files in variable sized chunks and Bloom filters for indexing.
The following excellent libraries do the heavy lifting:
* https://github.com/themadcreator/rabinfingerprint
* http://code.google.com/p/guava-libraries/Source code
-----------Latest source is available at http://github.com/jopereira/dude
License
-------DuDe is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.DuDe is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.You should have received a copy of the GNU General Public License
along with DuDe. If not, see .