# duplihere

### What
Copy & Paste finder for source files or any structured utf-8 text files.

### Why

A number of different copy and paste detectors exist. Some examples include:

* [PMD](https://pmd.github.io/)
* [Simian](http://www.harukizaemon.com/simian/)
* [tctoolkit](https://github.com/nitinbhide/tctoolkit)

So why write another? I wanted a simple tool, one that works like Simian
but is open source and free for everyone. Thus this project was born. In
general, I don't think writing a lexer and tokenizing the source is needed:
a great deal of code is copied and pasted verbatim.
Developers are lazy, they don't change things :-)
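The core idea, comparing runs of lines verbatim after stripping indentation, with no lexing or tokenizing, can be sketched in a few lines of Python. This is a simplified illustration of the technique, not duplihere's actual implementation:

```python
# Hash fixed-size windows of indentation-stripped lines and report any
# window whose content appears more than once. A simplified sketch only.
from collections import defaultdict

def find_duplicates(lines, window=6):
    """Return {window_content: [1-based start lines]} for repeated windows."""
    seen = defaultdict(list)
    stripped = [ln.strip() for ln in lines]
    for start in range(len(stripped) - window + 1):
        chunk = tuple(stripped[start:start + window])
        if any(chunk):  # skip windows that are entirely blank
            seen[chunk].append(start + 1)
    return {c: pos for c, pos in seen.items() if len(pos) > 1}

src = """\
def a():
    x = 1
    y = 2
    z = x + y
    print(z)
    return z

def b():
        x = 1
        y = 2
        z = x + y
        print(z)
        return z
""".splitlines()

dups = find_duplicates(src, window=5)
for chunk, positions in dups.items():
    print("duplicate block starting at lines", positions)
```

Note that the two function bodies are indented differently but still match, mirroring duplihere's behavior of allowing duplicates at different levels of indentation.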

### How

```bash
duplihere - 0.9.0 - find duplicate text

usage: duplihere [-pj -l -i -t ] -f

Find duplicate lines of text in one or more text files.

The duplicated text can be at different levels of indention,
but otherwise needs to be identical.

More information: https://github.com/tasleson/duplihere

argument: description
-p, --print print duplicate text [default: false]
-j, --json output JSON [default: false]
-l, --lines minimum number of duplicate lines [default: 6]
-f, --file pattern or file eg. "**/*.[h|c]" recursive, "*.py", "file.ext", can repeat [required]
-i, --ignore file containing hash values to ignore, one per line
-t, --threads number of threads to utilize. Set to 0 to match #cpu cores [default: 4]
```

An example that recursively searches one directory for Python files and
also checks a second directory containing Python files:
```bash
$ duplihere -l 10 -p -f '/home/user/somewhere/**/*.py' -f '/tmp/*.py'
```

An example showing the JSON output (format not yet finalized):

```bash
$ duplihere -f /home/tasleson/projects/linux/init/main.c -l 5 -j
```

```json
{
  "num_lines": 5,
  "num_ignored": 0,
  "duplicates": [
    {
      "key": 11558319874972720381,
      "num_lines": 5,
      "files": [
        [
          "/home/tasleson/projects/linux/init/main.c",
          830
        ],
        [
          "/home/tasleson/projects/linux/init/main.c",
          864
        ]
      ]
    }
  ]
}
```
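The JSON output is easy to consume programmatically. A short Python sketch, using the field names from the sample above, that summarizes each duplicate:

```python
# Summarize duplihere JSON output. Field names ("duplicates", "key",
# "num_lines", "files") are taken from the sample report above.
import json

report = json.loads("""
{
  "num_lines": 5,
  "num_ignored": 0,
  "duplicates": [
    {
      "key": 11558319874972720381,
      "num_lines": 5,
      "files": [
        ["/home/tasleson/projects/linux/init/main.c", 830],
        ["/home/tasleson/projects/linux/init/main.c", 864]
      ]
    }
  ]
}
""")

for dup in report["duplicates"]:
    where = ", ".join(f"{path}:{line}" for path, line in dup["files"])
    print(f"{dup['num_lines']} duplicated lines (key {dup['key']}) at {where}")
```

The `key` values look like the hash values that `--ignore` expects, so collecting keys into a file may be one way to suppress known duplicates, but that correspondence is an assumption worth verifying against the tool itself.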

### Status

The tool has enough features and functionality to produce meaningful results.
With the latest multi-thread support it is quite fast on
large source trees. Below is a graph of memory and CPU consumption while examining
the Linux kernel source tree for duplicates, run against the Linux `6.5` branch (~24M lines) with all
available CPU cores. The chart was generated with
[psrecord](https://github.com/astrofrog/psrecord).

![threadripper](https://github.com/tasleson/duplihere/assets/2520480/56e59144-e5b0-415c-90f0-b9459006f686)