https://github.com/russHyde/dupree

{dupree} helps identify code blocks that have a high level of similarity in a set of R files

Host: GitHub
URL: https://github.com/russHyde/dupree
Owner: russHyde
License: other
Created: 2018-09-03T15:13:43.000Z (over 7 years ago)
Default Branch: main
Last Pushed: 2024-04-03T10:55:09.000Z (over 1 year ago)
Last Synced: 2024-11-27T12:04:48.161Z (about 1 year ago)
Language: R
Homepage: https://russhyde.github.io/dupree/
Size: 279 KB
Stars: 37
Watchers: 3
Forks: 0
Open Issues: 20
Metadata Files:
- Readme: README.Rmd
- License: LICENSE


README

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

[![Codecov test coverage](https://codecov.io/gh/russHyde/dupree/branch/main/graph/badge.svg)](https://codecov.io/gh/russHyde/dupree?branch=main)

[![R-CMD-check](https://github.com/russHyde/dupree/workflows/R-CMD-check/badge.svg)](https://github.com/russHyde/dupree/actions)

# dupree

The goal of `dupree` is to identify chunks / blocks of highly duplicated code
within a set of R scripts.

A very lightweight approach is used:

- The user provides a set of `*.R` and/or `*.Rmd` files;
- All R code in the user-provided files is read and code-blocks are identified;
- The non-trivial symbols from each code-block are retained (for instance,
  really common symbols like `<-`, `,`, `+`, `(` are dropped);
- Similarity between different blocks is calculated using `stringdist::seq_sim`
  by longest common subsequence. Symbol identity is at whole-word level, so
  "my_data", "my_Data", "my.data" and "myData" are not considered to be
  identical in the calculation, and all non-trivial symbols have equal weight
  in the similarity calculation;
- Code-block pairs (both between and within the files) are returned in order
  of highest similarity.
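
The similarity step above can be sketched in base R. This is only an illustration of longest-common-subsequence similarity on whole tokens, not `dupree`'s implementation: the helpers `lcs_length()` and `token_similarity()` are hypothetical names, and the normalisation used here (twice the LCS length divided by the total number of tokens) is an assumption that may differ in detail from what `stringdist::seq_sim` computes.

```r
# Hypothetical helper: length of the longest common subsequence of two
# character vectors, computed by standard dynamic programming.
lcs_length <- function(a, b) {
  n <- length(a)
  m <- length(b)
  if (n == 0 || m == 0) {
    return(0)
  }
  # tab[i + 1, j + 1] holds the LCS length of a[1:i] and b[1:j]
  tab <- matrix(0, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (a[i] == b[j]) {
        tab[i, j] + 1
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

# Hypothetical helper: similarity in [0, 1]; every token has equal weight
# and tokens only match when they are exactly identical.
token_similarity <- function(a, b) {
  total <- length(a) + length(b)
  if (total == 0) {
    return(1)
  }
  2 * lcs_length(a, b) / total
}

# Two made-up token sequences that differ only in a variable name
block_1 <- c("my_data", "read.csv", "path", "head", "my_data")
block_2 <- c("myData", "read.csv", "path", "head", "myData")

token_similarity(block_1, block_2) # 0.6: "my_data" and "myData" never match
```

Because matching is on whole tokens, renaming a variable throughout a block lowers the score by exactly as much as replacing it with an unrelated name.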

To prevent the results being dominated by high-identity blocks containing very
few symbols (e.g. `library(dplyr)`), the user can specify a `min_block_size`.
Only code-blocks containing at least this many non-trivial symbols are kept.
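
To make the effect of `min_block_size` concrete, here is a rough sketch that counts non-trivial tokens with base R's parser and then filters blocks by size. The helper `non_trivial_tokens()` is a hypothetical name, the `trivial` set is only the handful of symbols mentioned above, and the threshold of 5 is arbitrary; `dupree` itself may tokenise and filter differently.

```r
# Hypothetical helper: the non-trivial tokens of one block of R code.
# The default `trivial` set is illustrative, not dupree's actual list.
non_trivial_tokens <- function(code, trivial = c("<-", ",", "+", "(", ")")) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # Keep leaf tokens only, and drop comments
  leaves <- tokens[tokens$terminal & tokens$token != "COMMENT", ]
  leaves$text[!leaves$text %in% trivial]
}

# Two made-up code blocks
blocks <- c(
  "library(dplyr)",
  "x <- mean(c(1, 2, 3)) + 1"
)

token_lists <- lapply(blocks, non_trivial_tokens)
block_sizes <- lengths(token_lists)

# Keep only blocks with at least `min_block_size` non-trivial tokens
min_block_size <- 5 # arbitrary value for this sketch
kept <- blocks[block_sizes >= min_block_size]
```

Here `library(dplyr)` contributes only two non-trivial tokens (`library` and `dplyr`), so it is dropped, while the second block is kept.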

## Installation

You can install `dupree` from GitHub with:

```{r gh-installation, eval = FALSE}
# Install only if dupree is not already available
if (!requireNamespace("dupree", quietly = TRUE)) {
  # Alternatively:
  # install.packages("dupree")
  remotes::install_github("russHyde/dupree")
}
```

## Example

To run `dupree` over a set of R files, you can use the `dupree()`,
`dupree_dir()` or `dupree_package()` functions. For example, to identify
duplication within all of the `.R` and `.Rmd` files for the `dupree` package
you could run the following:

```{r example}
## basic example code
library(dupree)

# `pattern` is a regular expression, not a glob: match names ending in .R or .Rmd
files <- dir(pattern = "\\.R(md)?$", recursive = TRUE)

dupree(files)
```

Any top-level code blocks that contain at least
`r formals(dupree)$min_block_size` non-trivial tokens are
included in the above analysis. A token is a function or variable name, an
operator, etc.; comments, white-space and some really common tokens
(`[](){}-+$@:,=`, `<-`, `&&` etc.) are ignored. To be more restrictive, you
could consider larger code-blocks (increase `min_block_size`) within just the
`./R/` source code directory:

```{r}
# R-source code files in the ./R/ directory of the dupree package
# (`pattern` is a regular expression matching names ending in .R or .Rmd):
source_files <- dir(path = "./R", pattern = "\\.R(md)?$", full.names = TRUE)

# analyse any code blocks that contain at least 50 non-trivial tokens
dupree(source_files, min_block_size = 50)
```

For each (sufficiently big) code block in the provided files, `dupree` will
return the code-block that is most similar to it (although any given block
may be present in the results multiple times if it is the closest match for
several other code blocks).
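
As a small illustration of this "closest match" behaviour, the sketch below picks the best partner for each block from a made-up similarity matrix. The block names and scores are invented for the example, and this is not how `dupree` stores or computes its results.

```r
# Made-up similarity scores between three hypothetical blocks A, B and C
sim <- matrix(
  c(1.0, 0.8, 0.2,
    0.8, 1.0, 0.4,
    0.2, 0.4, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)
diag(sim) <- -Inf # a block should not be matched with itself

# For each block (row), find the column holding its highest score
best_index <- apply(sim, 1, which.max)

best <- data.frame(
  block = rownames(sim),
  closest = colnames(sim)[best_index],
  score = sim[cbind(seq_len(nrow(sim)), best_index)]
)

# Most similar pairs first
best[order(best$score, decreasing = TRUE), ]
```

In this toy example block B is the closest match for both A and C, so it appears more than once in the `closest` column.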

Code block pairs with a higher `score` value are more similar. `score` lies in
the range [0, 1] and is calculated by the
[`stringdist`](https://github.com/markvanderloo/stringdist) package. Matching
occurs at the token level: the token "my_data" is no more similar to the token
"myData" than it is to "x".

If you find code-block pairs with a similarity score much greater than 0.5,
there is probably some commonality that could be abstracted away.
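
One way to act on this rule of thumb is to shortlist the high-scoring pairs for review. The sketch below works on a made-up table: the file names, scores and column names (`file_a`, `file_b`, `score`) are assumptions for illustration, so check the structure of the object that `dupree()` actually returns before adapting it.

```r
# Made-up results table; column names and values are assumptions
results <- data.frame(
  file_a = c("R/a.R", "R/a.R", "R/b.R"),
  file_b = c("R/b.R", "R/c.R", "R/c.R"),
  score = c(0.82, 0.35, 0.57)
)

# Shortlist pairs scoring above 0.5, most similar first
candidates <- results[results$score > 0.5, ]
candidates <- candidates[order(candidates$score, decreasing = TRUE), ]
candidates
```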

----

Note that you can do something similar using the functions `dupree_dir` and
(if you are analysing a package) `dupree_package`.

```{r}
# Analyse all R files in the R/ directory:
dupree_dir(".", filter = "R/")
```

```{r}
# Analyse all R files except those in the tests / presentations directories:
# `dupree_dir` uses grep-like arguments
dupree_dir(
  ".",
  filter = "tests|presentations", invert = TRUE
)
```

```{r}
# Analyse all R source code in the package (only looking at the ./R/ directory)
dupree_package(".")
```
