https://github.com/russHyde/dupree
{dupree} helps identify code blocks that have a high level of similarity in a set of R files
- Host: GitHub
- URL: https://github.com/russHyde/dupree
- Owner: russHyde
- License: other
- Created: 2018-09-03T15:13:43.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2024-04-03T10:55:09.000Z (about 1 year ago)
- Last Synced: 2024-11-27T12:04:48.161Z (5 months ago)
- Language: R
- Homepage: https://russhyde.github.io/dupree/
- Size: 279 KB
- Stars: 37
- Watchers: 3
- Forks: 0
- Open Issues: 20
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

[Codecov test coverage](https://codecov.io/gh/russHyde/dupree?branch=main)
[GitHub Actions](https://github.com/russHyde/dupree/actions)

# dupree
The goal of `dupree` is to identify chunks / blocks of highly duplicated code
within a set of R scripts.

A very lightweight approach is used:

- The user provides a set of `*.R` and/or `*.Rmd` files;

- All R-code in the user-provided files is read and code-blocks are identified;

- The non-trivial symbols from each code-block are retained (for instance,
  really common symbols like `<-`, `,`, `+`, `(` are dropped);

- Similarity between different blocks is calculated using `stringdist::seq_sim`
  by longest-common-subsequence (symbol-identity is at whole-word level - so
  "my_data", "my_Data", "my.data" and "myData" are not considered to be identical
  in the calculation - and all non-trivial symbols have equal weight in the
  similarity calculation);

- Code-block pairs (both between and within the files) are returned in order
  of highest similarity.

To prevent the results being dominated by high-identity blocks containing very
few symbols (eg, `library(dplyr)`) the user can specify a `min_block_size`. Any
code-block containing at least this many non-trivial symbols will be kept.

## Installation
You can install `dupree` from GitHub with:

```{r gh-installation, eval = FALSE}
if (!"dupree" %in% installed.packages()) {
  # Alternatively:
  # install.packages("dupree")
  remotes::install_github("russHyde/dupree")
}
```

## Example
To run `dupree` over a set of R files, you can use the `dupree()`,
`dupree_dir()` or `dupree_package()` functions. For example, to identify
duplication within all of the `.R` and `.Rmd` files for the `dupree` package
you could run the following:

```{r example}
## basic example code
library(dupree)

# all files ending in .R or .Rmd, searched recursively from the working directory
files <- dir(pattern = "\\.R(md)?$", recursive = TRUE)
dupree(files)
```

Any top-level code blocks that contain at least
`r formals(dupree)$min_block_size` non-trivial tokens are
included in the above analysis (a token being a function or variable name, an
operator etc; but ignoring comments, white-space and some really common tokens:
`[](){}-+$@:,=`, `<-`, `&&` etc). To be more restrictive, you could consider
larger code-blocks (increase `min_block_size`) within just the `./R/` source
code directory:

```{r}
# R-source code files in the ./R/ directory of the dupree package:
source_files <- dir(path = "./R", pattern = "\\.R(md)?$", full.names = TRUE)

# analyse any code blocks that contain at least 50 non-trivial tokens
dupree(source_files, min_block_size = 50)
```

For each (sufficiently big) code block in the provided files, `dupree` will
return the code-block that is most-similar to it (although any given block
may be present in the results multiple times if it is the closest match for
several other code blocks).

Code block pairs with a higher `score` value are more similar. `score` lies in
the range [0, 1] and is calculated by the
[`stringdist`](https://github.com/markvanderloo/stringdist) package. Matching
occurs at the token level: the token "my_data" is no more similar to the token
"myData" than it is to "x".

If you find code-block-pairs with a similarity score much greater than 0.5
there is probably some commonality that could be abstracted away.

----
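
The token filtering and token-level scoring described above can be sketched in
base R. This is only an illustration under stated assumptions:
`trivial_symbols`, `nontrivial_tokens()`, `lcs_length()` and
`token_similarity()` are hypothetical helpers, not part of `dupree`; the list
of trivial symbols is an assumed subset of what `dupree` drops; and the score
is normalised as `2 * LCS / (length(a) + length(b))`, which is assumed to
mirror how a longest-common-subsequence similarity is scaled to [0, 1].
`dupree` itself delegates the scoring to `stringdist::seq_sim`.

```r
# Hypothetical helpers for illustration only; not dupree's implementation.

# Symbols treated as "trivial" here (an assumed subset of what dupree drops).
trivial_symbols <- c(
  "<-", "=", ",", "+", "-", "$", "@", ":", "&&",
  "(", ")", "[", "]", "[[", "{", "}"
)

# Return the non-trivial tokens of one code block, in source order.
nontrivial_tokens <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # keep leaf tokens only, and ignore comments
  tokens <- tokens[tokens$terminal & tokens$token != "COMMENT", ]
  tokens$text[!tokens$text %in% trivial_symbols]
}

# Length of the longest common subsequence of two token vectors.
# Tokens either match exactly or not at all (whole-word identity).
lcs_length <- function(a, b) {
  dp <- matrix(0L, nrow = length(a) + 1L, ncol = length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      dp[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i + 1L, j], dp[i, j + 1L])
      }
    }
  }
  dp[length(a) + 1L, length(b) + 1L]
}

# Assumed normalisation: 1 for identical token sequences, 0 for no overlap.
token_similarity <- function(a, b) {
  if (length(a) + length(b) == 0L) {
    return(1)
  }
  2 * lcs_length(a, b) / (length(a) + length(b))
}

tokens_a <- nontrivial_tokens("my_data <- read.csv(path)\nsummary(my_data)")
tokens_b <- nontrivial_tokens("myData <- read.csv(path)\nsummary(myData)")

token_similarity(tokens_a, tokens_b)
#> [1] 0.6
```

The two blocks differ only in the spelling of one variable, yet "my_data" and
"myData" count as completely different tokens, so only three of the five
tokens in each block line up and the score is 0.6 rather than close to 1.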
Note that you can do something similar using the functions `dupree_dir` and
(if you are analysing a package) `dupree_package`.

```{r}
# Analyse all R files in the R/ directory:
dupree_dir(".", filter = "R/")
```

```{r}
# Analyse all R files except those in the tests / presentations directories:
# `dupree_dir` uses grep-like arguments
dupree_dir(
  ".",
  filter = "tests|presentations", invert = TRUE
)
```

```{r}
# Analyse all R source code in the package (only looking at the ./R/ directory)
dupree_package(".")
```
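
To get a feel for what "grep-like arguments" means for `filter` and `invert`,
the same patterns can be tried with base R's `grep()` on a vector of file
paths. This is a rough sketch, assuming the pattern is matched as a regular
expression against each file path; the paths below are made up for
illustration and the exact matching done inside `dupree_dir` may differ.

```r
# Hypothetical file listing; these paths are invented for the example.
files <- c(
  "R/dupree.R",
  "R/dupree_classes.R",
  "tests/testthat/test-dupree.R",
  "presentations/intro.Rmd",
  "README.Rmd"
)

# Keep only the paths that match the pattern (compare `filter = "R/"`).
kept <- grep("R/", files, value = TRUE)
kept
#> [1] "R/dupree.R"         "R/dupree_classes.R"

# Drop the paths that match the pattern
# (compare `filter = "tests|presentations", invert = TRUE`).
remaining <- grep("tests|presentations", files, value = TRUE, invert = TRUE)
remaining
#> [1] "R/dupree.R"         "R/dupree_classes.R" "README.Rmd"
```

Because the pattern is an unanchored regular expression, `"R/"` would also
match a path such as `"otherR/script.R"`; anchor it (for example `"^R/"`) if
that matters for your project layout.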