{"id":13468736,"url":"https://github.com/platisd/duplicate-code-detection-tool","last_synced_at":"2025-04-07T11:09:00.674Z","repository":{"id":38405336,"uuid":"157110425","full_name":"platisd/duplicate-code-detection-tool","owner":"platisd","description":"A simple Python3 tool to detect similarities between files within a repository","archived":false,"fork":false,"pushed_at":"2024-06-01T14:22:17.000Z","size":56,"stargazers_count":163,"open_issues_count":4,"forks_count":30,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-10-13T16:27:41.911Z","etag":null,"topics":["code-duplication","gensim","nlp"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/platisd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-11T18:52:16.000Z","updated_at":"2024-09-20T19:43:49.000Z","dependencies_parsed_at":"2024-06-18T16:47:28.000Z","dependency_job_id":"6f925553-3e34-45d6-8d7c-47a34f0b4a4f","html_url":"https://github.com/platisd/duplicate-code-detection-tool","commit_stats":{"total_commits":58,"total_committers":6,"mean_commits":9.666666666666666,"dds":0.1724137931034483,"last_synced_commit":"c5b6b0e974c358e4a736a4cdfdbc595fb85b9b89"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/platisd%2Fduplicate-code-detection-tool","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/platisd%2Fduplicate-code-detection-tool/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/platisd%2Fduplicate-code-detection-tool/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/platisd%2Fduplicate-code-detection-tool/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/platisd","download_url":"https://codeload.github.com/platisd/duplicate-code-detection-tool/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247640465,"owners_count":20971557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-duplication","gensim","nlp"],"created_at":"2024-07-31T15:01:17.861Z","updated_at":"2025-04-07T11:09:00.634Z","avatar_url":"https://github.com/platisd.png","language":"Python","readme":"# Duplicate Code Detection Tool\nA simple Python3 tool (also available as a [GitHub Action](#github-action)) to detect\nsimilarities between files within a repository.\n\n## What?\nA command line tool that receives a directory or a list of files and determines\nthe degree of similarity between them.\n\n## Why?\nThe tool is intended to guide the refactoring efforts of a developer who wishes\nto reduce code duplication within a component and improve its software\narchitecture.\n\nIts development was initiated within the context of the\n[DAT265 - Software Evolution Project](https://pingpong.chalmers.se/public/courseId/9754/lang-en/publicPage.do).\n\n## How?\nThe tool uses the [gensim](https://radimrehurek.com/gensim/) Python library to\ndetermine the similarity between source code files supplied by the user.\nThe default 
supported languages are C, C++, Java, Python and C#.\n\n### Dependencies\nThe following Python packages have to be installed:\n  * nltk\n    * `pip3 install --user nltk`\n  * gensim\n    * `pip3 install --user gensim`\n  * astor\n    * `pip3 install --user astor`\n  * punkt\n    * `python3 -m nltk.downloader punkt`\n\n## Get started\nSuppress the warnings (generated by the libraries used) by running the tool\nas `python3 -W ignore duplicate_code_detection.py` and then supply the necessary\narguments. More details can be found by running the tool with the `--help` option.\n\n**Notice:** Due to the way the models are created, the more source files you\nprovide to the tool, the more accurate the similarity calculations are. In other\nwords, the bigger the project, the more useful the tool is.\n\n### Example\nIf `duplicate-code-detection-tool` is the directory where the tool resides and\n`smartcar_shield/src` contains the repository you want to check for source code\nsimilarities between the files, then you can run the following to get the\nsimilarity report:\n\n`python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/`\n\nThe result should look something like this:\n\n![code duplication tool screenshot](https://i.imgur.com/wi1TnVM.png)\n\n## GitHub Action\n\nThe tool is also available as a [GitHub Action](https://docs.github.com/en/actions) for easy integration\nwith projects hosted on GitHub. An example output of the tool can be seen\n[here](https://github.com/platisd/smartcar_shield/pull/36#issuecomment-778635111).\n\nThe Action is meant to be triggered during **pull requests** to give the developers an impression\nof the **degree of similarity** between the files in the source code. 
Below you will find sample\nworkflow files that illustrate the usage.\n\nDepending on the *size* of your project, you may want to have the tool running multiple times\n(i.e. in different steps) that test specific parts of your repository for duplicate code.\nThis way you will not compare each file in your codebase with everything else and get back more\nmeaningful reports.\n\n### Bare minimum\n\nIn the following example the tool will examine source code (the languages supported by default)\nin the `src/` and `test/ut` directories *relative* to the root directory of your repository.\nThe results will be posted as a comment in the **pull request** that was opened.\n\n```yaml\nname: Duplicate code\n\non: pull_request\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    runs-on: ubuntu-20.04\n    steps:\n      - name: Check for duplicate code\n        uses: platisd/duplicate-code-detection-tool@master\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \"src/, test/ut\"\n```\n\n### Trigger on pull request comment\n\nIf you want to avoid the \"spam\" you should configure the tool to not always run. Specifically, if you\nwish to trigger the Action manually, you can do so by leaving a comment in the pull request.\n\nThe following action will trigger the tool to be run when a comment containing `run_duplicate_code_detection_tool`\nis posted in a pull request. 
The tool will run using the code in the pull request.\n\n```yaml\nname: Duplicate code\n\non: issue_comment\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    # Trigger the tool only when a comment containing the keyword is published in a pull request\n    if: github.event.issue.pull_request \u0026\u0026 contains(github.event.comment.body, 'run_duplicate_code_detection_tool')\n    runs-on: ubuntu-20.04\n    steps:\n      - name: Check for duplicate code\n        uses: platisd/duplicate-code-detection-tool@master\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \".\"\n```\n\n**Important:** Please note that due to the way GitHub Actions work, you will *first* have to merge this into your main\nbranch so it starts taking effect.\n\n### Optional configuration\n\nIt may not make sense to compare all files or to have files with very low similarity reported.\nIn the following workflow, the different *optional* arguments are demonstrated.\n\nFor the various default values, please consult [action.yml](action.yml).\n\n```yaml\nname: Duplicate code\n\non: pull_request\n\njobs:\n  duplicate-code-check:\n    name: Check for duplicate code\n    runs-on: ubuntu-20.04\n    steps:\n      - name: Check for duplicate code\n        uses: platisd/duplicate-code-detection-tool@master\n        with:\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          directories: \"src\"\n          # Ignore the specified directories\n          ignore_directories: \"src/external_libraries\"\n          # Only examine .h and .cpp files\n          file_extensions: \"h, cpp\"\n          # Only report similarities above 5%\n          ignore_below: 5\n          # If a file is more than 70% similar to another, then the job fails\n          fail_above: 70\n          # If a file is more than 15% similar to another, show a warning symbol in the report\n          warn_above: 15\n          # Remove `src/` from the file paths when reporting 
similarities\n          project_root_dir: \"src\"\n          # Remove docstrings from code before analysis\n          # For python source code only. This is checked on a per-file basis\n          only_code: true\n          # Leave only one comment with the report and update it for consecutive runs\n          one_comment: true\n          # The message to be displayed at the start of the report\n          header_message_start: \"The following files have a similarity above the threshold:\"\n```\n## Using duplicate-code-check with pre-commit\nTo use Duplicate Code Detection Tool as a pre-commit hook with [pre-commit](https://pre-commit.com/) add the following to your `.pre-commit-config.yaml` file:\n```yaml\n-   repo: https://github.com/platisd/duplicate-code-detection-tool.git\n    rev: ''  # Use the sha / tag you want to point at\n    hooks:\n    -   id: duplicate-code-detection\n```\n\u003e **_NOTE:_** that this repository sets args: `-f`, if you are configuring duplicate-code-detection-tool using args you'll want to include either `-f` (`--files`) or `-d` (`--directories`).\n\n## Limitations\n\n- `only_code` option only works with python files for now\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplatisd%2Fduplicate-code-detection-tool","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplatisd%2Fduplicate-code-detection-tool","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplatisd%2Fduplicate-code-detection-tool/lists"}