{"id":15398366,"url":"https://github.com/kirillbobyrev/code-clone-detection-llvm-devmtg15-poster","last_synced_at":"2025-04-16T01:19:41.530Z","repository":{"id":77326289,"uuid":"44608591","full_name":"kirillbobyrev/code-clone-detection-llvm-devmtg15-poster","owner":"kirillbobyrev","description":"Code Clone Detection in Clang Static Analyzer poster for LLVM Developers' Meeting 2015.","archived":false,"fork":false,"pushed_at":"2021-05-19T20:07:41.000Z","size":334,"stargazers_count":8,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-29T03:23:01.125Z","etag":null,"topics":["c-plus-plus","clang","clang-static-analyzer","llvm","poster","research","static-analysis"],"latest_commit_sha":null,"homepage":"https://github.com/kirillbobyrev/code-clone-detection-llvm-devmtg15-poster/blob/main/poster.pdf","language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kirillbobyrev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-10-20T13:35:59.000Z","updated_at":"2023-04-14T05:32:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"f8fa8024-d784-40e9-bd2e-a241a48dbde5","html_url":"https://github.com/kirillbobyrev/code-clone-detection-llvm-devmtg15-poster","commit_stats":{"total_commits":24,"total_committers":4,"mean_commits":6.0,"dds":0.5416666666666667,"last_synced_commit":"dcb08274cbce4a9ea3fc5da1ed6292d81d8b8d53"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/kirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kirillbobyrev","download_url":"https://codeload.github.com/kirillbobyrev/code-clone-detection-llvm-devmtg15-poster/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249178869,"owners_count":21225449,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-plus-plus","clang","clang-static-analyzer","llvm","poster","research","static-analysis"],"created_at":"2024-10-01T15:42:52.854Z","updated_at":"2025-04-16T01:19:41.524Z","avatar_url":"https://github.com/kirillbobyrev.png","language":"TeX","funding_links":[],"categories":[],"sub_categories":[],"readme":"Table of Contents\n=================\n\n   * [code-clone-detection-llvm-devmtg15-poster](#code-clone-detection-llvm-devmtg15-poster)\n      * [Description](#description)\n      * [Google Summer of Code 2015](#google-summer-of-code-2015)\n      * [Motivation](#motivation)\n      * [Key results](#key-results)\n      * [Future work](#future-work)\n      * [Acknowledgement](#acknowledgement)\n      * [Similar pieces of code found 
using proposed\n        technique](#similar-pieces-of-code-found-using-proposed-technique)\n         * [OpenSSL](#openssl)\n            * [crypto/bf/bf_cfb64.c](#cryptobfbf_cfb64c)\n            * [crypto/ec/ecp_smpl.c](#cryptoececp_smplc)\n            * [crypto/rsa/rsa_x931g.c](#cryptorsarsa_x931gc)\n         * [Vim](#vim)\n            * [src/farsi.c](#srcfarsic)\n            * [evalfunc.c](#evalfuncc)\n         * [Git](#git)\n            * [remote-curl.c](#remote-curlc)\n            * [xdiff/xprepare.c](#xdiffxpreparec)\n      * [Notes](#notes)\n\n# code-clone-detection-llvm-devmtg15-poster\n\nThis repository contains LaTeX source code for the \"Code Clone Detection in\nClang Static Analyzer\" poster.\n\n[Compiled PDF version](./poster.pdf).\n\n## Description\n\nThe poster was prepared for [LLVM Developers' Meeting\n2015](http://llvm.org/devmtg/2015-10/). The LLVM Developers' Meeting is the\nlargest conference for compiler specialists from all over the world and is held\nevery year. Dozens of engineers working at Google, Apple and Intel attend the\nconference and exchange valuable experience.\n\nThis research resulted in `alpha.clone.CloneChecker` in [Clang Static\nAnalyzer](http://clang-analyzer.llvm.org/index.html). This check detects only\na subset of what the original implementation was able to detect, but it is\nmore stable and production-ready.\n\n[Unit\ntests](https://github.com/kirillbobyrev/clang/tree/master/test/Analysis/copypaste)\ngive a good overview of the code pieces that the current upstream\nimplementation can detect. `CloneChecker` is part of Clang Static Analyzer,\nwhich is shipped with the Clang binary. 
To run this check on a custom file, install\nClang and run the following command:\n\n`$ clang++ -cc1 -analyze -analyzer-checker=alpha.clone.CloneChecker source.cpp`\n\n## Google Summer of Code 2015\n\nThe work described in this poster was done as part of [Google Summer of Code\n2015](https://developers.google.com/open-source/gsoc/) and supported by Google.\n\nI worked with the LLVM Community under the mentorship of [Vassil\nVassilev](https://github.com/vgvassilev)\n([CERN](https://home.cern/)/[FNAL](http://www.fnal.gov/)). Vassil is a well-known\ncompiler specialist and the creator of [Cling](https://root.cern.ch/cling), an\ninteractive C++ interpreter used at CERN.\n\nThe [GSoC project\npage](https://www.google-melange.com/archive/gsoc/2015/orgs/llvm/projects/arcadiaq.html)\nsadly doesn't contain much information, because I didn't know that a large\npart of the summary I wrote wouldn't be accessible from there. This poster,\nthough, contains an extensive overview of the work done during Summer 2015.\n\n## Motivation\n\nAlthough Code Clone Detection has been quite a popular topic over the past\nyears, there was no good solution that was extensible, open, easy to use and\nactually did its job well. While some solutions existed, most of them\ndidn't take advantage of modern compiler technologies, and very naive attempts\ndetected only a small part of the widespread code clones.\n\nNumerous research papers proposed a text-based approach, which is neither\nscalable nor efficient. Only a few attempts (most notably, [this\none](http://www.semanticdesigns.com/Company/Publications/ICSM98.pdf)) focused\non AST analysis, which gives significantly better results. 
Reusing the Clang\ninfrastructure, which provides a rich AST for C, C++ and\nObjective-C, leads to an even better solution.\n\n## Key results\n\nReusing the Clang infrastructure makes it possible to parse up-to-date C and\nC++ dialects and to detect very sophisticated code clone instances.\n\nThe poster shows that the proposed implementation outperforms existing\nsolutions (those I am aware of) in performance, range of detected code clones\nand usability. Many approaches that aim for speed are not able to detect most\nType II and Type III clones (please refer to the Code Clone Taxonomy in\n[Notes](#notes)). Those that try to detect more types of similarity have serious\nperformance issues. My work is efficient without limiting\ndetection capabilities.\n\nThe code used for this paper is available in [my fork of the Clang\nrepository](https://github.com/omtcyfz/clang/tree/CloneDetection).\n\nThe following table shows that even a naive implementation of the Code Clone\ncheck is able to process huge open-source projects and find many similar pieces\nof code:\n\n|Project|Normal build time|Build time with BasicCloneCheck|Clones found|\n|---|---|---|---|\n|OpenSSL|1m26s|9m27s|180|\n|Git|0m26s|2m46s|34|\n|SDL|0m26s|1m59s|170|\n\n## Future work\n\nDuring Summer 2016 I was an intern at Google Munich, where I introduced major\nimprovements to [clang-rename](http://clang.llvm.org/extra/clang-rename.html)\nand started clang-refactor (see [design\ndoc](https://docs.google.com/document/d/1w9IkR0_Gqmd5w4CZ2t_ZDZrNLYVirQPyMS41533HQZE)\nfor reference). Therefore I was unable to continue coding on this project and\nonly participated in a few discussions. 
[Raphael\nIsemann](https://github.com/Teemperor), under the mentorship of Vassil Vassilev\nand with the help of engineers from the Apple Static Analysis team, did a great job\nimproving the current infrastructure (see [GSoC project\npage](https://docs.google.com/document/d/1w9IkR0_Gqmd5w4CZ2t_ZDZrNLYVirQPyMS41533HQZE))\nand finally pushing the code to the Clang repository.\n\nClang Static Analyzer isn't able to pass information between translation units\nand this, unfortunately, is a huge limitation for Code Clone Detection because\nof its nature: most clones end up in different translation units and are not\nreported by the check. A proper solution has to overcome this limitation. My\nwork on clang-refactor might become useful\nfor an efficient solution.\n\n## Acknowledgement\n\nI would like to thank Vassil Vassilev for guidance and support, the LLVM\nCommunity for great suggestions and all the work done towards supporting new\ncontributors and, of course, Google, for funding and for creating a great\nopportunity for students from all over the world.\n\n## Similar pieces of code found using proposed technique\n\nCompared to C++ projects, projects written in C suffer significantly more from\ncode duplication.\n\nThe following pieces of code can be easily wrapped into functions to prevent\npotential errors, such as fixing a bug in one of the clone instances and\nignoring the others.\n\n### OpenSSL\n\nA somewhat more thorough analysis identified around 500 sufficiently large\ncode clones (see the following examples).\n\nProject commit used for analysis: b77b6127e8de38726f37697bbbc736ced7b49771.\n\n#### crypto/bf/bf_cfb64.c\n\n```c\n    if (encrypt) {\n        while (l--) {\n            if (n == 0) {\n                n2l(iv, v0);\n                ti[0] = v0;\n                n2l(iv, v1);\n                ti[1] = v1;\n                BF_encrypt((BF_LONG *)ti, schedule);\n                iv = (unsigned char *)ivec;\n                t = ti[0];\n              
  l2n(t, iv);\n                t = ti[1];\n                l2n(t, iv);\n                iv = (unsigned char *)ivec;\n            }\n            c = *(in++) ^ iv[n];\n            *(out++) = c;\n            iv[n] = c;\n            n = (n + 1) \u0026 0x07;\n        }\n    } else {\n        while (l--) {\n            if (n == 0) {\n                n2l(iv, v0);\n                ti[0] = v0;\n                n2l(iv, v1);\n                ti[1] = v1;\n                BF_encrypt((BF_LONG *)ti, schedule);\n                iv = (unsigned char *)ivec;\n                t = ti[0];\n                l2n(t, iv);\n                t = ti[1];\n                l2n(t, iv);\n                iv = (unsigned char *)ivec;\n            }\n            cc = *(in++);\n            c = iv[n];\n            iv[n] = cc;\n            *(out++) = c ^ cc;\n            n = (n + 1) \u0026 0x07;\n        }\n    }\n```\n\n#### crypto/ec/ecp_smpl.c\n\n```c\n    if (!b-\u003eZ_is_one) {\n        if (!field_sqr(group, Zb23, b-\u003eZ, ctx))\n            goto end;\n        if (!field_mul(group, tmp1, a-\u003eX, Zb23, ctx))\n            goto end;\n        tmp1_ = tmp1;\n    } else\n        tmp1_ = a-\u003eX;\n    if (!a-\u003eZ_is_one) {\n        if (!field_sqr(group, Za23, a-\u003eZ, ctx))\n            goto end;\n        if (!field_mul(group, tmp2, b-\u003eX, Za23, ctx))\n            goto end;\n        tmp2_ = tmp2;\n    } else\n        tmp2_ = b-\u003eX;\n\n    /* compare  X_a*Z_b^2  with  X_b*Z_a^2 */\n    if (BN_cmp(tmp1_, tmp2_) != 0) {\n        ret = 1;                /* points differ */\n        goto end;\n    }\n\n    if (!b-\u003eZ_is_one) {\n        if (!field_mul(group, Zb23, Zb23, b-\u003eZ, ctx))\n            goto end;\n        if (!field_mul(group, tmp1, a-\u003eY, Zb23, ctx))\n            goto end;\n        /* tmp1_ = tmp1 */\n    } else\n        tmp1_ = a-\u003eY;\n    if (!a-\u003eZ_is_one) {\n        if (!field_mul(group, Za23, Za23, a-\u003eZ, ctx))\n            goto end;\n        if 
(!field_mul(group, tmp2, b-\u003eY, Za23, ctx))\n            goto end;\n        /* tmp2_ = tmp2 */\n    } else\n        tmp2_ = b-\u003eY;\n```\n\n#### crypto/rsa/rsa_x931g.c\n```c\n    if (Xp \u0026\u0026 rsa-\u003ep == NULL) {\n        rsa-\u003ep = BN_new();\n        if (rsa-\u003ep == NULL)\n            goto err;\n\n        if (!BN_X931_derive_prime_ex(rsa-\u003ep, p1, p2,\n                                     Xp, Xp1, Xp2, e, ctx, cb))\n            goto err;\n    }\n\n    if (Xq \u0026\u0026 rsa-\u003eq == NULL) {\n        rsa-\u003eq = BN_new();\n        if (rsa-\u003eq == NULL)\n            goto err;\n        if (!BN_X931_derive_prime_ex(rsa-\u003eq, q1, q2,\n                                     Xq, Xq1, Xq2, e, ctx, cb))\n            goto err;\n    }\n```\n\n### Vim\n\nPatch 8.0.0071.\n\nAround 300 similar code pieces in total found in Vim.\n\n#### src/farsi.c\n\n```c\n    // Chunk 1.\n    switch (gchar_cursor())\n    {\n\tcase ALEF:\n\t\ttempc = ALEF_;\n\t\tbreak;\n\tcase ALEF_U_H:\n\t\ttempc = ALEF_U_H_;\n\t\tbreak;\n\tcase _AYN:\n\t\ttempc = _AYN_;\n\t\tbreak;\n\tcase AYN:\n\t\ttempc = AYN_;\n\t\tbreak;\n\tcase _GHAYN:\n\t\ttempc = _GHAYN_;\n\t\tbreak;\n\tcase GHAYN:\n\t\ttempc = GHAYN_;\n\t\tbreak;\n\tcase _HE:\n\t\ttempc = _HE_;\n\t\tbreak;\n\tcase YE:\n\t\ttempc = YE_;\n\t\tbreak;\n\tcase IE:\n\t\ttempc = IE_;\n\t\tbreak;\n\tcase TEE:\n\t\ttempc = TEE_;\n\t\tbreak;\n\tcase YEE:\n\t\ttempc = YEE_;\n\t\tbreak;\n\tdefault:\n\t\ttempc = 0;\n    }\n\n...\n    // Chunk 2.\n    switch (gchar_cursor())\n    {\n\tcase ALEF:\n\t\ttempc = ALEF_;\n\t\tbreak;\n\tcase ALEF_U_H:\n\t\ttempc = ALEF_U_H_;\n\t\tbreak;\n\tcase _AYN:\n\t\ttempc = _AYN_;\n\t\tbreak;\n\tcase AYN:\n\t\ttempc = AYN_;\n\t\tbreak;\n\tcase _GHAYN:\n\t\ttempc = _GHAYN_;\n\t\tbreak;\n\tcase GHAYN:\n\t\ttempc = GHAYN_;\n\t\tbreak;\n\tcase _HE:\n\t\ttempc = _HE_;\n\t\tbreak;\n\tcase YE:\n\t\ttempc = YE_;\n\t\tbreak;\n\tcase IE:\n\t\ttempc = IE_;\n\t\tbreak;\n\tcase TEE:\n\t\ttempc = 
TEE_;\n\t\tbreak;\n\tcase YEE:\n\t\ttempc = YEE_;\n\t\tbreak;\n\tdefault:\n\t\ttempc = 0;\n    }\n```\n\n#### evalfunc.c\n\n```c\n    // Code chunk 1.\n    /* Optional arguments: line number to stop searching and timeout. */\n    if (argvars[1].v_type != VAR_UNKNOWN \u0026\u0026 argvars[2].v_type != VAR_UNKNOWN)\n    {\n\tlnum_stop = (long)get_tv_number_chk(\u0026argvars[2], NULL);\n\tif (lnum_stop \u003c 0)\n\t    goto theend;\n#ifdef FEAT_RELTIME\n\tif (argvars[3].v_type != VAR_UNKNOWN)\n\t{\n\t    time_limit = (long)get_tv_number_chk(\u0026argvars[3], NULL);\n\t    if (time_limit \u003c 0)\n\t\tgoto theend;\n\t}\n#endif\n    }\n\n...\n  // Code chunk 2.\n\tif (argvars[5].v_type != VAR_UNKNOWN)\n\t{\n\t    lnum_stop = (long)get_tv_number_chk(\u0026argvars[5], NULL);\n\t    if (lnum_stop \u003c 0)\n\t\tgoto theend;\n#ifdef FEAT_RELTIME\n\t    if (argvars[6].v_type != VAR_UNKNOWN)\n\t    {\n\t\ttime_limit = (long)get_tv_number_chk(\u0026argvars[6], NULL);\n\t\tif (time_limit \u003c 0)\n\t\t    goto theend;\n\t    }\n#endif\n\t}\n```\n\n### Git\n\nCommit be5a750939c212bc0781ffa04fabcfd2b2bd744e.\n\n#### remote-curl.c\n\n```c\n       } else if (!strcmp(name, \"cloning\")) {\n\t\tif (!strcmp(value, \"true\"))\n\t\t\toptions.cloning = 1;\n\t\telse if (!strcmp(value, \"false\"))\n\t\t\toptions.cloning = 0;\n\t\telse\n\t\t\treturn -1;\n\t\treturn 0;\n\t} else if (!strcmp(name, \"update-shallow\")) {\n\t\tif (!strcmp(value, \"true\"))\n\t\t\toptions.update_shallow = 1;\n\t\telse if (!strcmp(value, \"false\"))\n\t\t\toptions.update_shallow = 0;\n\t\telse\n\t\t\treturn -1;\n\t\treturn 0;\n```\n\n#### xdiff/xprepare.c\n\nThis one looks especially weird.\n\n```c\n\tif ((mlim = xdl_bogosqrt(xdf1-\u003enrec)) \u003e XDL_MAX_EQLIMIT)\n\t\tmlim = XDL_MAX_EQLIMIT;\n\tfor (i = xdf1-\u003edstart, recs = \u0026xdf1-\u003erecs[xdf1-\u003edstart]; i \u003c= xdf1-\u003edend; i++, recs++) {\n\t\trcrec = cf-\u003ercrecs[(*recs)-\u003eha];\n\t\tnm = rcrec ? 
rcrec-\u003elen2 : 0;\n\t\tdis1[i] = (nm == 0) ? 0: (nm \u003e= mlim) ? 2: 1;\n\t}\n\n\tif ((mlim = xdl_bogosqrt(xdf2-\u003enrec)) \u003e XDL_MAX_EQLIMIT)\n\t\tmlim = XDL_MAX_EQLIMIT;\n\tfor (i = xdf2-\u003edstart, recs = \u0026xdf2-\u003erecs[xdf2-\u003edstart]; i \u003c= xdf2-\u003edend; i++, recs++) {\n\t\trcrec = cf-\u003ercrecs[(*recs)-\u003eha];\n\t\tnm = rcrec ? rcrec-\u003elen1 : 0;\n\t\tdis2[i] = (nm == 0) ? 0: (nm \u003e= mlim) ? 2: 1;\n\t}\n```\n\n## Notes\n\n[0] For a reference on the Code Clone Taxonomy, see a [fairly recent\npaper](http://www.sciencedirect.com/science/article/pii/S0167642309000367) by\nRoy et al.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkirillbobyrev%2Fcode-clone-detection-llvm-devmtg15-poster/lists"}