{"id":22942443,"url":"https://github.com/stardustdl/codesim","last_synced_at":"2025-08-12T21:32:37.125Z","repository":{"id":38151383,"uuid":"435728759","full_name":"StardustDL/codesim","owner":"StardustDL","description":"A similarity measurer on two programming assignments on Online Judge.","archived":false,"fork":false,"pushed_at":"2023-01-06T02:05:30.000Z","size":33,"stargazers_count":9,"open_issues_count":5,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-09-27T09:11:52.356Z","etag":null,"topics":["code-copying","code-similarity","cpp","nju","nju-cs","plagiarism-detection"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StardustDL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-07T03:22:39.000Z","updated_at":"2024-07-04T03:46:07.000Z","dependencies_parsed_at":"2023-02-05T02:32:01.257Z","dependency_job_id":null,"html_url":"https://github.com/StardustDL/codesim","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StardustDL%2Fcodesim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StardustDL%2Fcodesim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StardustDL%2Fcodesim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StardustDL%2Fcodesim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StardustDL","download_url":"https://codeload.github.com/StardustDL/codesim/tar.gz/refs/he
ads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229710743,"owners_count":18111641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-copying","code-similarity","cpp","nju","nju-cs","plagiarism-detection"],"created_at":"2024-12-14T13:47:50.945Z","updated_at":"2024-12-14T13:47:51.372Z","avatar_url":"https://github.com/StardustDL.png","language":"Python","readme":"![codesim](https://socialify.git.ci/StardustDL/codesim/image?description=1\u0026font=Bitter\u0026forks=1\u0026issues=1\u0026language=1\u0026owner=1\u0026pulls=1\u0026stargazers=1\u0026theme=Light)\n\n![CI](https://github.com/StardustDL/codesim/workflows/CI/badge.svg) ![](https://img.shields.io/github/license/StardustDL/codesim.svg)\n\nA similarity measurer on two programming assignments on Online Judge.\n\n## Install\n\nRecommended OS: Ubuntu 20.04.\n\n\u003e Other Linux distributions are OK, but Windows and macOS with Python 3.10 may fail because codesim depends on [ortools](https://pypi.org/project/ortools/).\n\nInstall Python (\u003e=3.7), pip, g++, and objdump.\n\nAn example script for Ubuntu 20.04:\n\n```sh\n# Ubuntu 20.04 has Python 3.8 installed, use python3 to run python\napt update\n# Install g++ and objdump\napt install build-essential\n```\n\n**Development Way** Install requirements.\n\n```sh\ncd src\npip install -r requirements.txt\n```\n\n**Package Way** Build and install a portable Python Wheel package.\n\n```sh\ncp README.md ./src\ncd src\npython -m pip install --upgrade build twine\npython -m build -o ../dist\npython -m pip install 
../dist/codesim-0.0.1-py3-none-any.whl\n```\n\n## Usage\n\n**Development Way**\n\n```sh\ncd src\npython -m codesim \u003cfile1\u003e \u003cfile2\u003e\n\n# verbose mode to see log\npython -m codesim \u003cfile1\u003e \u003cfile2\u003e [-v/-vv/-vvv..]\n```\n\n**Package Way** If you have installed the built package, you can run it directly.\n\n```sh\npython -m codesim \u003cfile1\u003e \u003cfile2\u003e\n\ncodesim \u003cfile1\u003e \u003cfile2\u003e\n```\n\n## Reference\n\nThe code similarity measuring algorithm originates from\n\n\u003e Jiang Y, Xu C. Needle: Detecting code plagiarism on student submissions[C]//Proceedings of ACM Turing Celebration Conference-China. 2018: 27-32.\n\nSome test cases are from [CodeNet Dataset](https://github.com/IBM/Project_CodeNet).\n\n## Algorithm\n\n\u003e Algorithm implementation details are from [here](http://www.stardustdl.top/posts/projects/codesim/).\n\n### Goals\n\nWe want to measure the similarity between two programming assignments $A$ and $B$ on an Online Judge to find possible plagiarism.\nWe assume that each input program is a single-file C++ program that can be compiled by `g++ -std=c++17 -pedantic`.\n\n### Preprocessing\n\nCompilation and optimization remove comments, macros, and unnecessary code, and ignore local variable names and code formatting.\nMany redundant changes have little or no effect after compiler optimization, so compilation is a good way to normalize a program.\nTo further reduce the impact of obfuscating changes, we use the opcode sequence as a function's fingerprint and ignore operands.\n\nA program is a set of functions, and a function is a sequence of opcodes.\n\nWe first compile the input code with `g++` at the `-O2` optimization level.\nTo keep the generated object file clean, we use the `-c` option to 
prevent generating initialization functions.\n\nThen we use `objdump` to disassemble the object files, and collect and filter the opcode sequences (ignoring `nop` and unrecognized opcodes).\n\n### Similarity\n\nOne common kind of obfuscation is splitting one function into several functions.\nTo address this, we calculate inter-function similarity (which is the program similarity) from intra-function similarity.\nThe main idea is to map each instruction in program $A$ to the most similar instruction in program $B$.\n\n#### Intra-function Similarity\n\nIntra-function similarity models the similarity of an instruction in a specific function context.\n\nLet $f\\in A$ be a function from program $A$, and $g\\in B$ be a function from program $B$.\nIf $f$ can be extended to $g$, then code may have been copied.\nWe use the longest common subsequence (LCS) to calculate the cost of extending $f$ to $g$.\nTo preserve the integrity of $f$ during extension, we take the longest LCS over windows of fixed size $\\omega=\\frac{3}{2}|f|$.\n\n$$\\sigma(f,g)=\\max_{k\\in\\\\{1,2,\\dots,|g|\\\\}}\\text{LCS}(f,g[k:k+\\omega])$$\n\nFormally, the intra-function similarity between $f$ and $g$ is defined as\n\n$$\\rho(f,g)=\\frac{\\max\\\\{\\sigma(f,g),\\sigma(g,f)\\\\}}{\\min\\\\{|f|,|g|\\\\}}$$\n\nFor efficiency, we represent opcodes as integers to speed up comparison, and calculate $\\sigma(f,g)$ by the following formula.\n\n$$\\sigma(f,g)=\\begin{cases}\n    \\text{LCS}(f,g) \u0026 \\omega\\geq|g| \\\\\n    \\max_{k\\in\\\\{1,2,\\dots,|g|-\\omega\\\\}}\\text{LCS}(f,g[k:k+\\omega]) \u0026 \\text{otherwise}\n\\end{cases}$$\n\n#### Inter-function Similarity\n\nWe model the mapping problem as a weighted flow network $G=(V,E,c:E\\rightarrow \\mathbb{N},w:E\\rightarrow \\mathbb{R})$.\n\nLet $n=|A|,m=|B|,i\\in[n], j\\in [m],f_i\\in A,g_j\\in B$.\n\n$$\\begin{aligned}\n    V\u0026=\\\\{s,t\\\\}\\cup\\\\{l_i\\\\}\\cup\\\\{r_j\\\\}\\\\\n    
E\u0026=\\\\{(s,l_i)\\\\}\\cup\\\\{(r_j,t)\\\\}\\cup \\\\{(l_i,r_j)\\\\}\\\\\n    c(s,l_i)\u0026=|f_i|,w(s,l_i)=0\\\\\n    c(r_j,t)\u0026=|g_j|,w(r_j,t)=0\\\\\n    c(l_i,r_j)\u0026=\\sigma(f_i,g_j)\\\\\n    w(l_i,r_j)\u0026=\\text{sigmoid}'(\\rho(f_i,g_j))\n\\end{aligned}$$\n\nWe use the center part of the sigmoid function, $\\text{sigmoid}'(x)=\\text{sigmoid}(\\alpha x+\\beta)$, with constants $\\alpha=2,\\beta=-1/2$ to normalize $\\rho(f_i,g_j)$.\n\nThen the unnormalized inter-function similarity from $A$ to $B$ is defined as\n\n$$\n\\rho'(A\\rightarrow B)=\\frac{\\text{MaximumWeightFlow}(G)}{\\sum_{i}|f_i|}\n$$\n\nWe then normalize $\\rho'(A\\rightarrow B)$ onto $[0,1]$.\n\n$$\n\\rho(A\\rightarrow B)=\\frac{\\rho'(A\\rightarrow B)}{\\text{sigmoid}'(1)}\n$$\n\nFinally, the inter-function similarity between $A$ and $B$ is defined as the average of the two directions.\n\n$$\n\\rho(A,B)=\\frac{\\rho(A\\rightarrow B)+\\rho(B\\rightarrow A)}{2}\n$$\n\nFor efficiency, we calculate $\\sigma(f_i,g_j)$ for all $(i,j)$ pairs in parallel and cache the results, and we replace the real number $w(l_i,r_j)$ with the integer $\\lfloor \\theta \\cdot w(l_i,r_j) \\rfloor$, where $\\theta=10000$.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstardustdl%2Fcodesim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstardustdl%2Fcodesim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstardustdl%2Fcodesim/lists"}