{"id":31772048,"url":"https://github.com/aboutcode-org/ai-gen-code-search","last_synced_at":"2025-10-10T03:55:19.840Z","repository":{"id":263899841,"uuid":"870191398","full_name":"aboutcode-org/ai-gen-code-search","owner":"aboutcode-org","description":"A set of utilities and tools to detect and search AI-generated code ","archived":false,"fork":false,"pushed_at":"2025-05-30T07:31:45.000Z","size":2676,"stargazers_count":8,"open_issues_count":2,"forks_count":2,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-09-17T12:00:36.258Z","etag":null,"topics":["code","genai","matching","search"],"latest_commit_sha":null,"homepage":"https://ai-gen-code-search.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aboutcode-org.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.rst","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-09T15:47:16.000Z","updated_at":"2025-09-09T10:56:53.000Z","dependencies_parsed_at":"2024-11-20T22:55:43.807Z","dependency_job_id":"a2a92236-fd3e-4e54-a078-efdcb775ae29","html_url":"https://github.com/aboutcode-org/ai-gen-code-search","commit_stats":null,"previous_names":["aboutcode-org/ai-gen-code-search"],"tags_count":3,"template":false,"template_full_name":"aboutcode-org/skeleton","purl":"pkg:github/aboutcode-org/ai-gen-code-search","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fai-gen-code-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fai-gen-code-search/tags","releases_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fai-gen-code-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fai-gen-code-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aboutcode-org","download_url":"https://codeload.github.com/aboutcode-org/ai-gen-code-search/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fai-gen-code-search/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279002601,"owners_count":26083426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code","genai","matching","search"],"created_at":"2025-10-10T03:55:13.205Z","updated_at":"2025-10-10T03:55:19.832Z","avatar_url":"https://github.com/aboutcode-org.png","language":"Python","readme":"=========================================\n  AI-Generated Code Search\n=========================================\n\n``Search, detect, and identify AI-generated code.``\n\nThe AI-Generated Code Search project provides open source tools to find code that may have been\ngenerated using LLMs and GPT tools.\n\nGenerative AI engines and Large Language Models (LLMs) are emerging as viable tools for software\ndevelopers to automate 
writing code. These engines and LLMs are trained on publicly available, free\nand open source (FOSS) code.\n\nAI-generated code can inherit the license and vulnerabilities of the FOSS code used for its\ntraining. It is essential and urgent to identify AI-generated source code, as it threatens the\nfoundation of open source and software development and raises major ethical, legal and\nsecurity questions.\n\nBased on the AboutCode track record of creating industry-leading FOSS code origin analysis tools for\nlicense and security, this project delivers a new approach to identify and detect if AI-generated\ncode is derived from existing FOSS, using a new approximate similarity search over code fragments.\n\nWe believe that AI-generated code identification is essential to ensure responsible use of that code\nwhile enjoying the productivity gains from Generative AI for code. There is a massive potential for\nmisuse, malicious or illegal use of such code, and identifying AI-generated code will enable safer,\nmore efficient and responsible use of GenAI to help build better software for the next generation\ninternet, faster and more efficiently.\n\n\nProblem\n-------------\n\n\nFOSS is the foundation of all modern software. It is imperative to know the origin of reused FOSS\ncode and its vulnerabilities and licenses, especially for software supply chain security and\ncybersecurity regulations, like SBOM mandates and Europe's upcoming CRA.\n\nThis applies to AI-generated code derived from FOSS code. The problem is so acute that several large\ncompanies have issued policies prohibiting their programmers from using AI code generation tools.\n\nIdentifying AI-generated code requires matching using a large index of FOSS code. 
Scale is a\nsignificant problem because incumbents need ever larger code indexes for accurate results, leading\nto slower search queries and energy-wasteful infrastructure.\n\n\nExisting approaches match code fragments only exactly and cannot work with AI-generated code as\neach generation may yield different code, resulting in false negatives and false positives. All of\nthis raises serious concerns for organizations that want to use AI-generated code.\n\n\nGoals\n------------\n\nThe goal of this project is to deliver a reusable open source library and the indexing code to create an\nopen dataset to identify if source code is AI-generated and report which FOSS project it derives\nfrom.\n\nThe intent is that AI-Generated Code Search will enhance trust for users searching for code on the\ninternet. This project aims to provide information and validation on the derivative open source\norigin of AI-generated code to identify its vulnerabilities and licenses to mitigate any software\nsupply chain integrity or security risks associated with using AI-generated code.\n\nWe hope that this solution will increase users' trust in using both AI-generated code found on the\nweb and code created by a Generative AI engine or LLM.\n\n\nSolution\n---------\n\nWe need to provide quality and trustworthy approximate code fragment matching on smaller indexes to\naccommodate the variety of AI-generated code and the growing volume of open source code used to\ntrain the backing LLMs.\n\nThe ambition of this project is to provide a new approach to code matching with locality-sensitive\nhashing (LSH) using random projections tunable for precision and recall, and to avoid an\never-increasing code index size. 
This will make it practical for users to identify AI-generated code at\nscale.\n\nProviding a practical solution to discover AI-generated code fragments will improve the\ntrustworthiness of both AI-generated and non-generated code - and open source code in particular -\nby reporting its AI-generated status and ultimate FOSS origin.\n\n\nThis project will enable trustworthy, safer and more efficient use of LLM-based code generation with\nthis critical knowledge, improving overall programmer productivity and adoption of responsible\nAI code generation tools.\n\n\nApproach and Design\n-----------------------\n\nExisting code matching tools match fragments only exactly, which yields low recall: exact\nmatching requires indexing the largest possible number of fragments to avoid false negative matches,\nand this volume leads to more false positives. Exact matching does not work for AI-generated code\nwhere each generation may have small variations given the same input prompt.\n\nOur approach for matching code fragments consists of these three elements:\n\n1. A content-defined chunking algorithm to split code into fragments - an improved alternative to\nthe commonly used winnowing algorithm.\n\n2. The fingerprinting of code with a locality-sensitive hashing (LSH) function for approximate\nmatching of fragments. This fingerprint scheme embeds multiple precision levels in a single bit string to\ntune the search precision and organize the search in rounds of progressively increasing precision.\n\n3. Index-time matching, where each new fragment is added to the index only if it does not already\nmatch an indexed fragment, avoiding large-scale duplication of FOSS code.\n\n\nImplementation\n----------------------------\n\nExisting code fragment search solutions are either closed data, expensive proprietary closed source\nsolutions, or use code indexes that are too big to share. None are designed or adapted to support\nAI-generated code search. 
They are also too expensive to acquire and operate to be practical or accessible for SMEs or\nindividuals.\n\nThis project integrates with the open AboutCode stack's existing extensive and open code analysis\ncapabilities to enable:\n\n1. Holistic and comprehensive knowledge of software code origin, including AI-generated code\n\n2. Confidence that the licensing and known vulnerabilities of the whole codebase are tracked and managed\n\n3. Direct support for CRA compliance, and the implied mandate to track code origin and report\nsecurity issues for software code\n\n\nThese tools are designed to be used either:\n\n- As the building blocks for a larger solution, or\n- As an integrated solution with the open source AboutCode stack.\n\n\nRoadmap\n--------------\n\nThe high-level plan for this project is to:\n\n- Design and implement core fingerprinting and content-defined chunking algorithms\n- Evaluate and tune these algorithms\n- Package these algorithms in a reusable library\n- Design and implement index storage data structures\n- Implement efficient Hamming distance fingerprint matching, i.e., the core search\n- Create an AI-generated code test dataset, reusing existing datasets where relevant\n- Create a reference dataset for indexing (reusing PurlDB and SWH)\n- Create an indexing pipeline and REST API with index-time matching\n- Implement a search results ranking procedure\n- Create a searching pipeline and REST API\n- Create a search query client to search a whole codebase\n- Execute at-scale evaluation and tuning campaigns of the end-to-end solution\n- Package and document the library and whole solution for easy deployment and reuse\n- Deploy a public demo system\n- Present at FOSDEM and webinars for community dissemination\n\n\nThis repository also contains a SameCode library. 
See README_samecode.rst for details.\n\n\nAcknowledgements, Funding, Support and Sponsoring\n--------------------------------------------------------\n\n|europa|\n\n|ngisearch|\n\nFunded by the European Union. Views and opinions expressed are however those of the author(s) only\nand do not necessarily reflect those of the European Union or European Commission. Neither the\nEuropean Union nor the granting authority can be held responsible for them. Funded within the\nframework of the NGI Search project under grant agreement No 101069364\n\n\nThis project is also supported and sponsored by:\n\n- Generous support and contributions from users like you!\n- Microsoft and Microsoft Azure\n- AboutCode ASBL\n\n\n|aboutcode|\n\n\n.. |ngisearch| image:: https://www.ngisearch.eu/download/FlamingoThemes/NGISearch2/NGISearch_logo_tag_icon.svg?rev=1.1\n    :target: https://www.ngisearch.eu/\n    :height: 50\n    :alt: NGI logo\n\n\n.. |ngi| image:: https://ngi.eu/wp-content/uploads/thegem-logos/logo_8269bc6efcf731d34b6385775d76511d_1x.png\n    :target: https://www.ngi.eu/ngi-projects/ngi-search/\n    :height: 37\n    :alt: NGI logo\n\n.. |europa| image:: etc/eu.funded.png\n    :target: https://commission.europa.eu/index_en\n    :height: 120\n    :alt: Europa logo\n\n.. |aboutcode| image:: https://aboutcode.org/wp-content/uploads/2023/10/AboutCode.svg\n    :target: https://aboutcode.org/\n    :height: 30\n    :alt: AboutCode logo\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboutcode-org%2Fai-gen-code-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faboutcode-org%2Fai-gen-code-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboutcode-org%2Fai-gen-code-search/lists"}