{"id":27881867,"url":"https://github.com/src-d/apollo","last_synced_at":"2025-05-05T05:05:55.347Z","repository":{"id":48959761,"uuid":"108382486","full_name":"src-d/apollo","owner":"src-d","description":"Advanced similarity and duplicate source code proof of concept for our research efforts.","archived":false,"fork":false,"pushed_at":"2022-09-05T11:20:40.000Z","size":202,"stargazers_count":52,"open_issues_count":12,"forks_count":17,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-05-05T05:05:50.031Z","etag":null,"topics":["duplicate-detection","duplicates","python","similarity","similarity-search","source-code"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-10-26T08:21:59.000Z","updated_at":"2023-09-08T17:31:43.000Z","dependencies_parsed_at":"2023-01-17T20:15:31.552Z","dependency_job_id":null,"html_url":"https://github.com/src-d/apollo","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fapollo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fapollo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fapollo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fapollo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d/apollo/tar.gz/
refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252442486,"owners_count":21748451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duplicate-detection","duplicates","python","similarity","similarity-search","source-code"],"created_at":"2025-05-05T05:05:54.723Z","updated_at":"2025-05-05T05:05:55.335Z","avatar_url":"https://github.com/src-d.png","language":"Python","readme":"Apollo\n======\n\nAdvanced code deduplicator. Powered by [source\\{d\\} ML](https://github.com/src-d/ml),\n[source\\{d\\} engine](https://github.com/src-d/engine) and [minhashcuda](https://github.com/src-d/minhashcuda).\nAgnostic to the analysed language thanks to [Babelfish](https://doc.bblf.sh). Python 3, PySpark, CUDA inside.\n\n### What is this?\n\nsource{d}'s effort to research and solve the code deduplication problem. At scale, as usual.\nA [code clone](https://en.wikipedia.org/wiki/Duplicate_code) is several snippets of code with few differences.\nFor now this project focuses on finding near-duplicate projects and files; it will eventually support\nfunctions and snippets as well.\n\n### Should I use it?\n\nIf you've got hundreds of thousands of files or more, consider it. Otherwise, use one of the many\nexisting tools which may already be integrated into your IDE.\n\n### Difference from [src-d/gemini](https://github.com/src-d/gemini)?\n\nThis guy is my brother. Apollo focuses on research, extensibility, flexibility and rapid\nchanges, while Gemini focuses on performance and serious production usage. 
All the proven and \ntested features will eventually be ported to Gemini. At the same time, Gemini may reuse some\nof Apollo's code.\n\n### Algorithm\n\nApollo takes the \"hash'em all\" approach. We extract unordered weighted features from code aka \"weighted bags\",\napply [Weighted MinHash](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf)\nand then build the [Locality Sensitive Hashing index](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).\nAll items which appear in the same hashtable bucket are considered the same. The size of the hash\nand the number of hashtables depend on the [weighted Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance)\nthreshold (hence Weighted MinHash).\n\nThe features include identifiers such as variable, function or class names, literal values and *structural elements*.\nThe latter carry the topological information, and we currently support several variants: \"node2vec\",\n\"deterministic node2vec\" and \"role-children atoms\". Graphlets are upcoming. Different features\nhave different weights which will be tuned by a hyperparameter optimization algorithm or even SGD\n(not yet implemented).\n\nThat's not all, unfortunately! Dumping the huge graph of pairwise similarities is of little practical use.\nWe need to group (cluster) the neighborhoods of densely connected nodes. Apollo solves this problem\nin two steps:\n\n1. Run [connected components](https://en.wikipedia.org/wiki/Connected_component_(graph_theory))\nanalysis to find disjoint parts in the similarity graph.\n2. Run [community detection](https://en.wikipedia.org/wiki/Community_structure) to cluster the components.\nThe clusters may overlap.\n\n### Implementation\n\nApollo is structured as a series of CLI commands. It stores data in [Cassandra](http://cassandra.apache.org/)\n(compatible with [Scylla](http://www.scylladb.com/)) and\nwrites MinHashCuda batches on disk. 
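The MinHash + LSH banding scheme described in the Algorithm section can be illustrated with a toy, plain-Python sketch. This is for intuition only, not Apollo's actual code: Apollo computes *Weighted* MinHash on GPU via minhashcuda, while the snippet below uses unweighted MinHash over small token sets, and all names (`minhash`, `buckets`, the sample documents) are made up for the example.

```python
# Toy MinHash + LSH banding: items sharing any whole band of their
# signature land in the same hashtable bucket and become candidate
# duplicates. Unweighted and in-memory, unlike Apollo's real pipeline.
import random
import zlib

NUM_HASHES = 32           # signature length
BANDS = 8                 # number of LSH hashtables
ROWS = NUM_HASHES // BANDS

random.seed(0)
masks = [random.getrandbits(32) for _ in range(NUM_HASHES)]

def minhash(tokens):
    # one minimum per hash function; crc32 keeps the demo deterministic
    return [min(zlib.crc32(t.encode()) ^ m for t in tokens) for m in masks]

def buckets(signature):
    # split the signature into BANDS bands of ROWS values each
    for b in range(BANDS):
        yield b, tuple(signature[b * ROWS:(b + 1) * ROWS])

docs = {
    'a.py': {'def', 'read', 'path', 'open', 'return'},
    'b.py': {'def', 'read', 'path', 'open', 'return'},  # duplicate of a.py
    'c.py': {'class', 'node', 'graph', 'edge'},         # unrelated
}
index = {}
for name, tokens in docs.items():
    for bucket in buckets(minhash(tokens)):
        index.setdefault(bucket, set()).add(name)

# groups of items that collided in at least one band
candidates = {frozenset(names) for names in index.values() if len(names) > 1}
print(candidates)  # a.py and b.py collide in every band; c.py in none
```

Shrinking BANDS (more rows per band) raises the similarity at which collisions start to appear, which mirrors how the hash size and the number of hashtables are derived from the Jaccard similarity threshold.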
Community detection is delegated to [igraph](http://igraph.org/python/).\n\n* `resetdb` erases (if present) and initializes a Cassandra keyspace.\n* `bags` extracts the features, stores them in the database and writes MinHashCuda batches on disk.\nRuns source{d} engine through PySpark.\n* `hash` performs the hashing, writes the hashtables to the database and hashing parameters on disk\nin [Modelforge](https://github.com/src-d/modelforge) format.\n* `cc` fetches the buckets, runs the connected component analysis and writes the result on disk in Modelforge\nformat. Uses PySpark.\n* `dumpcc` outputs the connected components.\n* `cmd` reads the connected components and performs the community detection (by default, walktrap).\nUses PySpark.\n* `query` outputs items similar to the specified one. For files, either the path or the sha1 is accepted.\n* `dumpcmd` outputs the groups of similar items.\n\n### Installation\n\n```\nmount -o bind /path/to/sourced-ml bundle/ml\nmount -o bind /path/to/spark-2.2.0-bin-hadoop2.7 bundle/spark\nmount -o bind /path/to/sourced-engine bundle/engine\ndocker build -t srcd/apollo .\ndocker run --name scylla -p 9042:9042 -v /var/lib/scylla:/var/lib/scylla -d scylladb/scylla --developer-mode=1\ndocker run -it --rm --link scylla srcd/apollo resetdb --cassandra scylla\ndocker run -d --name bblfshd --privileged -p 9432:9432 -v /var/lib/bblfshd:/var/lib/bblfshd bblfsh/bblfshd\ndocker exec -it bblfshd bblfshctl driver install --all\n```\n\nYou are going to need [grip](https://github.com/joeyespo/grip) to instantly render Markdown reports\nin your browser. There are multiple Docker options available, e.g.\n[1](https://github.com/psycofdj/docker-grip), [2](https://github.com/fstab/docker-grip),\n[3](https://github.com/kba/grip-docker).\n\n### Contributions\n\n...are welcome! 
See [CONTRIBUTING](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md).\n\n### License\n\n[GPL](LICENSE.md).\n\n## Docker command snippets\n\n### Bags\n\n```\ndocker run -it --rm -v /path/to/io:/io --link bblfshd --link scylla srcd/apollo bags -r /io/siva \\\n--bow /io/bags/bow.asdf --docfreq /io/bags/docfreq.asdf -f id lit uast2seq --uast2seq-seq-len 4 \\\n-l Java Python -s 'local[*]' --min-docfreq 5 --bblfsh bblfshd --cassandra scylla --persist MEMORY_ONLY \\\n--config spark.executor.memory=4G spark.driver.memory=10G spark.driver.maxResultSize=4G\n```\n\n### Hash\n\n```\ndocker run -it --rm -v /path/to/io:/io --link scylla srcd/apollo hash /io/batches/bow*.asdf -p /io/bags/params.asdf \\\n-t 0.8 --cassandra scylla\n```\n\n### Query sha1\n\n```\ndocker run -it --rm -v /path/to/io:/io --link scylla srcd/apollo query -i \u003csha1\u003e --precise \\\n--docfreq /io/bags/docfreq.asdf -t 0.8 --cassandra scylla\n```\n\n### Query file\n\n```\ndocker run -it --rm -v /path/to/io:/io -v .:/q --link bblfshd --link scylla srcd/apollo query \\\n-f /q/myfile.java --bblfsh bblfshd --cassandra scylla --precise --docfreq /io/docfreq.asdf \\\n--params /io/params.asdf -t 0.9 | grip -b -\n```\n\n### Connected components\n\n```\ndocker run -it --rm -v /path/to/io:/io --link scylla srcd/apollo cc -o /io/ccs.asdf\n```\n\n### Dump connected components\n\n```\ndocker run -it --rm -v /path/to/io:/io srcd/apollo dumpcc -o /io/ccs.asdf\n```\n\n### Community detection\n\n```\ndocker run -it --rm -v /path/to/io:/io srcd/apollo cmd -i /io/ccs.asdf -o /io/communities.asdf -s 'local[*]'\n```\n\n### Dump communities (final report)\n\n```\ndocker run -it --rm -v /path/to/io:/io srcd/apollo dumpcmd /io/communities.asdf | grip -b 
-\n```\n","funding_links":[],"categories":["Software"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fapollo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fapollo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fapollo/lists"}