{"id":16509913,"url":"https://github.com/papachristoumarios/sade","last_synced_at":"2026-03-12T09:24:37.974Z","repository":{"id":40985586,"uuid":"144055976","full_name":"papachristoumarios/sade","owner":"papachristoumarios","description":"Code for paper:  Software clusterings with vector semantics and the call graph","archived":false,"fork":false,"pushed_at":"2022-06-21T21:27:56.000Z","size":1964,"stargazers_count":9,"open_issues_count":2,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-06-28T11:43:44.733Z","etag":null,"topics":["c","cflow","cscout","doc2vec","layering","layering-violations","natural-language-processing","refactoring","word-embeddings"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/papachristoumarios.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-08-08T18:59:13.000Z","updated_at":"2024-04-23T16:17:57.000Z","dependencies_parsed_at":"2022-09-02T07:40:41.625Z","dependency_job_id":null,"html_url":"https://github.com/papachristoumarios/sade","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/papachristoumarios/sade","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/papachristoumarios%2Fsade","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/papachristoumarios%2Fsade/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/papachristoumarios%2Fsade/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/papachristoumarios%2Fsade/manifests","owner
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/papachristoumarios","download_url":"https://codeload.github.com/papachristoumarios/sade/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/papachristoumarios%2Fsade/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013684,"owners_count":26085390,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","cflow","cscout","doc2vec","layering","layering-violations","natural-language-processing","refactoring","word-embeddings"],"created_at":"2024-10-11T15:53:11.774Z","updated_at":"2025-10-13T04:33:50.845Z","avatar_url":"https://github.com/papachristoumarios.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :crystal_ball: SADE: Software Architecture with Document Embeddings\n\n## :question: What is SADE?\n\nSADE (short for Software Architecture with Document Embeddings) is a library for studying and recovering the architecture of complex software systems. Our approach combines document embeddings of the source code, produced by **Doc2Vec**, with the existing structure of the codebase, captured by the **call graphs** that **CScout** produces.
\n\nDocument embeddings have never been used before to study the architecture of a software system. We construct a geometric graph on a pseudo-metric space and iteratively form communities in this graph using the **Louvain algorithm**, creating clusters that represent software modules. The proposed evaluation metrics for software clusterings are **stability**, **authoritativeness** (closeness to the ground truth) and **extremity** (avoiding the creation of very small or very large clusters).\n\nThis project was prepared for the **ESEC/FSE 2019 Student Research Competition**. You can read the paper [here](https://dl.acm.org/citation.cfm?id=3342483) as well as the [slides](https://github.com/papachristoumarios/software-clusterings-with-vector-semantics-and-call-graph/raw/master/slides/slides.pdf).\n\nThe software is released under the MIT License.\n\n## :nut_and_bolt: Installation\n\nInstalling system/user-wide (with sudo if system-wide):\n\n```bash\nmake install\n```\n\nInstalling in a virtual environment using `virtualenv`:\n\n```bash\nmake install_venv\n```\n\n\n\n## :hammer_and_wrench: Usage\n\nWith SADE you can analyze your C project using the components it provides. The steps below describe how. We use [CScout](https://github.com/dspinellis/cscout) for static graph analysis.\n\n\n\n### Step 1: Generate Grains\n\nTo define the modules of the system, each file must map to a grain. You should generate a `modules.json` file with the following format:\n\n```json\n{\n    \"boo.c\" : \"boograin\",\n    \"foo.c\" : \"foograin\"\n}\n```\n\nYou can do this manually, but if the project is strictly organized into grains (e.g. one-top directories), you can use the `autogen_module` tool to generate the module definition:\n\n```bash\nautogen_module.py --suffix .c --suffix .h -d 1 \u003emodules.json\n```\n\nwhere `-d` specifies the depth at which the modules are split.\n
An example is located at `examples/linux/modules.json`.\n\nTo analyze projects in other languages, set the `--suffix` arguments accordingly. For example, for a C++ project:\n\n```bash\nautogen_module.py --suffix .cpp --suffix .h -d 1 \u003emodules.json\n```\n\n\n\n\n\n### Step 2: Generate document embeddings\n\nAfter creating the `modules.json` definitions file, you can proceed to generate the Doc2Vec embeddings with Gensim and spaCy. The input is preprocessed with the following pipeline:\n\n1. `autogen_module.py --suffix .c --suffix .h -d 1 \u003emodules.json`\n2. Stop-word Removal\n3. Tokenization\n4. Lemmatization\n\nYou can generate the embeddings with the `embeddings.py` script:\n\n```bash\nembeddings.py -m modules.json -o embeddings.bin -p params.json\n```\n\nYou can configure the model further by passing its parameters as a `params.json` file with the `-p` flag.\n\nAn example `params.json` file:\n\n```json\n{\n    \"size\": 200,\n    \"epochs\": 1000,\n    \"window\": 10,\n    \"min_count\": 10,\n    \"workers\": 7,\n    \"sample\": 1E-3\n}\n```\n\n\n\n#### Pretrained Models\n\nFor the purposes of our research we have trained the document embeddings for the Linux Kernel Codebase v4.21. You can download the embeddings, produced with `gensim`, from the links below.\n\n1. [Document Embeddings (One-top directory Level without Identifier Splitting)](https://pithos.okeanos.grnet.gr/public/MjvTbBkLWC6tSlTmK1yiq3)\n2. [Document Embeddings (One-top directory Level with Identifier Splitting)](https://pithos.okeanos.grnet.gr/public/TAEsZW4IJZgrN9aanI11a7)\n3. [Document Embeddings (Source Code File Level)](https://pithos.okeanos.grnet.gr/public/3cEM9HxM7KG7AEdlkKvcA4)\n\n\n\n### Step 3: Generating the Call Graph through CScout\n\nGenerate the `make.cs` file via:\n\n```bash\ncsmake\n```\n\nIf you have a multi-core machine, you can use the usual `-j` flag:\n\n```bash\ncsmake -j7\n```\n\nAfter generating the `make.cs` file, you can analyze it with CScout via\n\n```bash\ncscout make.cs\n```\n\nCScout may complain about undefined names. In that case, place their respective definitions in `cscout-pre-defs.h` (before running `csmake`) and in `cscout-post-defs.h`. For more information, please refer to the [CScout documentation](https://www2.dmst.aueb.gr/dds/cscout/doc).\n\nAn example of such a configuration for the Linux Kernel 4.x Codebase is located at `examples/linux`.\n\nFinally, you can send `GET` requests to CScout and get responses through its REST API.\n\nFor example:\n\n```bash\n# Call graph (functions)\ncurl -X GET \"http://localhost:8081/cgraph.txt\" \u003egraph.txt\n```\n\nYou can get all the call graphs by running `scripts/get_graphs_rest.sh`.\n\n\n\n#### Pre-generated call graph for Linux Kernel 4.21\n\nA pre-generated call graph of Linux Kernel 4.21 (20.3 million lines of source code) can be found [here](https://zenodo.org/record/2652487). The call graphs have the following format:\n\n```\nu1 v1\nu2 v2\n// more edges\nun vn\n```\n\nwhere `ui vi` is a directed edge from `ui` to `vi`.\n\nThe call graph was generated on an Intel(R) Xeon(R) CPU E5-1410 0 @ 2.80GHz with 72G of RAM.\n\n\n\n### Step 4: Getting the layers configuration\n\nAfter generating the embeddings and the call graph, you can use the `layerize.py` tool to obtain the proposed layered architecture. Run:\n\n```bash\nlayerize.py -e embeddings.bin -g graph.txt \u003elayers.bunch\n```\n\nto export it to a `.bunch` file.\n
The format of a bunch file is:\n\n```\nLayer0= File1, File2, File3\n```\n\nTo export to JSON instead, use:\n\n```bash\nlayerize.py -e embeddings.bin -g graph.txt --export json \u003elayers.json\n```\n\n\n\n### Step 5 (Optional): Evaluation of Results\n\n#### Authoritativeness - Comparing to Ground Truth\n\nOnce you have generated the layered architecture, and if an existing architecture can serve as ground truth (such as the Linux layers located at `examples/linux/ground_truth.json`), you can compare the two architectures with the MoJoFM metric provided in the `mojo` package:\n\n```python\nimport mojo\nmojo.mojo('proposed_layers.bunch', 'ground_truth.bunch', '-fm')\n```\n\n\n\n## :pick: Technologies Used\n\nSADE was developed in Python 3.x using the following libraries:\n\n* Gensim\n* spaCy\n* sklearn\n* NetworkX\n\n\n\n## Using SADE to analyze projects in other programming languages\n\n### Generating the call graph\n\nYou can use SADE with a different static call graph analyzer for your preferred language. SADE expects the call graph as a list of edges, one per line, of the form\n\n```\nfoo.c boo.c\n```\n\nwhich indicates a **directed** edge from `foo.c` to `boo.c`.
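
As a minimal sketch of this edge-list format, the snippet below reads lines of the form `u v` into a successor map using only the Python standard library. The helper name `read_edge_list` and the sample file names are hypothetical and are not part of SADE; skipping blank and `//` comment lines is an assumption about how such files might be cleaned.

```python
from collections import defaultdict


def read_edge_list(lines):
    # Build a successor map {source: set(targets)} from lines of the form 'u v'.
    # Hypothetical helper for illustration; not part of SADE.
    graph = defaultdict(set)
    for line in lines:
        parts = line.split()
        # Skip blank lines, '//' comment lines, and anything that is not a pair.
        if len(parts) != 2 or parts[0].startswith('//'):
            continue
        source, target = parts
        graph[source].add(target)
        # Make sure nodes without outgoing edges also appear in the map.
        graph.setdefault(target, set())
    return dict(graph)


# Sample edges; each entry is a directed edge from the first name to the second.
sample = ['foo.c boo.c', 'boo.c bar.c', '', 'foo.c bar.c']
call_graph = read_edge_list(sample)
print(sorted(call_graph['foo.c']))
```

The resulting mapping keeps the direction of every edge, so it could then be loaded into a graph library of your choice.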
\n\n### Module Definitions\n\nThe module definitions are, as explained above, contained in JSON files.\n\n\n\n### Clustering Results\n\nThe clustering results are, as explained above, contained in JSON or Bunch files.\n\n\n\n\n\n## Citing the Project\n\nYou can cite the project using the following bibliographic entries:\n\n```latex\n@inproceedings{sade,\n    title={Software Clusterings with Vector Semantics and the Call Graph},\n    author={Papachristou, Marios},\n    year={2019},\n    booktitle={ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)},\n    organization={Association for Computing Machinery}\n}\n\n@misc{call_graph,\n    title={Linux Kernel 4.21 Call Graph},\n    DOI={10.5281/zenodo.2652487},\n    publisher={Zenodo},\n    author={Papachristou, Marios},\n    year={2019}\n}\n\n@misc{sade_source_code,\n    title={Software Architecture with Document Embeddings and the Call Graph Source Code},\n    DOI={10.5281/zenodo.2673033},\n    publisher={Zenodo},\n    author={Papachristou, Marios},\n    year={2019}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpapachristoumarios%2Fsade","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpapachristoumarios%2Fsade","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpapachristoumarios%2Fsade/lists"}