{"id":18026862,"url":"https://github.com/yegor256/cam","last_synced_at":"2025-03-27T01:31:30.138Z","repository":{"id":40426965,"uuid":"371437133","full_name":"yegor256/cam","owner":"yegor256","description":"Classes and Metriсs (CaM): a dataset of Java classes from public open-source GitHub repositories","archived":false,"fork":false,"pushed_at":"2024-04-30T16:52:18.000Z","size":2318,"stargazers_count":18,"open_issues_count":37,"forks_count":31,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-05-01T23:12:14.168Z","etag":null,"topics":["cyclomatic-complexity","dataset","java","metrics","metrics-gathering"],"latest_commit_sha":null,"homepage":"http://cam.yegor256.com","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yegor256.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-27T16:25:29.000Z","updated_at":"2024-05-03T05:28:23.317Z","dependencies_parsed_at":"2023-09-27T21:34:52.556Z","dependency_job_id":"4d8c3eb9-c801-417c-bbca-745d541d12b0","html_url":"https://github.com/yegor256/cam","commit_stats":{"total_commits":164,"total_committers":7,"mean_commits":"23.428571428571427","dds":"0.21341463414634143","last_synced_commit":"f882787f1f76dd2f60133dc22d4a18633d215cba"},"previous_names":[],"tags_count":35,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yegor256%2Fcam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yegor256%2Fcam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/yegor256%2Fcam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yegor256%2Fcam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yegor256","download_url":"https://codeload.github.com/yegor256/cam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245764724,"owners_count":20668467,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cyclomatic-complexity","dataset","java","metrics","metrics-gathering"],"created_at":"2024-10-30T08:08:19.356Z","updated_at":"2025-03-27T01:31:30.126Z","avatar_url":"https://github.com/yegor256.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Classes and Metrics (CaM)\n\n[![arXiv](https://img.shields.io/badge/arXiv-2403.08488-green.svg)](https://arxiv.org/abs/2403.08488)\n[![make](https://github.com/yegor256/cam/actions/workflows/make.yml/badge.svg?branch=master)](https://github.com/yegor256/cam/actions/workflows/make.yml)\n[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/yegor256/ctors-vs-size/blob/master/LICENSE.txt)\n[![Docker Cloud Automated build](https://img.shields.io/docker/cloud/automated/yegor256/cam)](https://hub.docker.com/r/yegor256/cam)\n[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=yegor256_cam2\u0026metric=alert_status)](https://sonarcloud.io/summary/new_code?id=yegor256_cam2)\n\nThis is a dataset of open source Java classes and some metrics on them.\nEvery now and then I make 
a new version of it using the scripts\nin this repository. You are welcome to use it in your research.\nEach release has a fixed version. By referring to it in your research\nyou avoid ambiguity and guarantee repeatability of your experiments.\n\nThis is a more formal explanation of this project:\n[in PDF](https://arxiv.org/abs/2403.08488).\n\nThe latest ZIP archive with the dataset is here:\n[cam-2024-03-02.zip](http://cam.yegor256.com/cam-2024-03-02.zip)\n(2.22Gb).\nThere are **48 metrics** calculated for **532,394 Java classes** from\n**1000 GitHub repositories**, including:\nlines of code (reported by [cloc](https://github.com/AlDanial/cloc));\n[NCSS](https://stackoverflow.com/questions/5486983/what-does-ncss-stand-for);\n[cyclomatic](https://en.wikipedia.org/wiki/Cyclomatic_complexity) and\n[cognitive complexity](https://en.wikipedia.org/wiki/Cognitive_complexity)\n(by [PMD](https://pmd.github.io/));\n[Halstead](https://en.wikipedia.org/wiki/Halstead_complexity_measures)\nvolume, effort, and difficulty;\n[maintainability index](https://ieeexplore.ieee.org/abstract/document/303623);\nnumber of attributes, constructors, methods;\nnumber of Git authors;\nand others ([see PDF](http://cam.yegor256.com/cam-2024-03-02.pdf)).\n\nPrevious archives (took me a few days to build each of them, using a pretty big machine):\n\n* [cam-2024-03-02.zip](http://cam.yegor256.com/cam-2024-03-02.zip)\n  (2.22Gb): 1000 repos, 48 metrics, 532K classes\n* [cam-2023-10-22.zip](http://cam.yegor256.com/cam-2023-10-22.zip)\n  (2.19Gb): 1000 repos, 33 metrics, 863K classes\n* [cam-2023-10-11.zip](http://cam.yegor256.com/cam-2023-10-11.zip)\n  (3Gb): 959 repos, 29 metrics, 840K classes\n* [cam-2021-08-04.zip](https://github.com/yegor256/cam/releases/download/0.2.0/cam-2021-08-04.zip)\n  (692Mb): 1000 repos, 15 metrics\n* [cam-2021-07-08.zip](https://github.com/yegor256/cam/releases/download/0.1.1/cam-2021-07-08.zip)\n  (387Mb): 1000 repos, 11 metrics\n\nIf you want to create a new 
dataset,\njust run the following command and the entire dataset will\nbe built in the current directory\n(you need to have [Docker](https://docs.docker.com/get-docker/) installed),\nwhere `1000` is the number of repositories to fetch from GitHub\nand `XXX` is\nyour [personal access token][create-PAT]:\n\n```bash\ndocker run --detach --name=cam --rm --volume \"$(pwd):/dataset\" \\\n  -e \"TOKEN=XXX\" -e \"TOTAL=1000\" -e \"TARGET=/dataset\" \\\n  --oom-kill-disable --memory=16g --memory-swap=16g \\\n  yegor256/cam:0.9.3 \"make -e \u003e/dataset/make.log 2\u003e\u00261\"\n```\n\nThis command will create a new Docker container running in the background\n(run `docker ps -a` to see it).\nIf you want to run Docker interactively and see all the logs,\nyou can just disable [detached mode][detached]\nby removing the `--detach` option from the command.\n\nThe dataset will be created in the current directory (may take some time,\nmaybe a few days!), and a `.zip` archive will also be there.\nThe Docker container will run in the background: you can safely close\nthe console and come back when the\ndataset is ready and the container is deleted.\n\nMake sure your server has enough\n[swap memory](https://askubuntu.com/questions/178712/how-to-increase-swap-space)\n(at least 32Gb) and free disk space (at least 512Gb);\nwithout this, the dataset will have many errors.\nIt's better to have multiple CPUs, since the entire build process is highly parallel:\nall CPUs will be utilized.\n\nIf the script fails at some point, you can restart it\nwithout deleting previously\ncreated files. 
The process is incremental: it will understand\nwhere it stopped before.\nIn order to restart an entire \"step,\" delete the corresponding directory:\n\n* `github/` to rerun `clone`\n* `temp/jpeek-logs/` to rerun `jpeek`\n* `measurements/` to rerun `measure`\n\nYou can also run it without Docker:\n\n```bash\nmake clean\nmake TOTAL=100\n```\n\nThis should work if you have all the dependencies installed, as suggested in the\n[Dockerfile](https://github.com/yegor256/cam/blob/master/Dockerfile).\n\nIn order to analyze just a single repository, do this\n([`yegor256/tojos`](https://github.com/yegor256/tojos) as an example):\n\n```bash\nmake clean\nmake REPO=yegor256/tojos\n```\n\n## How to Contribute (e.g. by adding a new metric)\n\nSuppose you want to add a new metric to the script:\n\n1. Fork the repository.\n2. Create a new file in the `metrics/` directory,\nusing one of the existing files as an example.\n3. Create a test for your metric, in the `tests/metrics/` directory.\n4. Run the entire test suite\n    (this should take a few minutes to complete, without errors):\n\n    ```bash\n    sudo make install\n    sudo make test lint\n    ```\n\n    You can also test it with Docker:\n\n    ```bash\n    docker build . -t cam\n    docker run --rm cam make test\n    ```\n\n    There is an even faster way to run all tests, with the help of Docker,\n    if you don't change any installation scripts:\n\n    ```bash\n    docker run -v $(pwd):/c --rm yegor256/cam:0.9.3 make -C /c test\n    ```\n\n5. Send us a\n[pull request](https://www.yegor256.com/2014/04/15/github-guidelines.html).\nWe will review your changes and apply them to the `master` branch shortly,\nprovided they don't violate our quality standards.\n\n## How to Calculate Additional Metrics\n\nYou may want to use this dataset as a basis, with the intent of adding your own\nmetrics on top of it. 
It should be easy:\n\n* Clone this repo into the `cam/` directory\n* Download the ZIP archive\n* Unpack it to the `cam/dataset/` directory\n* Add a new script to the `cam/metrics/` directory (use `ast.py` as an example)\n* Delete all other files except yours from the `cam/metrics/` directory\n* Run [`make`](https://www.gnu.org/software/make/) in the `cam/`\ndirectory: `sudo make install; make all`\n\n`make` should detect that a new metric was added.\nIt will apply this new metric\nto all `.java` files, generate new `.csv` reports, aggregate them with existing\nreports (in the `cam/dataset/data/` directory),\nand then update the final `.pdf` report.\n\n## How to Build a New Archive\n\nWhen it's time to build a new archive, create a new `m7i.2xlarge`\nserver (8 CPU, 32Gb RAM, 512Gb disk) with Ubuntu 22.04 in AWS.\n\nThen, install Docker on it:\n\n```bash\nsudo apt update -y\nsudo apt install -y apt-transport-https ca-certificates curl software-properties-common\ncurl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg\necho \"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\" | sudo tee /etc/apt/sources.list.d/docker.list \u003e /dev/null\nsudo apt update -y\nsudo apt-cache policy docker-ce\nsudo apt install -y docker-ce\nsudo usermod -aG docker ${USER}\n```\n\nThen, add 16Gb of swap memory:\n\n```bash\nsudo dd if=/dev/zero of=/swapfile bs=1048576 count=16384\nsudo chmod 600 /swapfile\nsudo mkswap /swapfile\nsudo swapon /swapfile\n```\n\nThen, create a [personal access token][PAT] in GitHub,\nand run Docker as explained above.\n\n[create-PAT]: https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token\n[PAT]: 
https://docs.github.com/en/enterprise-server@3.9/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens\n[detached]: https://docs.docker.com/language/golang/run-containers/#run-in-detached-mode\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyegor256%2Fcam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyegor256%2Fcam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyegor256%2Fcam/lists"}