https://github.com/yegor256/cam

Classes and Metrics (CaM): a dataset of Java classes from public open-source GitHub repositories

cyclomatic-complexity dataset java metrics metrics-gathering

Host: GitHub
URL: https://github.com/yegor256/cam
Owner: yegor256
License: mit
Created: 2021-05-27T16:25:29.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2024-04-30T16:52:18.000Z (about 1 year ago)
Last Synced: 2024-05-01T23:12:14.168Z (about 1 year ago)
Topics: cyclomatic-complexity, dataset, java, metrics, metrics-gathering
Language: Shell
Homepage: http://cam.yegor256.com
Size: 2.21 MB
Stars: 18
Watchers: 3
Forks: 31
Open Issues: 37
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Citation: CITATION.cff

# Classes and Metrics (CaM)

[![arXiv](https://img.shields.io/badge/arXiv-2403.08488-green.svg)](https://arxiv.org/abs/2403.08488)

[![make](https://github.com/yegor256/cam/actions/workflows/make.yml/badge.svg?branch=master)](https://github.com/yegor256/cam/actions/workflows/make.yml)

[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/yegor256/cam/blob/master/LICENSE.txt)

[![Docker Cloud Automated build](https://img.shields.io/docker/cloud/automated/yegor256/cam)](https://hub.docker.com/r/yegor256/cam)

[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=yegor256_cam2&metric=alert_status)](https://sonarcloud.io/summary/new_code?id=yegor256_cam2)

This is a dataset of open source Java classes and some metrics on them.
Every now and then I make a new version of it using the scripts
in this repository. You are welcome to use it in your research.
Each release has a fixed version. By referring to it in your research,
you avoid ambiguity and guarantee the repeatability of your experiments.

A more formal explanation of this project is available
[in PDF](https://arxiv.org/abs/2403.08488).

The latest ZIP archive with the dataset is here:
[cam-2024-03-02.zip](http://cam.yegor256.com/cam-2024-03-02.zip)
(2.22 GB).
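
To fetch a specific release from a script, something like the sketch below may help. The helper names `cam_archive_url` and `fetch_cam` are hypothetical, and the URL pattern is only inferred from the 2023 and 2024 archive links in this README (the 2021 archives are hosted on GitHub releases instead).

```bash
# Sketch: build the download URL of a CaM release from its date stamp.
# Assumption: archives follow the http://cam.yegor256.com/cam-<date>.zip
# pattern seen in the links of this README.
cam_archive_url() {
  local version="$1"
  echo "http://cam.yegor256.com/cam-${version}.zip"
}

# Download the archive (resuming a partial download) and unpack it
# into the target directory ("dataset" by default).
fetch_cam() {
  local version="$1" target="${2:-dataset}"
  curl --fail --location --continue-at - \
    --output "cam-${version}.zip" "$(cam_archive_url "${version}")"
  unzip -q "cam-${version}.zip" -d "${target}"
}
```

For example, `fetch_cam 2024-03-02 cam/dataset` would download the latest archive and unpack it into `cam/dataset/`.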

There are **48 metrics** calculated for **532,394 Java classes** from
**1000 GitHub repositories**, including:
lines of code (reported by [cloc](https://github.com/AlDanial/cloc));
[NCSS](https://stackoverflow.com/questions/5486983/what-does-ncss-stand-for);
[cyclomatic](https://en.wikipedia.org/wiki/Cyclomatic_complexity) and
[cognitive complexity](https://en.wikipedia.org/wiki/Cognitive_complexity)
(by [PMD](https://pmd.github.io/));
[Halstead](https://en.wikipedia.org/wiki/Halstead_complexity_measures)
volume, effort, and difficulty;
[maintainability index](https://ieeexplore.ieee.org/abstract/document/303623);
number of attributes, constructors, methods;
number of Git authors;
and others ([see PDF](http://cam.yegor256.com/cam-2024-03-02.pdf)).

Previous archives (each took me a few days to build on a pretty big machine):

* [cam-2024-03-02.zip](http://cam.yegor256.com/cam-2024-03-02.zip)
  (2.22 GB): 1000 repos, 48 metrics, 532K classes
* [cam-2023-10-22.zip](http://cam.yegor256.com/cam-2023-10-22.zip)
  (2.19 GB): 1000 repos, 33 metrics, 863K classes
* [cam-2023-10-11.zip](http://cam.yegor256.com/cam-2023-10-11.zip)
  (3 GB): 959 repos, 29 metrics, 840K classes
* [cam-2021-08-04.zip](https://github.com/yegor256/cam/releases/download/0.2.0/cam-2021-08-04.zip)
  (692 MB): 1000 repos, 15 metrics
* [cam-2021-07-08.zip](https://github.com/yegor256/cam/releases/download/0.1.1/cam-2021-07-08.zip)
  (387 MB): 1000 repos, 11 metrics

To create a new dataset,
run the following command, and the entire dataset will
be built in the current directory
(you need to have [Docker](https://docs.docker.com/get-docker/) installed).
Here, `1000` is the number of repositories to fetch from GitHub
and `XXX` is
your [personal access token][create-PAT]:

```bash
docker run --detach --name=cam --rm --volume "$(pwd):/dataset" \
  -e "TOKEN=XXX" -e "TOTAL=1000" -e "TARGET=/dataset" \
  --oom-kill-disable --memory=16g --memory-swap=16g \
  yegor256/cam:0.9.3 "make -e >/dataset/make.log 2>&1"
```

This command will create a new Docker container running in the background
(run `docker ps -a` to see it).
If you want to run Docker interactively and see all the logs,
you can disable [detached mode][detached]
by removing the `--detach` option from the command.

The dataset will be created in the current directory (this may take
a few days!), and a `.zip` archive will also be there.
Since the Docker container runs in the background, you can safely close
the console and come back when the
dataset is ready and the container is deleted.
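
While a detached build is running, a couple of small helpers can tell you how it is doing. This is only a sketch: the function names are hypothetical, it assumes the container name `cam` and the `make.log` file from the `docker run` command above, and treating the word "error" as a failure marker is a guess rather than a documented log format.

```bash
# Sketch: check on a detached CaM build.
# Assumptions: the container is named "cam" and make.log sits in the
# dataset directory, as in the `docker run` command of this README.

# Succeeds while a container whose name matches "cam" is still running.
cam_is_running() {
  [ -n "$(docker ps --filter "name=cam" --format "{{.Names}}")" ]
}

# Prints the number of log lines that mention "error" (case-insensitive).
# grep -c exits with 1 when the count is zero, hence the "|| true".
count_log_errors() {
  grep -ci "error" "$1" || true
}
```

For example, `cam_is_running && tail -n 20 make.log` shows the latest log lines only while the build is still in progress, and `count_log_errors make.log` gives a rough idea of how many problems were logged so far.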

Make sure your server has enough
[swap memory](https://askubuntu.com/questions/178712/how-to-increase-swap-space)
(at least 32 GB) and free disk space (at least 512 GB):
without this, the dataset will have many errors.
It's better to have multiple CPUs, since the entire build process is highly parallel
and will use all of them.
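
Before starting a multi-day build, it may be worth checking the server against the minimums quoted above. The sketch below is Linux-only (it reads `/proc/meminfo`), and all helper names in it are hypothetical.

```bash
# Sketch: preflight check for swap and free disk space.
# Thresholds are the ones quoted in this README: 32 GB of swap
# and 512 GB of free disk.

# Converts kilobytes to whole gigabytes (rounding down).
kb_to_gb() {
  echo $(( $1 / 1024 / 1024 ))
}

# Total swap, in kB, as reported by the Linux kernel.
swap_kb() {
  awk '/^SwapTotal:/ { print $2 }' /proc/meminfo
}

# Free space, in kB, on the filesystem holding the given directory.
free_disk_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Prints a warning to stderr for each threshold that is not met.
preflight() {
  local swap disk
  swap=$(kb_to_gb "$(swap_kb)")
  disk=$(kb_to_gb "$(free_disk_kb "${1:-.}")")
  [ "${swap}" -ge 32 ] || echo "swap is ${swap} GB, at least 32 GB advised" >&2
  [ "${disk}" -ge 512 ] || echo "free disk is ${disk} GB, at least 512 GB advised" >&2
}
```

Running `preflight .` in the directory where the dataset will be built prints nothing if both thresholds are met.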

If the script fails at some point, you can restart it
without deleting previously
created files. The process is incremental: it will detect
where it stopped before.
To rerun an entire step, delete the corresponding directory:

* `github/` to rerun `clone`
* `temp/jpeek-logs/` to rerun `jpeek`
* `measurements/` to rerun `measure`
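
The step-to-directory mapping above can be wrapped in a small helper, so that a step is reset by name. This is a sketch with hypothetical function names; the directory names come from the list above and are taken relative to the dataset directory.

```bash
# Sketch: reset a build step by deleting its directory.

# Prints the directory that belongs to a step, or fails for unknown steps.
step_dir() {
  case "$1" in
    clone) echo "github" ;;
    jpeek) echo "temp/jpeek-logs" ;;
    measure) echo "measurements" ;;
    *) return 1 ;;
  esac
}

# Usage: reset_step <step> [dataset-dir]
# Deletes the step's directory so that the next `make` reruns the step.
reset_step() {
  local dir target="${2:-.}"
  dir=$(step_dir "$1") || { echo "unknown step: $1" >&2; return 1; }
  rm -rf "${target}/${dir}"
}
```

For example, `reset_step measure .` removes `measurements/` in the current directory, after which restarting the build recalculates the metrics.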

You can also run it without Docker:

```bash
make clean
make TOTAL=100
```

This should work if you have all the dependencies installed, as suggested in the
[Dockerfile](https://github.com/yegor256/cam/blob/master/Dockerfile).

To analyze just a single repository, do this
([`yegor256/tojos`](https://github.com/yegor256/tojos) as an example):

```bash
make clean
make REPO=yegor256/tojos
```

## How to Contribute (e.g. by adding a new metric)

Suppose you want to add a new metric to the script:

1. Fork the repository.

2. Create a new file in the `metrics/` directory,
   using one of the existing files as an example.

3. Create a test for your metric in the `tests/metrics/` directory.

4. Run the entire test suite
    (this should take a few minutes to complete, without errors):

    ```bash
    sudo make install
    sudo make test lint
    ```

    You can also test it with Docker:

    ```bash
    docker build . -t cam
    docker run --rm cam make test
    ```

    There is an even faster way to run all tests, with the help of Docker,
    if you don't change any installation scripts:

    ```bash
    docker run -v "$(pwd):/c" --rm yegor256/cam:0.9.3 make -C /c test
    ```

5. Send us a
   [pull request](https://www.yegor256.com/2014/04/15/github-guidelines.html).
   We will review your changes and apply them to the `master` branch shortly,
   provided they don't violate our quality standards.
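
As an illustration of step 2, a new metric might look like the sketch below. Everything about its interface is an assumption made for the example (two arguments, a `.java` file and an output file, and one `<name> <value> <description>` line per metric), so check an existing file in `metrics/` for the real convention before copying it. The `blanks` metric itself is hypothetical.

```bash
# Sketch of a hypothetical metric: the number of blank lines in a file.
# Assumed interface (verify against an existing script in metrics/):
#   $1 is the path to a .java file, $2 is the file to write the result to,
#   and each output line is "<name> <value> <description>".
blank_lines_metric() {
  local java="$1" output="$2" count
  # grep -c prints the number of matching lines; it exits with 1 when
  # the count is zero, hence the "|| true".
  count=$(grep -c '^[[:space:]]*$' "${java}" || true)
  echo "blanks ${count} Number of blank lines" > "${output}"
}
```

A matching test in `tests/metrics/` would feed the script a tiny Java file with a known number of blank lines and compare the output with the expected line.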

## How to Calculate Additional Metrics

You may want to use this dataset as a basis, with the intent of adding your own
metrics on top of it. It should be easy:

* Clone this repo into the `cam/` directory
* Download the ZIP archive
* Unpack it to the `cam/dataset/` directory
* Add a new script to the `cam/metrics/` directory (use `ast.py` as an example)
* Delete all other files except yours from the `cam/metrics/` directory
* Run [`make`](https://www.gnu.org/software/make/) in the `cam/`
  directory: `sudo make install; make all`

`make` should detect that a new metric was added.
It will apply this new metric
to all `.java` files, generate new `.csv` reports, aggregate them with existing
reports (in the `cam/dataset/data/` directory),
and then update the final `.pdf` report.
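
The "delete all other files except yours" step is easy to get wrong by hand, so here is a minimal sketch of it. The helper name `keep_only_metric` is hypothetical; it refuses to run when the file to keep is missing, so that a typo does not wipe the whole directory.

```bash
# Sketch: keep a single script in a metrics directory, deleting the rest.
# Usage: keep_only_metric <metrics-dir> <file-to-keep>
keep_only_metric() {
  local dir="$1" keep="$2" file
  [ -f "${dir}/${keep}" ] || { echo "no such metric: ${dir}/${keep}" >&2; return 1; }
  for file in "${dir}"/*; do
    # Skip subdirectories and anything else that is not a regular file.
    [ -f "${file}" ] || continue
    [ "$(basename "${file}")" = "${keep}" ] || rm -f "${file}"
  done
}
```

For example, after adding a hypothetical `cam/metrics/my-metric.sh`, running `keep_only_metric cam/metrics my-metric.sh` leaves only that script in place before you run `make`.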

## How to Build a New Archive

When it's time to build a new archive, create a new `m7i.2xlarge`
server (8 CPU, 32 GB RAM, 512 GB disk) with Ubuntu 22.04 in AWS.
Then, install Docker on it:

```bash
sudo apt update -y
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update -y
sudo apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo usermod -aG docker "${USER}"
```

Then, add 16 GB of swap memory:

```bash
sudo dd if=/dev/zero of=/swapfile bs=1048576 count=16384
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
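
The `count` value passed to `dd` is simply the swap size in MiB, since the block size is 1048576 bytes (1 MiB). A small sketch of that arithmetic, with a hypothetical helper name, makes it easy to pick another size:

```bash
# Sketch: the `count` to pass to dd for a swap file of a given size,
# with bs=1048576 (1 MiB) as in the commands above.
# Usage: swap_blocks <size-in-GiB>
swap_blocks() {
  echo $(( $1 * 1024 ))
}
```

For example, `sudo dd if=/dev/zero of=/swapfile bs=1048576 count="$(swap_blocks 32)"` would allocate a 32 GiB file. After `swapon`, the result can be checked with `swapon --show` or `free -h`.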

Then, create a [personal access token][PAT] in GitHub,

and run Docker as explained above.

[create-PAT]: https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token

[PAT]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens

[detached]: https://docs.docker.com/language/golang/run-containers/#run-in-detached-mode
