{"id":13732355,"url":"https://github.com/wogscpar/SZZUnleashed","last_synced_at":"2025-05-08T06:32:02.223Z","repository":{"id":33925874,"uuid":"132736067","full_name":"wogscpar/SZZUnleashed","owner":"wogscpar","description":"An implementation of the SZZ algorithm, i.e., an approach to identify bug-introducing commits.","archived":false,"fork":false,"pushed_at":"2023-10-04T19:13:06.000Z","size":6451,"stargazers_count":108,"open_issues_count":13,"forks_count":76,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-08-04T02:11:02.272Z","etag":null,"topics":["defect-prediction","git","mining-software-repositories","software-engineering-research","szz","szz-algorithm"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wogscpar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-05-09T09:47:32.000Z","updated_at":"2024-07-01T06:36:23.000Z","dependencies_parsed_at":"2022-08-07T23:30:46.247Z","dependency_job_id":null,"html_url":"https://github.com/wogscpar/SZZUnleashed","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wogscpar%2FSZZUnleashed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wogscpar%2FSZZUnleashed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wogscpar%2FSZZUnleashed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wogscpar%2FSZZUnleashed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wogscpar","do
wnload_url":"https://codeload.github.com/wogscpar/SZZUnleashed/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224708197,"owners_count":17356497,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["defect-prediction","git","mining-software-repositories","software-engineering-research","szz","szz-algorithm"],"created_at":"2024-08-03T02:01:54.038Z","updated_at":"2024-11-14T23:32:10.454Z","avatar_url":"https://github.com/wogscpar.png","language":"Java","readme":"# SZZ Unleashed\n\nSZZ Unleashed is an implementation of the SZZ algorithm, i.e., an approach to identify bug-introducing commits, introduced by Śliwerski et al. in [\"When Do Changes Induce Fixes?\"](https://www.st.cs.uni-saarland.de/papers/msr2005/), in *Proc. of the International Workshop on Mining Software Repositories*, May 17, 2005. \nThe implementation uses \"line number mappings\" as proposed by Williams and Spacco in [\"SZZ Revisited: Verifying When Changes Induce Fixes\"](https://www.researchgate.net/publication/220854597_SZZ_revisited_verifying_when_changes_induce_fixes), in *Proc. of the Workshop on Defects in Large Software Systems*, July 20, 2008.\n\nThis repository responds to the call for public SZZ implementations by Rodríguez-Pérez, Robles, and González-Barahona. 
[\"Reproducibility and Credibility in Empirical Software Engineering: A Case Study Based on a Systematic Literature Review of the use of the SZZ Algorithm\"](https://www.researchgate.net/publication/323843822_Reproducibility_and_Credibility_in_Empirical_Software_Engineering_A_Case_Study_based_on_a_Systematic_Literature_Review_of_the_use_of_the_SZZ_algorithm), *Information and Software Technology*, Volume 99, 2018.\n\nIf you find SZZ Unleashed useful for your research, please cite our paper:\n- Borg, M., Svensson, O., Berg, K., \u0026 Hansson, D., SZZ Unleashed: An Open Implementation of the SZZ Algorithm - Featuring Example Usage in a Study of Just-in-Time Bug Prediction for the Jenkins Project. In *Proc. of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)*, pp. 7-12, 2019. arXiv preprint [arXiv:1903.01742](https://arxiv.org/abs/1903.01742).\n\n# Table of Contents\n1. [Background](#background)\n2. [Running SZZ Unleashed](#szz_usage)\n3. [SZZ Unleashed with Docker](#szz_docker)\n4. [Example Application: Training a Classifier for Just-in-Time Bug Prediction](#feat_extract)\n5. [Examples and executables](#examples_n_exec)\n6. [Authors](#authors)\n\n## Background \u003ca name=\"background\"\u003e\u003c/a\u003e\n\nThe SZZ algorithm is used to find bug-introducing commits from a set of bug-fixing commits. \nThe bug-fixing commits can be extracted either from a bug tracking system such as Jira or simply by searching for commits that state that they are fixing something. The identified bug-introducing commits can then be used to support empirical software engineering research, e.g., on defect prediction or software quality. As an example, this implementation has been used to collect training data for a machine learning-based approach to risk classification of individual commits, i.e., training a random forest classifier to highlight commits that deserve particularly careful code review. 
The work is described in an [MSc thesis from Lund University](https://www.lunduniversity.lu.se/lup/publication/8971266).\n\n## Running SZZ Unleashed \u003ca name=\"szz_usage\"\u003e\u003c/a\u003e\nBuilding and running SZZ Unleashed requires Java 8 and Gradle. Python is required to run the supporting scripts, and Docker must be installed to use the provided Docker images. All scripts and builds have been tested on Linux and Mac, and partly on Windows 10.\n\nThe figure shows a suggested workflow consisting of four steps. Step 1 and Step 2 are pre-steps needed to collect and format required data. Step 3 is SZZ Phase 1, i.e., identifying bug-fixing commits. Step 4 is SZZ Phase 2, i.e., identifying bug-introducing commits. Steps 1-3 are implemented in Python scripts, whereas Step 4 is implemented in Java.\n\n![SZZ Unleashed workflow](/workflow.png) \u003ca name=\"workflow\"\u003e\u003c/a\u003e\n\n### Step 1. Fetch issues (SZZ pre-step) ###\nTo get issues, one needs a bug tracking system. As an example, the Jenkins project uses [Jira](https://issues.jenkins-ci.org).\nFrom there, it is possible to fetch issues that can then be linked to bug-fixing commits.\n\nWe have provided an example script that can be used to fetch issues from the Jenkins issue tracker (see 1 in the [figure](#workflow)). In the directory fetch_jira_bugs, one can find the **fetch.py** script. The script has a JQL string which is used as a filter to get certain issues. Jira provides a neat way to test these JQL strings directly in the [web page](https://issues.jenkins-ci.org/browse/JENKINS-41020?jql=). Change to the advanced view and then enter the search criteria. 
Notice that the JQL string is generated in the browser's URL bar once Enter is pressed.\n\nTo fetch issues from the Jenkins Jira, just run:\n```shell\npython fetch.py --issue-code \u003cissue_code\u003e --jira-project \u003cjira_project_base_url\u003e\n```\npassing as parameters the code used for the project issues on Jira and the name of the Jira repository of the project (e.g., _issues.jenkins-ci.org_). The script creates a directory with issues (see the issues folder in the [figure](#workflow)). These issues will later be used by the `find_bug_fixes.py` script. \n\nA more thorough example of this script can be found [here](./examples/Fetch.md).\n\n### Step 2. Preprocess the git log output (SZZ pre-step) ###\nSecond, we need to convert the `git log` output to something that can be processed. That requires a local copy of the repository that we aim to analyze, in this example the [Jenkins Core Repository](https://github.com/jenkinsci/jenkins). Once it is cloned, one can run the **git_log_to_array.py** script (see 2 in the [figure](#workflow)). The script requires an absolute path to the cloned repository and a SHA-1 for an initial commit.\n```shell\npython git_log_to_array.py --from-commit \u003cSHA-1_of_initial_commit\u003e --repo-path \u003cpath_to_local_repo\u003e\n```\nOnce executed, this creates a file `gitlog.json` that can be used together with the issues that we fetched with the `fetch.py` script. \n\nAn example of this script and what it produces can be found [in the examples](./examples/GitlogToArray.md).\n\n### Step 3. Identify bug-fixing commits (SZZ Phase 1) ###\nNow, using the `find_bug_fixes.py` script (see 3 in the [figure](#workflow)) and the `gitlog.json` file, we can get a JSON file\nthat contains each issue and its corresponding commit SHA-1, the commit date, the creation date and the resolution date. 
Just run:\n```shell\npython find_bug_fixes.py --gitlog \u003cpath_to_gitlog_file\u003e --issue-list \u003cpath_to_issues_directory\u003e --gitlog-pattern \"\u003ca_pattern_for_matching_fixes\u003e\"\n```\nThe output is `issue_list.json`, which is later used in the SZZ algorithm.\n\nAn example output of this script can be found in [the examples](./examples/FindBugFixes.md).\n\n### Step 4. Identify bug-introducing commits (SZZ Phase 2) ###\nThis implementation works regardless of programming language and file type. It uses\n[JGit](https://www.eclipse.org/jgit/) to parse a git repository.\n\nTo build a runnable jar file, use the Gradle build script in the szz directory:\n\n```shell\ngradle build \u0026\u0026 gradle fatJar\n```\n\nOr, to run the algorithm without building a jar:\n\n```shell\ngradle build \u0026\u0026 gradle runJar\n```\n\nThe algorithm tries to use as many cores as possible during runtime.\n\nTo get the bug-introducing commits from a repository using the `issue_list.json` file produced\nin Step 3, run (see 4 in the [figure](#workflow)):\n\n```shell\njava -jar szz_find_bug_introducers-\u003cversion_number\u003e.jar -i \u003cpath_to_issue_list.json\u003e -r \u003cpath_to_local_repo\u003e\n```\n\n## Output from SZZ Unleashed\nAs shown in the [figure](#workflow), the output consists of three different files: `commits.json`,\n`annotations.json` and `fix_and_bug_introducing_pairs.json`.\n\nThe `commits.json` file includes all commits that have been blamed as\nbug-introducing but have not been analyzed further.\n\nThe `annotations.json` file is a representation of the graph that is generated by the\nalgorithm in the blaming phase. Each bug-fixing commit is linked to all possible\ncommits which could be responsible for the bug. Using the improvement from\nWilliams and Spacco, the graph also contains subgraphs, which give a deeper search\nfor responsible commits. 
This enables the algorithm to blame commits other than\njust the one closest in history for a bug.\n\nLastly, the `fix_and_bug_introducing_pairs.json` file includes all possible pairs\nof a bug-introducing commit and a bug-fixing commit. This file is not sorted in any\nway and it includes duplicates when it comes to both introducers and fixes. A\nfix can be made several times and an introducer could be responsible for many\nfixes.\n\n## Configuring SZZ Unleashed\n\nA description of how to configure SZZ Unleashed further can be found in [the examples](./examples/BugIntroducersFinder.md).\n\n## Use SZZ Unleashed with Docker \u003ca name=\"szz_docker\"\u003e\u003c/a\u003e\n\nMore thorough instructions for using Docker to produce the results can be found in [doc/Docker.md](doc/Docker.md). Below is a very brief walkthrough.\n\nThere is a *Dockerfile* in the repository. It contains all the steps, in chronological order, that are needed to generate the `fix_and_bug_introducing_pairs.json` file. Simply run this command in the directory where the Dockerfile is located:\n\n```bash\ndocker build -t szz .\n```\n\nThen start a temporary Docker container:\n```bash\ndocker run -it --name szz_con szz ash\n```\nIn this container it is possible to study the results from the algorithm. The results are located in *./szz/results*.\n\nLastly, to copy the results from the container to your own computer, run:\n```bash\ndocker cp szz_con:/root/szz/results .\n```\n\nNote that the temporary container must be running while the *docker cp* command is executed. To be sure, check that *szz_con* is listed when running:\n```bash\ndocker ps\n```\n\n## Example Application: Training a Classifier for Just-in-Time Bug Prediction \u003ca name=\"feat_extract\"\u003e\u003c/a\u003e\nTo illustrate what the output from SZZ Unleashed can be used for, we show how to train a classifier for Just-in-Time Bug Prediction, i.e., predicting whether individual commits are bug-introducing or not. 
We now have a set of bug-introducing commits and a set of correct commits. We proceed by representing individual commits by a set of features, based on previous research on bug prediction. \n\n### Code Churns ###\nThe simplest features are the code churns. These are easily extracted by\njust parsing each diff for each commit. The ones that are extracted are:\n\n1. **Total lines of code** - The total number of lines of code in all changed files.\n2. **Churned lines of code** - The number of inserted lines.\n3. **Deleted lines of code** - The number of deleted lines.\n4. **Number of Files** - The total number of changed files.\n\nTo get these features, run: `python assemble_code_churns.py \u003cpath_to_repo\u003e \u003cbranch\u003e`\n\n### Diffusion Features ###\nThe diffusion features are:\n\n1. The number of modified subsystems.\n2. The number of modified subdirectories.\n3. The entropy of the change.\n\nTo extract the diffusion features, just run:\n`python assemble_diffusion_features.py --repository \u003cpath_to_repo\u003e --branch \u003cbranch\u003e`\n\n### Experience Features ###\nThis is perhaps the most sensitive feature group. The experience features\nmeasure how much experience a developer has, calculated based on both overall \nactivity in the repository and recent activity.\n\nThe features are:\n\n1. Overall experience.\n2. Recent experience.\n\nThe script builds a graph to keep track of each author's experience. The initial\nrun is:\n`python assemble_experience_features.py --repository \u003crepo_path\u003e --branch \u003cbranch\u003e --save-graph`\n\nThis results in a graph that the script reuses in later runs.\n\nTo rerun the analysis without generating a new graph, just run:\n`python assemble_experience_features.py --repository \u003crepo_path\u003e --branch \u003cbranch\u003e`\n\n### History Features ###\nThe history is represented by the following:\n\n1. The number of authors in a file.\n2. 
The time between contributions made by the author.\n3. The number of unique changes between the last commit.\n\nAnalogous to the experience features, the script must initially generate a graph\nwhere the file metadata is saved.\n`python assemble_history_features.py --repository \u003crepo_path\u003e --branch \u003cbranch\u003e --save-graph`\n\nTo rerun the script without generating a new graph, use:\n`python assemble_history_features.py --repository \u003crepo_path\u003e --branch \u003cbranch\u003e`\n\n### Purpose Features ###\nThe purpose feature is just a binary feature representing whether a commit is a fix or\nnot. This feature can be extracted by running:\n\n`python assemble_purpose_features.py --repository \u003crepo_path\u003e --branch \u003cbranch\u003e`\n\n### Coupling ###\nThe coupling features are a more complex type of feature. These indicate\nhow strong the relation is between files and modules for a revision. This means\nthat two files can have a relation even though they don't have a relation\ninside the source code itself. Mining these relations yields features that\nindicate how many files a commit has actually made changes to.\n\nThe mining is performed by a Docker image containing the tool code-maat.\n\nNote that calculating these features is time-consuming. They are extracted by:\n\n```shell\npython assemble_features.py --image code-maat --repo-dir \u003cpath_to_repo\u003e --result-dir \u003cpath_to_write_result\u003e\npython assemble_coupling_features.py \u003cpath_to_repo\u003e\n```\n\nIt is also possible to specify which commits to analyze. This is done with the\nCLI option `--commits \u003cpath_to_file_with_commits\u003e`. The file contains\none commit SHA-1 per line.\n\nIf the analysis is made by several Docker containers, one has to specify\nthe `--assemble` option. 
This will collect and store\nall results in a single directory.\n\nThe script can check if there are any commits that haven't been\nanalyzed. To do that, specify the `--missing-commits` option.\n\n## Classification ##\nNow that all features have been extracted, the machine learning classifier can\nbe trained and tested. In this example, we train a random forest classifier. To do this, run the model script in the model directory:\n```shell\npython model.py train\n```\n\n## Examples and executables \u003ca name=\"examples_n_exec\"\u003e\u003c/a\u003e\n\nIn [the examples](./examples) directory, one can find documents describing each script. There is also [a data directory](./examples/data) containing data produced by the scripts. It can be used either to study what the output should look like or as a ready-made dataset to train on.\n\n## Authors \u003ca name=\"authors\"\u003e\u003c/a\u003e\n\n[Oscar Svensson](mailto:wgcp92@gmail.com)\n[Kristian Berg](mailto:kristianberg.jobb@gmail.com)\n","funding_links":[],"categories":["Tools"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwogscpar%2FSZZUnleashed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwogscpar%2FSZZUnleashed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwogscpar%2FSZZUnleashed/lists"}