{"id":23152551,"url":"https://github.com/khairulislam/predict-code-changes","last_synced_at":"2025-08-12T16:08:37.604Z","repository":{"id":155188382,"uuid":"300333911","full_name":"khairulislam/Predict-Code-Changes","owner":"khairulislam","description":"Predict the merge probability of Gerrit code changes","archived":false,"fork":false,"pushed_at":"2022-02-09T05:20:38.000Z","size":175364,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-08T12:07:53.642Z","etag":null,"topics":["code-review","gerrit-miner","gerrit-parser","predict-changes"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/khairulislam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-01T15:45:55.000Z","updated_at":"2022-02-09T05:20:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"fa814b91-6475-47bf-87bb-2d4ae9801795","html_url":"https://github.com/khairulislam/Predict-Code-Changes","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/khairulislam/Predict-Code-Changes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khairulislam%2FPredict-Code-Changes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khairulislam%2FPredict-Code-Changes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khairulislam%2FPredict-Code-Changes/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/khairulislam%2FPredict-Code-Changes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/khairulislam","download_url":"https://codeload.github.com/khairulislam/Predict-Code-Changes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/khairulislam%2FPredict-Code-Changes/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270092128,"owners_count":24525330,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-review","gerrit-miner","gerrit-parser","predict-changes"],"created_at":"2024-12-17T19:16:12.708Z","updated_at":"2025-08-12T16:08:32.582Z","avatar_url":"https://github.com/khairulislam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Early Prediction for Merged vs Abandoned Code Changes in Modern Code Reviews\n\nThis tool predicts whether a code change request will be merged or abandoned as\nsoon as it has been uploaded, and it can update its predictions with each new revision. The tool is developed for Gerrit-based code review systems. Its objective is to help\ncode reviewers prioritize change requests based on their probability of getting merged,\nthereby saving reviewers time and effort. \u003cb\u003e We call this tool PredCR. 
 PredCR is a LightGBM-based\nmachine learning classifier that can:\u003c/b\u003e\n\n* Predict whether a code change request will be merged with an average AUC score of 85% as soon as the\nchange request is submitted and assigned to a reviewer, improving on the state of the art [[1]](#1)\n  by 18-28%.\n* Predict merge probability even for new authors with an average AUC score of 78%, improving on the state of the art [[1]](#1)\n  by 21-31%.\n* Update its prediction for each new revision of the same change request using two adjusted approaches,\n  while maintaining significant results.\n  \n* Complete training within several seconds, making it feasible to use in real-world projects.\n  \nWe have mined changes from the following Gerrit projects:\n\n* [Eclipse](https://git.eclipse.org/r)\n* [Gerrithub](https://review.gerrithub.io)\n* [Libreoffice](https://gerrit.libreoffice.org)\n\n## Project Structure\n\nThe root folder contains the [`Config`](Config.py) file.\n\n* \u003cb\u003e Config\u003c/b\u003e : Basic configuration for data paths and models. Reduce the number of runs here if\n    you want results faster. This is also where you change which project the models run on.\n\nThe root folder has three subdirectories.\n\n### [1. Data](Data)\n\nList of subdirectories:\n\n* Eclipse\n* Gerrithub\n* Libreoffice\n\nEach project directory contains its features and experimentation\n  results. The raw data is also stored here after mining. Because of\n  GitHub's file size limit, we are unable to upload the raw data here, but it is\nshared in this [Google Drive](https://drive.google.com/drive/folders/1z2KmxgYNgO5sNBHZLb_Nm43bFqH4vi2g?usp=sharing).\n  You can also download the raw dataset for a single project from there and calculate the features from scratch if\n  you want. Just unzip the files and keep the folder structure the same as shown in the [Mining](#mining) section.\n\n### [2. Results](Results)\n\nContains CSV result files for each project in their respective subdirectories. 
Each folder has feature importance,\ntrain and test results for each fold, and overall results for our model and that of Fan et al. [[1]](#1).\n\n### [3. Source](Source)\n\nContains the source code necessary for the project. It has three subdirectories:\n\n* Experiments\n* Feature Calculators\n* Miners\n\nIt also has the [`Util.py`](Source/Util.py) file, which contains utility methods used by other files.\n\n#### [3.1 Experiments](Source/Experiments)\n\nSource code for running all the experiments mentioned in the paper.\n\n* \u003cb\u003eCalculate developer effort\u003c/b\u003e: Calculates developer effort in terms of duration in days, number of messages, and number of changes per code change.\n* \u003cb\u003eComplete mining process\u003c/b\u003e: Contains the complete raw change data mining steps\n    (except file diff).\n\n* \u003cb\u003eCross project validation\u003c/b\u003e: Calculates model performance across projects.\n* \u003cb\u003eDNN model\u003c/b\u003e: Contains the DNN model we built to find the best classifier for change prediction.\n\n#### [3.2 Feature Calculators](Source/Feature%20Calculators)\n\nContains files related to feature calculation.\n\n* \u003cb\u003eFeature calculator\u003c/b\u003e: Calculates feature sets from the raw data created after\n  mining.\n* \u003cb\u003eFeature calculator for Fan\u003c/b\u003e: Calculates features for the state-of-the-art work by\n  Fan et al. [[1]](#1). 
Their shared repository can be found [here](https://github.com/YuanruiZJU/EarlyPredictionReview).\n\n* \u003cb\u003eFeature calculator for multiple revisions\u003c/b\u003e: Calculates features when the prediction\n  is updated for each new revision of the change request.\n\n* \u003cb\u003eLongitudinal 10-fold cross validation\u003c/b\u003e: Runs each of the experiments presented in our\n  paper with a longitudinal cross-validation setup.\n\n* \u003cb\u003eLongitudinal 10-fold cross validation - Fan\u003c/b\u003e: Runs each of the experiments for the work of Fan et al. [[1]](#1)\n  that we compare against in our paper, with a longitudinal cross-validation setup.\n\n#### [3.3 Miners](Source/Miners)\n\nContains the files necessary to mine the raw code changes and related data from Gerrit projects.\n\n* \u003cb\u003eMine file diff \u003c/b\u003e: Mines file diff data for the first revision of each selected code change.\n  Used later to calculate code-segment-related features.\n\n* \u003cb\u003eMiner\u003c/b\u003e: Contains the miner class implementation, used to mine code changes from Gerrit.\n* \u003cb\u003e SimpleParser \u003c/b\u003e: Parses the JSON responses from the Gerrit server and returns them as class instances.\n  \n## How to run\n\nOpen `Predict-Code-Changes` as a project in PyCharm or any other Python IDE. People interested in just testing the\ntool should jump directly to the [Experimentation](#exp) section. All the data needed for running the experiments is\nalready uploaded. 
If you want to run them on your own mined dataset, complete the following two sections first.\n\n### \u003ca id=\"mining\"\u003eMining\u003c/a\u003e\n\n* Run the [`Source/Miners/Complete Mining Process.py`](Source/Miners/Complete%20mining%20process.py) file to start mining.\n* Set the project name, and make sure the Gerrit class has the corresponding Gerrit server address for this project.\n* Check that the directories where the data will be dumped have been created.\n* Set the start and end of the time period; changes created and closed in that period will be collected. Make\n  sure the time format is valid. Check the existing code in [`Source/Miners/Miner.py`](Source/Miners/Miner.py) for an example.\n* This step is long and time-consuming, especially because while downloading large chunks of\ndata, Gerrit servers randomly close the connection, and you have to rerun the miner several times\nuntil it succeeds in mining all changes within the period.\n\n* \u003cb\u003e For the best experience, run each step in the miner individually. \u003c/b\u003e When you want to run one, comment out the others.\nThe Gerrit change response collected by [`Source/Miners/Miner.py`](Source/Miners/Miner.py) doesn't contain file contents.\n* Run [`Source/Miners/Mine file diff.py`](Source/Miners/Mine%20file%20diff.py) after completing the previous steps to mine\nfile contents for the first revision of each selected change request.\n* This miner doesn't batch download, so expect it to take a long time to finish\nmining changes. Also, Gerrit will occasionally close connections or send\n  response-not-found messages. 
Rerunning the miner sometimes fixes that issue.\n  \n* With the mined data, the directory structure for each project will look similar to this:\n  * Eclipse\n    * change: Batch of change requests.\n    * changes: Individual change requests.\n    * diff: File diff content for the first revision of each change request.\n    * profile: Profiles of Gerrit authors.\n\n### \u003ca id=\"feature_calculation\"\u003eFeature calculation\u003c/a\u003e\n\n* This step can run only after completing the previous mining steps.\n* Currently, our raw data isn't added here.\n* Run [`Source/Feature Calculators/Feature calculator.py`](Source/Feature%20Calculators/Feature%20calculator.py) to\n calculate features from raw data. It sorts the selected changes from `Project_selected_change_list.csv` and searches\n for their corresponding mined files to calculate features.\n* So, for each change in the selected list, the following files must be present before\n  calculating features. An example is present in the Eclipse project.\n  * Project\n    * changes\n      * Project_changeNumber_change.json\n    * diff\n      * Project_changeNumber_diff.json\n    * profile\n      * profile_accountId.json\n  \n### \u003ca id=\"exp\"\u003eExperimentation\u003c/a\u003e\n\n* Set config values in the [`Config.py`](Config.py) file, for example the project, number of runs, folds,\n  feature list, data path, and seed.\n  \n* Running [`Source/Experiments/Longitudinal 10 fold cross validation.py`](Source/Experiments/Longitudinal%2010%20fold%20cross%20validation.py) will run our model experiments for `project`\n  as specified in the [`Config.py`](Config.py) file, `runs` times, using `folds` number of folds. 
It will generate\n  the following files in the `Data\\project` folder:\n  * \u003cb\u003eproject_train_result_cross.csv\u003c/b\u003e: Average train performance for each fold.\n  * \u003cb\u003eproject_test_result_cross.csv\u003c/b\u003e: Average test performance for each fold.\n  * \u003cb\u003eproject_result_cross.csv\u003c/b\u003e: Overall average performance across folds.\n  * \u003cb\u003eproject_feature_importance_cross.csv\u003c/b\u003e: Average feature importance calculated from the LightGBM\n    `feature_importances_` attribute.\n  \n* Similarly, running [`Source/Experiments/Longitudinal 10 fold cross validation - Fan.py`](Source/Experiments/Longitudinal%2010%20fold%20cross%20validation.py) will run the model experiments of Fan et al. [[1]](#1)\n  with the parameters specified in the [`Config.py`](Config.py) file.\n\n## Citation\n\nPaper link: [IST](https://www.sciencedirect.com/science/article/abs/pii/S0950584921002032), [arXiv](https://arxiv.org/pdf/1912.03437.pdf).\n\n```bibtex\n@article{ISLAM2022106756,\ntitle = {Early prediction for merged vs abandoned code changes in modern code reviews},\njournal = {Information and Software Technology},\nvolume = {142},\npages = {106756},\nyear = {2022},\nissn = {0950-5849},\ndoi = {https://doi.org/10.1016/j.infsof.2021.106756},\nurl = {https://www.sciencedirect.com/science/article/pii/S0950584921002032},\nauthor = {Khairul Islam and Toufique Ahmed and Rifat Shahriyar and Anindya Iqbal and Gias Uddin}\n}\n```\n\n## References\n\n\u003ca id=\"1\"\u003e[1]\u003c/a\u003e\n[Y. Fan, X. Xia, D. Lo, S. 
Li, Early prediction of merged code changes to prioritize reviewing tasks,\nEmpirical Software Engineering (2018) 1–48.](https://link.springer.com/content/pdf/10.1007/s10664-018-9602-0.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkhairulislam%2Fpredict-code-changes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkhairulislam%2Fpredict-code-changes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkhairulislam%2Fpredict-code-changes/lists"}