{"id":13716147,"url":"https://github.com/cedricrupb/TSSB3M","last_synced_at":"2025-05-07T05:32:17.653Z","repository":{"id":91841079,"uuid":"438166935","full_name":"cedricrupb/TSSB3M","owner":"cedricrupb","description":"Mining tool and large-scale datasets of single statement bug fixes in Python","archived":false,"fork":false,"pushed_at":"2023-11-29T10:29:32.000Z","size":314,"stargazers_count":14,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-08-04T00:11:52.274Z","etag":null,"topics":["bugs","dataset","mining"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cedricrupb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-12-14T08:02:34.000Z","updated_at":"2023-03-19T13:29:18.000Z","dependencies_parsed_at":"2023-04-01T06:04:34.633Z","dependency_job_id":null,"html_url":"https://github.com/cedricrupb/TSSB3M","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2FTSSB3M","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2FTSSB3M/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2FTSSB3M/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2FTSSB3M/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cedricrupb","download_url":"https://codeload.github.com/cedricrupb/TSSB3M/tar.gz/refs/heads/main","host":{"name":"GitHub","url
":"https://github.com","kind":"github","repositories_count":224567475,"owners_count":17332828,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bugs","dataset","mining"],"created_at":"2024-08-03T00:01:07.533Z","updated_at":"2024-11-14T04:31:07.354Z","avatar_url":"https://github.com/cedricrupb.png","language":"Python","funding_links":[],"categories":["Data Sets and Benchmarks"],"sub_categories":["Bug Localization"],"readme":"# TSSB-3M: Mining single statement bugs at massive scale\n\u003e Mining tool and large-scale datasets of single statement bug fixes in Python\n\n[[**PAPER**](https://arxiv.org/abs/2201.12046) | [**DATASETS**](#datasets) | [**CODE ARTIFACT**](https://doi.org/10.5281/zenodo.5898547)]\n\nAccess to single statement bug fixes at massive scale is not only important for exploring how developers introduce bugs in code and fix them but it is also\na valuable resource for research in data-driven\nbug detection and automatic repair. Therefore, we are releasing multiple large-scale collections of single statement bug fixes mined from public Python repositories.\n\n## :warning: Deduplicated Datasets\nWe came to notice that our datasets contain a significant number of duplicate patches that were missed by our deduplication procedure. To mitigate this, we are releasing cleaned versions of **TSSB-3M** and **SSB-9M**:\n\n* [**CTSSB-1M**](https://tssb3m.s3.eu-west-1.amazonaws.com/ctssb_data_1M.zip) A cleaned version of TSSB-3M containing nearly a million isolated single statement bug fixes. 
\n\n* [**CSSB-2.6M**](https://tssb3m.s3.eu-west-1.amazonaws.com/cssb_data_2_6M.zip) A cleaned version of SSB-9M containing over 2.6 million single statement bug fixes.\n\nTo obtain the cleaned versions of the two datasets, we implemented a more aggressive deduplication scheme (see `run_udiff_deduplication.py`). The cleaned datasets are also available on [Zenodo](https://doi.org/10.5281/zenodo.10217373). Statistics of the new datasets can be found [below](#statistics).\n\n## Datasets\nTo facilitate future research, we are releasing three\ndatasets:\n* [**TSSB-3M:**](https://tssb3m.s3.eu-west-1.amazonaws.com/tssb_data_3M.zip) A dataset of over 3 million isolated single statement bug fixes. Each bug fix is related to a commit in a public Python repository that does not change more \nthan a single statement.\n\n* [**SSB-9M:**](https://tssb3m.s3.eu-west-1.amazonaws.com/ssb_data_9M.zip) A dataset of over 9 million single statement bug fixes. Each fix modifies at least a single statement to fix a bug. However, the related code changes might incorporate changes to other files.\n\n* [**SSC-28M:**](https://tssb3m.s3.eu-west-1.amazonaws.com/ssc_data_28M.zip) A dataset of over 28 million general single statement changes. We are releasing this dataset with the intention to facilitate research in software evolution. Therefore, a code change might not necessarily relate to a bug fix.\n\nAll datasets are available at [Zenodo](https://doi.org/10.5281/zenodo.10217373).\n\nThe datasets were collected for our research project related to:\n```bibtex\n@inproceedings{richter2022tssb,\n  title={TSSB-3M: Mining single statement bugs at massive scale},\n  author={Cedric Richter and Heike Wehrheim},\n  booktitle={MSR},\n  year={2022}\n}\n```\n\n## Mining tool\nThis project has led to multiple open source libraries released in independent repositories:\n\n* [code.diff:](https://github.com/cedricrupb/code_diff) A library for fast AST-based code differencing. 
The library is employed to compute\nAST edit scripts between code changes and to detect\nSStuB patterns.\n\n* [code.tokenize:](https://github.com/cedricrupb/code_tokenize) A library for fast tokenization and AST analysis of program code. This library was mainly developed for parsing source code during code differencing and is therefore the base for code.diff.\n\nThis repository additionally includes all scripts used for mining single line edits and for filtering the datasets for single statement bug fixes. A description of the mining process can be found [below](#Mining-Process).\n\n## Quick start (for using the datasets)\nWe provide our datasets as sets of commits referenced by URLs and git SHAs\nand annotated with additional analytical information. All entries are stored in jsonlines format where each entry contains the following information:\n```json\n{\n  \"project_url\": \"URL of project containing the commit\",\n  \"commit_sha\" : \"commit SHA of the code change\",\n  \"file_path\"  : \"File path of the changed source file\",\n  \"diff\"       : \"Universal diff of the code change\",\n  ...\n}\n```\nA more detailed overview can be found [here](#json-fields). While the data contained in our datasets can be sufficient for most use cases, we sometimes wish to extract the exact code from the original project. Therefore, we provide a `get_python_bugs.py` script that provides a skeleton implementation for extracting the code before and after the bug fix included in our datasets. The script automatically reads the datasets and clones the original repositories (thanks to [PyDriller](https://github.com/ishepard/pydriller)). The `visit_buggy_commit` function needs to be implemented:\n\n* `visit_buggy_commit` is called on the referenced commit. Information like the code before and after the commit can be obtained\nby processing the available PyDriller objects. 
Results of the mining process can be automatically stored by just returning a JSON dict, which\nis then stored in a jsonlines format.\n\nNote, however, that cloning all repositories referenced by the datasets might require multiple days (or months) on a single machine. Therefore, filtering the dataset beforehand might be necessary.\n\n## Dataset Info\nIn the following, we provide an overview of central\nstatistics of the released datasets and a description of the stored\ndataset entries.\n\n### Statistics\n\nSStuB statistics:\n\nPattern Name\t| CTSSB-1M |\tTSSB-3M|\tSSB-9M     \n----------------|----------------|----------------|-----------------------\n| Change Identifier Used  | 69K\t|   237K\t|      659K      \t\n| Change Binary Operand | 48K\t|   174K\t|      349K      \n| Same Function More Args | 41K\t|   150K\t|      457K   \n| Wrong Function Name   | 39K\t|   134K\t|      397K\n| Add Function Around Expression | 32K\t|   117K\t|      244K \n| Change Attribute Used  | 30K\t|   104K\t|      285K      \n| Change Numeric Literal | 33K\t|   97K\t|      275K \n| More Specific If  | 16K\t|   68K\t|      121K\n| Add Method Call  | 17K \t|   60K\t|      118K          \t\n| Add Elements To Iterable | 15K \t|   57K\t|      175K\n| Same Function Less Args | 14K\t|   50K\t|      169K     \n| Change Boolean Literal | 13K \t|   37K\t|      82K\n| Add Attribute Access | 10K \t|   32K\t|      74K\n| Change Binary Operator | 9K\t|   29K\t|      71K\n| Same Function Wrong Caller | 8K\t|   25K\t|      46K\n| Less Specific If  | 5K \t|   22K\t|      45K\n| Change Keyword Argument Used | 6K \t|   20K\t|      59K\n| Change Unary Operator  | 4K\t|   15K\t|      23K\n| Same Function Swap Args | 2K\t|   8K\t|      77K\n| Change Constant Type | 2K\t|   6K\t|      12K                   \n  \n\nNon-SStuB statistics:\n\nPattern Name\t| CTSSB-1M |\tTSSB-3M|\tSSB-9M     \n----------------|----------------|----------------|-----------------------\nSingle Statement| 333K |   1.15M      | 3.37M\nSingle Token    | 220K |   740K       | 
2.2M\n\n### JSON fields\nThe released dataset indexes up to 28 million single statement change commits from more than 460K git projects. All dataset entries are stored in a compressed [jsonlines](https://jsonlines.org) format. Because of the size of the dataset, we sharded it into files containing 100,000 commits each. Each entry contains not only the information needed to access the original source code but also information supporting basic analyses. A description of the stored JSON objects is given in the following:\n\n**Commit details:**\n- **project:** Name of the git project where the commit occurred.\n- **project_url:** URL of project containing the commit\n- **commit_sha:** commit SHA of the code change\n- **parent_sha:** commit SHA of the parent commit\n- **file_path:** File path of the changed source file\n- **diff:** Universal diff describing the change made during the commit\n- **before:** Python statement before commit\n- **after:** Python statement after commit (addresses the same line)\n\n**Commit analysis:**\n- **likely_bug:** `true` if the commit message indicates that the commit is a bug fix. This is heuristically determined.\n- **comodified:** `true` if the commit modifies more than one statement in a single file (formatting and comments are ignored).\n- **in_function:** `true` if the changed statement appears inside a Python function\n- **sstub_pattern:** the name of the single statement change pattern the commit can be classified as (if any). Default: `SINGLE_STMT`\n- **edit_script:** A sequence of AST operations that transform the code before the commit into the code after the commit (includes `Insert`, `Update`, `Move` and `Delete` operations).\n\n\n## Mining Process\nTo mine software repositories for millions of single\nstatement bugs, we developed multiple scripts for mining and filtering the datasets. 
We describe them in the following in the order in which they should be executed:\n\n`run_batch_crawler.py`: A script to mine a batch of Git repositories. The crawler will sequentially check out each repository and then search the Git history for single line edits.\n```bash\n$ python run_batch_crawler.py [--compress] [index_file] [output_dir]\n```\nThe index file should be a file with a list of Git repository URLs. The output dir is the directory where mining results are saved. Optionally, the script can save results into compressed files to save disk space.\n\n`convert_to_jsonl_gz.py`: Can be skipped if only one batch crawler was used. This script can be employed to collect all files produced by the batch crawler and save them in a single directory containing compressed jsonl files.\n\n`run_deduplication.py`: Removes duplicate entries from the dataset (based on project name, commit hash and file difference).\n\n`run_slc_process.py`: Filter a given collection\nof single line edits for single line changes (without any other code modifications). In addition, it identifies potential SStuB patterns\nand computes the edit script.\n\n`rm_parse_errors.py`: Remove all entries where the diff could not be parsed.\n\n`rm_nostmt.py`: Remove all entries that are not single\nstatement changes.\n\nAfter running `rm_nostmt.py`, we have performed all steps necessary to create **SSC-28M**.\n\n`rm_nobug.py`: Remove all entries which are not likely\nrelated to a bug. Bug fixes are identified heuristically by checking the commit message for certain keywords. The strategy has been proven to be highly precise.\n\nAfter running `rm_nobug.py`, we have performed all steps necessary to create **SSB-9M**.\n\n`rm_comodified.py`: Remove all entries that belong\nto commits that modify more than one statement. Bug fixes are often tangled with non-fixing code changes. 
To avoid mining the tangled changes, we remove all bug fixes that modify more than one statement.\n\nAfter running `rm_comodified.py`, we have performed all steps necessary to create **TSSB-3M**.\n\nThe initial mining process (`run_batch_crawler.py`) used repository URLs extracted from Libraries.io 1.6 and was performed on a cluster for two weeks. After mining, the remaining steps were performed on a single machine.\n\nIn addition to the scripts necessary for mining our datasets, we provide scripts for analyzing the generated datasets:\n\n`stats.py`: Collects statistics over the dataset. Statistics include the number of commits, the number of projects, the SStuB pattern distribution, and the distribution of central AST edit operations.\n\n`compute_edit_patterns.py`: For each bug fix, transform\nthe AST edit script into an edit pattern. For example, inserting a binary operator into an assignment is translated to `Insert(binary_op, assign)`.\n\n`compute_pattern_distance.py`: For each pattern, compute the smallest Jaccard distance to a bug fix classified as a SStuB.\n\n`typo_identification.py`: Computes the percentage\nof bug fixing commits that can likely be attributed to typos. Code changes are considered typo fixes whenever the Damerau-Levenshtein distance between bug and fix is at most 2.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedricrupb%2FTSSB3M","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcedricrupb%2FTSSB3M","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedricrupb%2FTSSB3M/lists"}