{"id":13716213,"url":"https://github.com/google-research/plur","last_synced_at":"2025-05-07T05:32:25.729Z","repository":{"id":42446380,"uuid":"434355502","full_name":"google-research/plur","owner":"google-research","description":"PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.","archived":true,"fork":false,"pushed_at":"2022-04-05T18:50:44.000Z","size":187,"stargazers_count":88,"open_issues_count":8,"forks_count":17,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-11-14T04:34:39.353Z","etag":null,"topics":["deep-learning","machine-learning","program-synthesis","research","software-engineering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-02T19:52:31.000Z","updated_at":"2024-03-25T05:19:16.000Z","dependencies_parsed_at":"2022-08-12T10:00:40.031Z","dependency_job_id":null,"html_url":"https://github.com/google-research/plur","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fplur","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fplur/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fplur/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fplur/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/plur/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252823183,"owners_count":21809702,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","program-synthesis","research","software-engineering"],"created_at":"2024-08-03T00:01:08.146Z","updated_at":"2025-05-07T05:32:25.333Z","avatar_url":"https://github.com/google-research.png","language":"Python","readme":"# PLUR\n\nPLUR (Programming-Language Understanding and Repair) is a collection of\nsource code datasets suitable for graph-based machine learning. We provide\nscripts for downloading, processing, and loading the datasets. 
This is done\nby offering a unified API and data structures for all datasets.\n\n\n## Installation\n\n```bash\nSRC_DIR=${PWD}/src\nmkdir -p ${SRC_DIR} \u0026\u0026 cd ${SRC_DIR}\n# For Cubert.\ngit clone https://github.com/google-research/google-research --depth=1\nexport PYTHONPATH=${PYTHONPATH}:${SRC_DIR}/google-research\ngit clone https://github.com/google-research/plur \u0026\u0026 cd plur\npython -m pip install -r requirements.txt\npython setup.py install\n```\n\n**Test execution on small dataset**\n\n```bash\ncd plur\npython3 plur_data_generation.py --dataset_name=manysstubs4j_dataset \\\n  --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 \\\n  --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2 \\\n  --train_data_percentage=40 \\\n  --validation_data_percentage=30 \\\n  --test_data_percentage=30\n```\n\n## Usage\n\n### Basic usage\n\n#### Data generation (step 1)\n\nData generation is done by calling `plur.plur_data_generation.create_dataset()`.\nThe data generation runs in two stages:\n\n1. Convert raw data to `plur.utils.GraphToOutputExample`.\n2. Convert `plur.utils.GraphToOutputExample` to `TFExample`.\n\nStage 1 is unique for each dataset, but stage 2 is the same for almost all datasets.\n\n```python\nfrom plur.plur_data_generation import create_dataset\n\ndataset_name = 'manysstubs4j_dataset'\ndataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'\nstage_1_kwargs = dict()\ndataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'\nstage_2_kwargs = dict()\ncreate_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)\n```\n\n`plur_data_generation.py` also provides a command line interface, but it offers less flexibility.\n\n```bash\npython3 plur_data_generation.py --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2\n```\n\n#### Data loader (step 2)\n\nAfter the data is generated, you can use `PlurDataLoader` to load the data. 
The data loader loads `TFExample`s but returns them as numpy arrays.\n\n```python\nfrom plur.plur_data_loader import PlurDataLoader\nfrom plur.utils import constants\n\ndataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'\nsplit = constants.TRAIN_SPLIT_NAME\nbatch_size = 32\nrepeat_count = -1\ndrop_remainder = True\ntrain_data_generator = PlurDataLoader(dataset_stage_2_directory, split, batch_size, repeat_count, drop_remainder)\n\nfor batch_data in train_data_generator:\n  # your training loop...\n```\n\n#### Training (step 3)\n\nThis is where users of the PLUR framework plug in their custom ML models and\ncode to train and generate predictions for PLUR tasks.\n\nWe provide implementations of the `GGNN`, `Transformer`, and `GREAT` models from the PLUR\npaper. See below for sample commands. For the full set of command line FLAGS,\nsee `plur/model_design/train.py`.\n\n\n*Training*\n\n```bash\npython3 train.py \\\n --data_dir=/tmp/manysstubs4j_dataset/stage_2 \\\n --exp_dir=/tmp/experiments/exp12345\n```\n\n*Evaluation / Generating predictions*\n\n```bash\npython3 train.py \\\n --data_dir=/tmp/manysstubs4j_dataset/stage_2 \\\n --exp_dir=/tmp/experiments/exp12345 \\\n --evaluate=true\n```\n\n\n#### Evaluating (step 4)\n\nOnce training is finished and you have generated natural text predictions on the test data, you can use `plur_evaluator.py` to evaluate the performance. 
`plur_evaluator.py` works in offline mode, meaning that it expects one or more files containing the ground truths and matching files containing the predictions.\n\n```bash\npython3 plur_evaluator.py --dataset_name=manysstubs4j_dataset --target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt --prediction_file_pattern=/tmp/manysstubs4j_dataset/predictions.txt\n```\n\nWhen using multiple evaluation \"rounds\", the evaluator may create multiple targets and predictions files, formatted as `...predictions-0-of-5.txt`; you can refer to all of these combined using a glob file pattern such as `...predictions-?-of-5.txt` in the command above.\n\nFor more details about how `plur_evaluator` works, see [`plur/eval/README.md`](./eval/README.md).\n\n\n### Transforming and filtering data\n\nIf there is something fundamental you want to change in the dataset, apply it in stage 1 of data generation; otherwise, apply it in stage 2. The idea is that stage 1 should only be run once per dataset (to create the `plur.utils.GraphToOutputExample`), and stage 2 should be run each time you want to train on different data (to create the TFRecords).\n\nAll transformation and filtering functions are applied to `plur.utils.GraphToOutputExample`; see `plur.utils.GraphToOutputExample` for more information.\n\nFor example, a transformation that can be run in stage 1: your model expects the graphs in the dataset to have no loops, so you write a transformation function that removes loops. This ensures that stage 2 reads data where the graphs have no loops.\n\nFor example, a filter that can be run in stage 2: you want to check your model's performance on different graph sizes, in terms of number of nodes. 
You write your own filter function to filter out graphs with a large number of nodes.\n\n```python\nfrom plur.plur_data_generation import create_dataset\n\ndataset_name = 'manysstubs4j_dataset'\ndataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'\nstage_1_kwargs = dict()\ndataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'\ndef _filter_graph_size(graph_to_output_example, graph_size=1024):\n  return len(graph_to_output_example.get_nodes()) \u003c= graph_size\nstage_2_kwargs = dict(\n    train_filter_funcs=(_filter_graph_size,),\n    validation_filter_funcs=(_filter_graph_size,)\n)\ncreate_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)\n```\n\n### Advanced usage\n\n`plur.plur_data_generation.create_dataset()` is just a thin wrapper around `plur.stage_1.plur_dataset` and `plur.stage_2.graph_to_output_example_to_tfexample`.\n\n```python\nfrom plur.plur_data_generation import create_dataset\n\ndataset_name = 'manysstubs4j_dataset'\ndataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'\nstage_1_kwargs = dict()\ndataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'\nstage_2_kwargs = dict()\ncreate_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)\n```\n\nis equivalent to\n\n```python\nfrom plur.stage_1.manysstubs4j_dataset import ManySStubs4JDataset\nfrom plur.stage_2.graph_to_output_example_to_tfexample import GraphToOutputExampleToTfexample\n\ndataset_name = 'manysstubs4j_dataset'\ndataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'\ndataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'\ndataset = ManySStubs4JDataset(dataset_stage_1_directory)\ndataset.stage_1_mkdirs()\ndataset.download_dataset()\ndataset.run_pipeline()\n\ndataset = GraphToOutputExampleToTfexample(dataset_stage_1_directory, dataset_stage_2_directory, 
dataset_name)\ndataset.stage_2_mkdirs()\ndataset.run_pipeline()\n```\n\nYou can check out `plur.stage_1.manysstubs4j_dataset` for dataset-specific arguments.\n```python\nfrom plur.stage_1.manysstubs4j_dataset import ManySStubs4JDataset\n\ndataset_name = 'manysstubs4j_dataset'\ndataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'\n\ndataset = ManySStubs4JDataset(dataset_stage_1_directory, dataset_size='large')\ndataset.stage_1_mkdirs()\ndataset.download_dataset()\ndataset.run_pipeline()\n```\n\n## Adding a new dataset\n\nAll datasets should inherit `plur.stage_1.plur_dataset.PlurDataset` and be placed under `plur/stage_1/`. This requires you to implement:\n\n* `download_dataset()`: Code to download the dataset. We provide `download_dataset_using_git()` to download from git and `download_dataset_using_requests()` to download from a URL, which also works with a Google Drive URL. In `download_dataset_using_git()` we download the dataset from a specific commit id. In `download_dataset_using_requests()` we check the sha1sum of the downloaded files. This ensures that the same version of PLUR downloads the same raw data.\n* `get_all_raw_data_paths()`: It should return a list of paths, where each path points to a file containing raw data for the dataset.\n* `raw_data_paths_to_raw_data_do_fn()`: It should return a `beam.DoFn` class that overrides `process()`. The `process()` method should tell Beam how to open the files returned by `get_all_raw_data_paths()`. 
This is also where we define which split (train/validation/test), if any, the data belongs to.\n* `raw_data_to_graph_to_output_example()`: This function transforms the raw data from `raw_data_paths_to_raw_data_do_fn()` into `GraphToOutputExample`.\n\nThen add/change the following lines in `plur/plur_data_generation.py`:\n\n```python\nfrom plur.stage_1.foo_dataset import FooDataset\n\nflags.DEFINE_enum(\n    'dataset_name',\n    'dummy_dataset',\n    (\n        'code2seq_dataset',\n        'convattn_dataset',\n        'dummy_dataset',\n        # [...]\n        'retrieve_and_edit_dataset',\n        'foo_dataset',\n    ),\n    'Name of the dataset to generate data.')\n\n# [...]\ndef get_dataset_class(dataset_name):\n  \"\"\"Get the dataset class based on dataset_name.\"\"\"\n  if dataset_name == 'code2seq_dataset':\n    return Code2SeqDataset\n  elif dataset_name == 'convattn_dataset':\n    return ConvAttnDataset\n  elif dataset_name == 'dummy_dataset':\n    return DummyDataset\n  # [...]\n  elif dataset_name == 'retrieve_and_edit_dataset':\n    return RetrieveAndEditDataset\n  elif dataset_name == 'foo_dataset':\n    return FooDataset\n  else:\n    raise ValueError('{} is not supported.'.format(dataset_name))\n```\n\n## Evaluation details\n\nThe details of how evaluation is performed are in [`plur/eval/README.md`](./eval/README.md).\n\n## License\n\nLicensed under the Apache 2.0 License.\n\n## Disclaimer\n\nThis is not an officially supported Google product.\n\n## Citation\n\nPlease cite the PLUR paper, Chen et al. 
https://proceedings.neurips.cc//paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html\n","funding_links":[],"categories":["Data Sets and Benchmarks","Python"],"sub_categories":["Programming-Language Understanding and Repair"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fplur","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fplur","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fplur/lists"}