{"id":31235687,"url":"https://github.com/gabeorlanski/stackoverflow-encourages-cheating","last_synced_at":"2025-09-22T14:59:10.654Z","repository":{"id":38372812,"uuid":"360158147","full_name":"gabeorlanski/stackoverflow-encourages-cheating","owner":"gabeorlanski","description":"Code for the NLP4Prog workshop paper \"Reading StackOverflow Encourages Cheating: Adding Question TextImproves Extractive Code Generation\"","archived":false,"fork":false,"pushed_at":"2021-08-10T21:23:35.000Z","size":3882,"stargazers_count":21,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-22T14:59:06.247Z","etag":null,"topics":["acl2021","machine-learning","natural-language-processing","nlp","nlp4prog","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gabeorlanski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-21T12:16:17.000Z","updated_at":"2024-07-29T06:15:52.000Z","dependencies_parsed_at":"2022-08-25T02:11:44.864Z","dependency_job_id":null,"html_url":"https://github.com/gabeorlanski/stackoverflow-encourages-cheating","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gabeorlanski/stackoverflow-encourages-cheating","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeorlanski%2Fstackoverflow-encourages-cheating","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeorlanski%2Fstackoverflow-encourages-cheating/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/gabeorlanski%2Fstackoverflow-encourages-cheating/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeorlanski%2Fstackoverflow-encourages-cheating/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gabeorlanski","download_url":"https://codeload.github.com/gabeorlanski/stackoverflow-encourages-cheating/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gabeorlanski%2Fstackoverflow-encourages-cheating/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276422183,"owners_count":25639631,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-22T02:00:08.972Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acl2021","machine-learning","natural-language-processing","nlp","nlp4prog","python"],"created_at":"2025-09-22T14:59:05.028Z","updated_at":"2025-09-22T14:59:10.641Z","avatar_url":"https://github.com/gabeorlanski.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This is the repository for the\npaper [Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation](https://arxiv.org/abs/2106.04447).\n\n![Our Approach](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/approach_figure.PNG)\n\n![Labeled 
Example](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/labeled_example.PNG)\n\n## Acknowledgements\n\nWe would like to thank Frank F. Xu and Pengcheng Yin for their helpful discussions and for sharing\ntheir code. Some code has come from the [TranX](https://github.com/pcyin/tranx)\nand [External Knowledge Codegen](https://github.com/neulab/external-knowledge-codegen) repositories.\n\nWe would also like to acknowledge the work that inspired this one:\n\n[TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation](https://www.aclweb.org/anthology/D18-2002/)\nby Pengcheng Yin and Graham Neubig\n\n[Incorporating External Knowledge through Pre-training for Natural Language to Code Generation](https://www.aclweb.org/anthology/2020.acl-main.538/)\nby Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig\n\n\n## TL;DR For Replication\n\nRun the Google Colab notebook\nfound at [Notebook Link](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/BART_CG_Experiments.ipynb) [![Open Replication In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gabeorlanski/stackoverflow-encourages-cheating/blob/main/BART_CG_Experiments.ipynb)\nfor our best-performing model.\n\nWe also provide all of the generated samples from our test with the\ninputs [here](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/generated.txt).\n\nNote: It will take 1-2 (maybe 3) hours to train and run on Google Colab.\n\n## For working outside of colab\n\nYou need Python 3.8. I would recommend using a virtual environment.\n\n1. Install the requirements\n   from [`requirements.txt`](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/requirements.txt)\n\n```shell script\npip install -r requirements.txt\n```\n\n2. 
To run the model, run\n   the [`experiment.py`](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/experiment.py)\n   script. You can use `python experiment.py -h` or the documentation in the file to understand the\n   different options. To use our best model, run\n\n```shell script\npython experiment.py best \"facebook/bart-base\" bartBase -combine-mined\n```\n\n3. Then, in the `scratch` directory, you will find the results in a JSON file.\n\n## The data\n\n### Prepared Dataset:\n\n[Here](https://www.dropbox.com/s/xv3zcutli07w37w/base_dataset.zip?dl=0) is the dataset we used.\n\n[This dataset](https://www.dropbox.com/s/glioprd0aly4381/cleaned_so_dataset.rar?dl=0) is the _cleaned_ data using the process we describe further down. **NOTE**: For the time being, this only includes 10,000 mined examples. It will be updated to include all cleaned mined examples.\n\nYou can find a sample schema for this\ndata [here](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/base_dataset_sample.json).\n\nFor the `body` key, there are unclosed html tags in the text. *Eventually* these will be taken out.\nBut for now, the easy but bad solution is to use the regex `\u003c\\w+\u003e`. The good solution is to use\nthe [html tags file](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/html_tags.txt)\nto remove them. Note, you must surround the tag text with `\u003c \u003e`.\n\n### Parsed StackOverflow Data:\n\n[Link to the parsed StackOverflow Questions](https://www.dropbox.com/s/glioprd0aly4381/cleaned_so_dataset.rar?dl=0)\n\nFor actually working with this data:\n\n1. 
The JSON file has the structure:\n\n```json\n{\n    \"question_id\": {\n        \"question_id\": \"str\",\n        \"tags\": \"List[str]\",\n        \"title\": \"str\",\n        \"accepted_answer_id\": \"int or null\",\n        \"score\": \"int\",\n        \"body\": \"str\",\n        \"code_slots\": \"Ignore this, it is useless\",\n        \"answers\": {\n            \"answer_id\": {\n                \"score\": \"int\",\n                \"body\": \"str\",\n                \"code_slots\": \"Ignore\"\n            }\n        }\n    }\n}\n``` \n\n2. For the `body` key, there are unclosed html tags in the text. *Eventually* these will be taken\n   out. But for now, the easy but bad solution is to use the regex `\u003c\\w+\u003e`. The good solution is to\n   use\n   the [html tags file](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/html_tags.txt)\n   to remove them. Note, you must surround the tag text with `\u003c \u003e`.\n\n3. Finally, you must match the question ids from CoNaLa to the SO data.\n\n## References\n\nIf you use this dataset you MUST cite the [original CoNaLa paper](https://conala-corpus.github.io/) as well:\n\n```\n@misc{orlanski2021reading,\n      title={Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation}, \n      author={Gabriel Orlanski and Alex Gittens},\n      year={2021},\n      eprint={2106.04447},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n@inproceedings{yin2018mining,\n  author = {Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},\n  title = {Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow},\n  booktitle = {International Conference on Mining Software Repositories},\n  series = {MSR},\n  pages = {476--486},\n  year = {2018},\n  publisher = {ACM},\n  doi = 
{https://doi.org/10.1145/3196398.3196408},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabeorlanski%2Fstackoverflow-encourages-cheating","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgabeorlanski%2Fstackoverflow-encourages-cheating","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgabeorlanski%2Fstackoverflow-encourages-cheating/lists"}