https://github.com/gabeorlanski/stackoverflow-encourages-cheating
Code for the NLP4Prog workshop paper "Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation"
- Host: GitHub
- URL: https://github.com/gabeorlanski/stackoverflow-encourages-cheating
- Owner: gabeorlanski
- License: apache-2.0
- Created: 2021-04-21T12:16:17.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2021-08-10T21:23:35.000Z (almost 5 years ago)
- Last Synced: 2025-09-22T14:59:06.247Z (9 months ago)
- Topics: acl2021, machine-learning, natural-language-processing, nlp, nlp4prog, python
- Language: Python
- Homepage:
- Size: 3.7 MB
- Stars: 21
- Watchers: 2
- Forks: 2
- Open Issues: 0
This is the repository for the paper
[Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation](https://arxiv.org/abs/2106.04447).


## Acknowledgements
We would like to thank Frank F. Xu and Pengcheng Yin for their helpful discussions and for sharing
their code. Some code has come from the [TranX](https://github.com/pcyin/tranx)
and [External Knowledge Codegen](https://github.com/neulab/external-knowledge-codegen) repositories.
We would also like to acknowledge the work that inspired this one:
- [TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation](https://www.aclweb.org/anthology/D18-2002/)
  by Pengcheng Yin and Graham Neubig
- [Incorporating External Knowledge through Pre-training for Natural Language to Code Generation](https://www.aclweb.org/anthology/2020.acl-main.538/)
  by Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig
## TL;DR For Replication
Run the Google Colab notebook
([notebook link](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/BART_CG_Experiments.ipynb), [open in Colab](https://colab.research.google.com/github/gabeorlanski/stackoverflow-encourages-cheating/blob/main/BART_CG_Experiments.ipynb))
for our best-performing model.
We also provide all of the generated samples from our test, together with their
inputs, [here](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/generated.txt).
Note: It takes 1-2 (maybe 3) hours to train and run on Google Colab.
## For working outside of Colab
You need Python 3.8. I would recommend using a virtual environment.
1. Install the requirements
from [`requirements.txt`](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/requirements.txt)
```shell script
pip install -r requirements.txt
```
2. To run the model, run
the [`experiment.py`](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/experiment.py)
script. You can use `python experiment.py -h` or the documentation in the file to understand the
different options. But to use our best model, run
```shell script
python experiment.py best "facebook/bart-base" bartBase -combine-mined
```
3. When the run finishes, you will find the results in a JSON file in the `scratch` directory.
## The data
### Prepared Dataset:
[Here](https://www.dropbox.com/s/xv3zcutli07w37w/base_dataset.zip?dl=0) is the dataset that we used.
[This dataset](https://www.dropbox.com/s/glioprd0aly4381/cleaned_so_dataset.rar?dl=0) is the _cleaned_ data, produced by the process we describe further down. **NOTE** For the time being this only includes 10,000 mined examples. It will be updated to include all cleaned mined examples.
You can find a sample schema for this
data [here](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/base_dataset_sample.json).
For the `body` key, there are unclosed HTML tags in the text. *Eventually* these will be taken out.
But for now, the easy but bad solution is to strip them with the regex `<\w+>`, which also removes any
other text that happens to sit between angle brackets. The good solution is to use
the [html tags file](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/html_tags.txt)
to remove them. Note that you must surround the tag text with `< >`.
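As a rough illustration of the difference between the two approaches, here is a minimal sketch. It assumes the tags file holds one bare tag name per line (for example `p` or `code`); the function names and the short in-line tag list are hypothetical stand-ins, not code from this repository.

```python
import re
from typing import Iterable


def build_tag_pattern(tag_names: Iterable[str]) -> re.Pattern:
    """Compile a regex that matches `<name>` for each known tag name."""
    # Assumption: entries are bare names without angle brackets.
    names = sorted({name.strip() for name in tag_names if name.strip()})
    if not names:
        raise ValueError("tag list is empty")
    alternation = "|".join(re.escape(name) for name in names)
    # Surround the tag text with `< >`, as the note above requires.
    return re.compile(rf"<(?:{alternation})>", flags=re.IGNORECASE)


def strip_known_tags(body: str, pattern: re.Pattern) -> str:
    """Remove only the tags that appear in the known-tag list."""
    return pattern.sub("", body)


# Hypothetical tag list; in practice it would be read from data/html_tags.txt.
tags = ["p", "code", "pre", "a"]
pattern = build_tag_pattern(tags)

body = "<p>Declare a <code>vector<int> v;"
cleaned = strip_known_tags(body, pattern)
naive = re.sub(r"<\w+>", "", body)

print(cleaned)  # Declare a vector<int> v;
print(naive)    # Declare a vector v;
```

The naive regex also deletes `<int>`, which is part of the code rather than markup, while the tag-list version leaves it alone.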
### Parsed StackOverflow Data:
[Link to the parsed StackOverflow Questions](https://www.dropbox.com/s/glioprd0aly4381/cleaned_so_dataset.rar?dl=0)
For actually working with this data:
1. The JSON file has the structure:
```json
{
  "question_id": {
    "question_id": "str",
    "tags": "List[str]",
    "title": "str",
    "accepted_answer_id": "int or null",
    "score": "int",
    "body": "str",
    "code_slots": "Ignore this, it is useless",
    "answers": {
      "answer_id": {
        "score": "int",
        "body": "str",
        "code_slots": "Ignore"
      }
    }
  }
}
```
2. For the `body` key, there are unclosed HTML tags in the text. *Eventually* these will be taken
   out. But for now, the easy but bad solution is to strip them with the regex `<\w+>`, which also
   removes any other text that happens to sit between angle brackets. The good solution is to use
   the [html tags file](https://github.com/gabeorlanski/stackoverflow-encourages-cheating/blob/main/data/html_tags.txt)
   to remove them. Note that you must surround the tag text with `< >`.
3. Finally, you must match the question ids from CoNaLa to the SO data.
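The matching in step 3 could look roughly like the sketch below. It assumes each CoNaLa example carries an integer `question_id` field and that the top-level keys of the parsed StackOverflow JSON are those same ids stored as strings. The `attach_questions` helper and the sample records are hypothetical and only mirror the structure shown above.

```python
import json
from typing import Any, Dict, List


def attach_questions(
    conala_examples: List[Dict[str, Any]],
    so_questions: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each CoNaLa example with its StackOverflow question, if one exists."""
    matched = []
    for example in conala_examples:
        # Assumption: the SO data is keyed by the question id as a string.
        key = str(example["question_id"])
        question = so_questions.get(key)
        if question is None:
            continue  # no parsed question for this example
        matched.append(
            {**example, "title": question["title"], "body": question["body"]}
        )
    return matched


# Made-up sample in the shape of the parsed StackOverflow JSON.
so_questions = json.loads("""
{
  "123": {
    "question_id": "123",
    "tags": ["python", "list"],
    "title": "How do I reverse a list?",
    "accepted_answer_id": 456,
    "score": 10,
    "body": "I have a list and want it reversed.",
    "code_slots": [],
    "answers": {"456": {"score": 12, "body": "Use reversed().", "code_slots": []}}
  }
}
""")

# Made-up CoNaLa-style examples.
conala_examples = [
    {"question_id": 123, "intent": "reverse a list", "snippet": "x[::-1]"},
    {"question_id": 999, "intent": "unmatched example", "snippet": "pass"},
]

matched = attach_questions(conala_examples, so_questions)
print(len(matched), matched[0]["title"])
```

Examples without a parsed question are skipped here. Whether to drop them or keep them with an empty body is a design choice this sketch does not settle.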
## References
If you use this dataset you MUST cite the [original CoNaLa paper](https://conala-corpus.github.io/) as well:
```
@misc{orlanski2021reading,
  title={Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation},
  author={Gabriel Orlanski and Alex Gittens},
  year={2021},
  eprint={2106.04447},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@inproceedings{yin2018mining,
  author = {Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
  title = {Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow},
  booktitle = {International Conference on Mining Software Repositories},
  series = {MSR},
  pages = {476--486},
  year = {2018},
  publisher = {ACM},
  doi = {https://doi.org/10.1145/3196398.3196408},
}
```