https://github.com/rg089/scaning
[CIKM'23] Code and data for our paper 'James ate 5 oranges = Steve bought 5 pencils': Structure-Aware Denoising for Paraphrasing Word Problems
- Host: GitHub
- URL: https://github.com/rg089/scaning
- Owner: rg089
- Created: 2022-06-24T18:52:35.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-08-19T06:17:58.000Z (over 1 year ago)
- Last Synced: 2023-08-19T07:32:08.511Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 4.27 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SCANING
The source code and data for '*James ate 5 oranges = Steve bought 5 pencils*: Structure-Aware Denoising for Paraphrasing Word Problems', accepted as a full paper at CIKM'23.
## Instructions to run the code:
### Setup:
- Install the requirements
```
$ pip install -r requirements.txt
```
- Add the semantic similarity model in the code folder, and download `glove.6B.300d.txt` into `code/data`.

### NOTE:
For all the commands below, replace `${X=Z}` with the appropriate arguments. Here, `X` represents the meaning of the argument, and `Z` (wherever specified) represents the default value used in the experiments.

### 1. Generating the noised training set:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python corrupt.py \
    -i ${TRAIN_DATA} \
    -o ${OUTPUT_DATA} \
    -n ${NUM_CORRUPTIONS=2} \
    -p ${PASSIVE_FRAC=0.3} \
    -ic ${INPUT_COL=question} \
    -pe ${PRINT_EVERY=100} \
    -se ${SAVE_EVERY=1000} \
    -pln ${PRESERVE_LAST_N=4} \
    -t \
    -lp ${LOG_PATH} \
    -cf ${CORRUPTION_FILE=corruptions_v10.json}
```

Do this for both the training and validation sets.
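
The `${X=Z}` convention from the NOTE can also be resolved programmatically, which helps when the same command has to be issued for both the training and validation files. The following is a minimal sketch, not part of the repository: the helper `fill_placeholders`, the `SPLITS` dictionary, and every file path in it are hypothetical names chosen for illustration. The flags are copied from the command above, and `CUDA_VISIBLE_DEVICES` would still need to be set in the environment.

```python
import re
import shlex

# Matches ${NAME} and ${NAME=default}, the placeholder style used in this README.
PLACEHOLDER = re.compile(r"\$\{(\w+)(?:=([^}]*))?\}")


def fill_placeholders(template, values):
    """Replace each placeholder with a supplied value, else with its default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in values:
            return str(values[name])
        if default is not None:
            return default
        raise KeyError(f"no value supplied for placeholder {name!r}")

    return PLACEHOLDER.sub(substitute, template)


# Flags copied verbatim from the step 1 command.
CORRUPT_TEMPLATE = (
    "python corrupt.py -i ${TRAIN_DATA} -o ${OUTPUT_DATA} "
    "-n ${NUM_CORRUPTIONS=2} -p ${PASSIVE_FRAC=0.3} -ic ${INPUT_COL=question} "
    "-pe ${PRINT_EVERY=100} -se ${SAVE_EVERY=1000} -pln ${PRESERVE_LAST_N=4} "
    "-t -lp ${LOG_PATH} -cf ${CORRUPTION_FILE=corruptions_v10.json}"
)

# Hypothetical paths; substitute the real locations of your data and logs.
SPLITS = {
    "train": {
        "TRAIN_DATA": "data/train.csv",
        "OUTPUT_DATA": "data/train_noised.csv",
        "LOG_PATH": "logs/corrupt_train.log",
    },
    "val": {
        "TRAIN_DATA": "data/val.csv",
        "OUTPUT_DATA": "data/val_noised.csv",
        "LOG_PATH": "logs/corrupt_val.log",
    },
}

# One argument list per split, ready to hand to subprocess.run(argv, check=True).
commands = {
    split: shlex.split(fill_placeholders(CORRUPT_TEMPLATE, values))
    for split, values in SPLITS.items()
}

for split, argv in commands.items():
    print(split, " ".join(argv))
```

Raising `KeyError` for a placeholder that has neither a supplied value nor a default (such as `${LOG_PATH}`) makes a forgotten argument fail early instead of producing an incomplete command.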
### 2. Learning $\Delta$:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python train_generation.py \
    --train_path ${TRAIN_NOISE_DATA} \
    --val_path ${VAL_NOISE_DATA} \
    --model_path ${MODEL=facebook/bart-base} \
    --tokenizer_path ${MODEL=facebook/bart-base} \
    --metric_name rouge \
    --save_path ${SAVE_PATH} \
    --save_model_path ${SAVE_MODEL_PATH} \
    --lr ${LR=8e-5} \
    --batch_size ${BATCH_SIZE=32} \
    --epochs ${EPOCHS=15} \
    --max_input_length ${MAX_LENGTH=256} \
    --max_target_length ${MAX_LENGTH=256} \
    --prefix_col ${PREFIX_COL=prefix} \
    --input_col ${INPUT_COL=corruption} \
    --output_col ${OUTPUT_COL=target}
```

### 3. Generating the inference noised set:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python corrupt.py \
    -i ${TRAIN_DATA} \
    -o ${OUTPUT_DATA} \
    -n ${NUM_CORRUPTIONS=4} \
    -p ${PASSIVE_FRAC=0.3} \
    -ic ${INPUT_COL=question} \
    -pe ${PRINT_EVERY=100} \
    -se ${SAVE_EVERY=1000} \
    -pln ${PRESERVE_LAST_N=4} \
    -lp ${LOG_PATH} \
    -cf ${CORRUPTION_FILE=corruptions_v10.json}

$ python add_prompts_train.py \
    --input ${OUTPUT_DATA} \
    --output ${OUTPUT_DATA} \
    --prefix_col "prefix" \
    --log_col "log" \
    --prompt_col "prompt"
```

### 4. Running $\Delta$ on the inference noise:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python model_generation.py \
    -m ${SAVED_DENOISER_PATH} \
    -i ${OUTPUT_DATA} \
    -o ${OUTPUT_DATA_GENERATED} \
    -b ${BATCH_SIZE=32} \
    -n 3 \
    -ic ${INPUT=corruption} \
    -pc ${PREFIX=prefix} \
    -oc ${OUTPUT=denoiser_generated} \
    -nbg ${NUM_BEAM_GROUPS=3}
```

### 5. Scoring, Filtering and Selecting:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python score.py \
    --sim_model_path ${SIM_MODEL=ParaQD_v3.1} \
    --input_file ${OUTPUT_DATA_GENERATED} \
    --output_file ${OUTPUT_DATA_SCORED} \
    --original_col ${ORIGINAL_COL=original} \
    --candidate_col ${CANDIDATE_COL=denoiser_generated}

$ python selecting.py \
    --input_file ${OUTPUT_DATA_SCORED} \
    --output_file ${PHASE2_TRAIN} \
    --original_col ${ORIGINAL_COL=original} \
    --candidate_col ${CANDIDATE_COL=denoiser_generated} \
    --top_n ${TOP_N=2} \
    --num_corruptions ${NUM_CORR=12} \
    --sim_thresh ${SIM_THRESH=0.9} \
    --bleu_thresh ${BLEU_THRESH=0.2} \
    --wpd_thresh ${WPD_THRESH=0.15} \
    --diversity_thresh ${DIVERSITY_THRESH=0.2} \
    --lam ${LAMBDA=0.65} \
    --prompt_consistency_filtering
```
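
The exact filtering and ranking rules are defined in `selecting.py` and are not described in this README. As a rough intuition for how thresholds such as `--sim_thresh`, `--diversity_thresh`, `--lam`, and `--top_n` might interact, here is a purely illustrative sketch. It assumes, without confirmation from the source, that candidates below either threshold are discarded and that the survivors are ranked by a `lam`-weighted mix of similarity and diversity. The `Candidate` class, the `select_candidates` function, and the scoring formula are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    similarity: float  # assumed: semantic similarity to the original, in [0, 1]
    diversity: float   # assumed: surface difference from the original, in [0, 1]


def select_candidates(candidates, sim_thresh=0.9, diversity_thresh=0.2,
                      lam=0.65, top_n=2):
    """Hypothetical filter-then-rank step; not the logic of selecting.py."""
    # Filter: drop candidates that fall below either threshold.
    kept = [
        c for c in candidates
        if c.similarity >= sim_thresh and c.diversity >= diversity_thresh
    ]
    # Rank: weighted mix of the two scores (assumed formula), best first.
    kept.sort(
        key=lambda c: lam * c.similarity + (1 - lam) * c.diversity,
        reverse=True,
    )
    return kept[:top_n]


if __name__ == "__main__":
    pool = [
        Candidate("a", similarity=0.95, diversity=0.50),
        Candidate("b", similarity=0.92, diversity=0.10),  # too close to the original
        Candidate("c", similarity=0.85, diversity=0.90),  # meaning drifted too far
        Candidate("d", similarity=0.91, diversity=0.80),
        Candidate("e", similarity=0.99, diversity=0.25),
    ]
    for chosen in select_candidates(pool):
        print(chosen.text)
```

The default values mirror the ones listed in the command above; the BLEU and WPD thresholds are left out because their direction (minimum or maximum) is not stated here.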
### 6. Generating the validation set for $\Psi$:
- Running steps 3, 4, and 5 on the training file generates the training set for $\Psi$.
- Run steps 3, 4, and 5 again on the original validation file to generate the validation set for $\Psi$.

### 7. Training $\Psi$:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python train_generation.py \
    --train_path ${PHASE2_TRAIN} \
    --val_path ${PHASE2_VAL} \
    --model_path ${MODEL=facebook/bart-base} \
    --tokenizer_path ${MODEL=facebook/bart-base} \
    --metric_name rouge \
    --save_path ${SAVE_PATH} \
    --save_model_path ${SAVE_MODEL_PATH} \
    --lr ${LR=8e-5} \
    --batch_size ${TRAIN_BATCH_SIZE=32} \
    --epochs ${EPOCHS=15} \
    --max_input_length ${MAX_LENGTH=256} \
    --max_target_length ${MAX_LENGTH=256} \
    --prefix_col ${PREFIX=prompt} \
    --input_col ${ORIGINAL_COL=original} \
    --output_col ${OUTPUT_COL=paraphrase}
```

### 8. Evaluating $\Delta$:
Repeat 3 and 4 on the test set. Then, run:
```
$ CUDA_VISIBLE_DEVICES=${GPU_ID} python score.py \
    --sim_model_path ${SIM_MODEL=ParaQD_v3.1} \
    --input_file ${CORRUPTED_GENERATED} \
    --output_file ${CORRUPTED_SCORES} \
    --original_col ${ORIGINAL_COL=original} \
    --candidate_col ${CANDIDATE_COL=denoiser_generated} \
    --final

$ python get_results.py \
    --input_file ${CORRUPTED_SCORES} \
    --output_file ${RESULTS} \
    --method_name "denoiser" \
    --add
```

### 9. Evaluating $\Psi$:
On the test file, run the following commands:
```
$ python add_prompts_test_selected.py \
    --input_file ${TEST_FILE} \
    --prompt_file ${PROMPT_FILE=corruption/helper/test_prompts.json} \
    --output_file ${PHASE2_PROMPTS} \
    --use_prompts

$ CUDA_VISIBLE_DEVICES=${GPU_ID} python model_generation.py \
    -m models/reconstruction_bart_v${VERSION}_phase2_model \
    -i ${PHASE2_PROMPTS} \
    -o ${PHASE2_GENERATED} \
    -b ${BATCH_SIZE} \
    -n ${NUM=3} \
    -ic ${INPUT=question} \
    -pc ${PREFIX=prompt} \
    -oc ${OUTPUT=paraphraser_output} \
    -nbg 3

$ CUDA_VISIBLE_DEVICES=${GPU_ID} python score.py \
    --sim_model_path ${SIM_MODEL=ParaQD_v3.1} \
    --input_file ${PHASE2_GENERATED} \
    --output_file ${PHASE2_SCORES} \
    --original_col ${ORIGINAL=question} \
    --candidate_col ${CANDIDATE=paraphraser_output} \
    --final

$ python get_results.py \
    --input_file ${PHASE2_SCORES} \
    --output_file ${RESULTS} \
    --method_name "paraphraser" \
    --add
```

Thanks!