beam_learn_supertagging
=======================

Code for "An Empirical Investigation of Beam-Aware Training in Supertagging" (EMNLP Findings 2020): https://github.com/negrinho/beam_learn_supertagging

This repo contains the code to reproduce the results reported in the paper
[An Empirical Investigation of Beam-Aware Training in Supertagging](https://arxiv.org/abs/2010.04980) to appear in EMNLP Findings 2020.
This work explores how different choices for the meta-algorithm of [Negrinho et al. (2018)](https://arxiv.org/abs/1811.00512), which appeared in NeurIPS 2018, affect performance in a sequence labelling task (namely, supertagging on [CCGBank](https://catalog.ldc.upenn.edu/LDC2005T13)).
The goal of this work was to identify conditions under which beam-aware training algorithms soundly beat non-beam-aware methods (e.g., the default approach of training with maximum likelihood and decoding with beam search).
We found several such conditions, e.g., a simulated online setting in which the model does not have access to the complete sentence when tagging and therefore must manage prediction uncertainty effectively.
It is in these cases that we observe the largest performance gaps relative to models that are not trained in a beam-aware manner and are therefore bound to make unrecoverable mistakes due to their greediness.
By training in a beam-aware manner, the model learns to use the beam to keep uncertain alternatives around until additional information becomes available to resolve the uncertainty.

Quickstart
----------

First, create a Conda environment to work on the project:
```
conda create --name beam_learn python=2.7
conda activate beam_learn
python -m pip install dynet==2.1
conda install psutil matplotlib paramiko
```
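
A quick sanity check (not part of the repo, just a suggestion) is to confirm that DyNet loads and can evaluate a tiny computation graph in this environment:
```
# Minimal sanity check (not part of the repo): confirm that DyNet imports
# and can evaluate a small expression in this environment.
import dynet as dy

dy.renew_cg()
x = dy.inputVector([1.0, -1.0])
y = dy.tanh(x)
print("DyNet OK: {}".format(y.npvalue()))
```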

`main.py` is the main file and contains the implementations of the algorithms.
It is run with a JSON configuration file.
The command below runs training for a specific configuration file (the `--train` flag; see `main.py` for other options, such as `--compute_vanilla_beam_accuracy` and `--compute_beam_accuracy`, which run vanilla beam search on a model trained with maximum likelihood and beam search on a model trained in a beam-aware manner, respectively).
```
python -u main.py --dynet-mem 4000 --dynet-autobatch 1 --train --config_filepath PATH_TO_CONFIGURATION_FILE
```

The training data must first be processed into the format expected by the code.
First, download [CCGBank](https://catalog.ldc.upenn.edu/LDC2005T13) from LDC (this requires access to LDC corpora, which your university might have a subscription for).
After downloading the files, uncompress them into the folder `data/ccgbank_1_1`.
Once they are in place, `main_preprocessing.py` can be run to generate the data files needed by the training code (i.e., `data/supertagging/train.jsonl`, `data/supertagging/dev.jsonl`, and `data/supertagging/test.jsonl`).
Because licensing restrictions prevent us from distributing the processed supertagging data, see [here](https://drive.google.com/file/d/1JoOEXbfYU8in5vJLsDZ4TvGpt8k3n31C/view?usp=sharing) for CONLL-2003 processed into this format as an example of what the resulting files should look like.

While the code was developed for supertagging, it should be easy to adapt to any sequence labelling task where the input and output sequences have the same length.
The easiest way of accomplishing this is to process the data into the JSON Lines (jsonl) format used for the supertagging task.
We have included data processing scripts for CONLL-2000, CONLL-2003, and PTB in `dev/main_preprocessing.py`.
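
As a rough illustration of what such a conversion could produce, the sketch below writes aligned token/tag sequences as one JSON object per line; the field names used here are placeholders, so check the processed CONLL-2003 files linked above for the schema the code actually expects.
```
# Hypothetical conversion sketch: one JSON object per line with aligned
# token/tag sequences. The field names ("tokens", "tags") are assumptions;
# check the processed CONLL-2003 files linked above for the actual schema.
import json

def write_jsonl(sentences, out_path):
    """sentences: list of (tokens, tags) pairs of equal-length sequences."""
    with open(out_path, "w") as f:
        for tokens, tags in sentences:
            assert len(tokens) == len(tags)
            f.write(json.dumps({"tokens": tokens, "tags": tags}) + "\n")

write_jsonl(
    [(["He", "reads", "books"], ["NP", "(S\\NP)/NP", "NP"])],
    "example.jsonl")
```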

After the data is in place, the only remaining step before running `main.py` is to generate the JSON configuration files used for the experiments in the paper.
These configuration files will live in the `configs` folder.
The configuration files for the experiments in the paper are derived from a base configuration file `configs/cfgref.json` to reduce the amount of repetition and to make clear what aspects are being tested.
The contents of that file are:

```
{
    "model_type": "vaswani",
    "w_emb_dim": 64,
    "t_emb_dim": 64,
    "pos_emb_dim": 16,
    "use_postags": 1,
    "bilstm_h_dim": 256,
    "lm_h_dim": 256,
    "num_epochs": 16,
    "step_size_schedule_type": "cosine",
    "step_size_start": 0.1,
    "step_size_end": 1e-5,
    "weight_decay": 0.0,
    "use_beam_bilstm": 0,
    "use_beam_mlp": 0,
    "accumulate_scores": 1,
    "update_only_on_cost_increase": 0,
    "print_every_num_examples": 8192,
    "data_type": "supertagging",
    "use_pretrained_embeddings": 0,
    "loss_type": "log_neighbors",
    "compute_train_acc": 1,
    "debug": 0,
    "num_debug": 1024,
    "optimizer_type": "sgd",
    "out_folder": "out/cfgref",
    "beam_size": 1,
    "traj_type": "continue"
}
```

This is the only config file that has been provided under version control in the repo.
The other config files are derived from this one through overlays.
These can be generated by running `main_experiments.py`.
For example, `configs/cfg3000.json`, which is one of these generated files, is as follows:
```
{
    "loss_type": "log_neighbors",
    "data_type": "supertagging",
    "_overlays_": [
        "configs/cfgref.json"
    ],
    "out_folder": "out/cfg3000",
    "beam_size": 1,
    "traj_type": "continue",
    "model_type": "vaswani"
}
```
The overlays are specified through the list under the `_overlays_` key, which contains a single entry in this case.
Multiple repeats of the same configuration are achieved with additional config files that overlay this config and change only `out_folder`, i.e., the folder in which the results of running the configuration are stored.
For example, `configs/cfg_r0_3000.json` is the first repetition of `configs/cfg3000.json`:
```
{
    "out_folder": "out/cfg_r0_3000",
    "_overlays_": [
        "configs/cfg3000.json"
    ]
}
```
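Resolving an overlaid config amounts to loading the configs listed under `_overlays_` first and then letting the child config override any keys it redefines. The actual logic lives in the repo's code; the snippet below is only a minimal sketch of that idea:
```
# Minimal sketch of overlay resolution (assumption: keys in the child config
# take precedence over keys inherited from the configs in "_overlays_").
import json

def load_config(path):
    with open(path) as f:
        cfg = json.load(f)
    merged = {}
    # Apply the overlays first (in order), then the child's own keys on top.
    for overlay_path in cfg.pop("_overlays_", []):
        merged.update(load_config(overlay_path))
    merged.update(cfg)
    return merged

# e.g., load_config("configs/cfg_r0_3000.json") yields cfgref.json overridden
# by cfg3000.json, with out_folder set to "out/cfg_r0_3000".
```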
For the results in the paper, we used three repetitions.
The results of a training experiment are written to the `out_folder` specified in the corresponding config, as JSON files containing various metrics for each epoch (e.g., `secs_per_epoch`, `train_acc`, and `dev_acc`).
The relevant files to check in this case are `checkpoint.json` (regenerated at the end of each epoch) and `results.json` (created at the end of training).
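
For a quick look at a finished run, these files can be inspected directly. The snippet below is only a sketch and assumes `results.json` stores per-epoch lists for metrics such as `dev_acc`; the exact layout may differ:
```
# Sketch for inspecting a finished run; assumes results.json stores per-epoch
# lists for metrics such as "train_acc" and "dev_acc" (layout may differ).
import json

with open("out/cfgref/results.json") as f:
    results = json.load(f)

print("train_acc per epoch: {}".format(results.get("train_acc")))
print("dev_acc per epoch: {}".format(results.get("dev_acc")))
```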

`utils.sh` helps offload the computation for these configs to a remote server.
In our workflow, we used a SLURM-managed cluster (namely, [Bridges](https://www.psc.edu/bridges)).
Using this code on Bridges with your own account, or on another SLURM-managed cluster, should be a matter of changing the credentials in the file.
These utilities work best with an SSH key, which removes the need to enter a password on each connection to the server.

Finally, once results for all configs of the form `configs/cfg_r*_*.json` are available, the results reported in the paper can be generated by running `main_results.py`.
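As an illustration of the kind of aggregation `main_results.py` performs (this is a sketch rather than the script itself, and again assumes a per-epoch `dev_acc` list in each `results.json`), the best dev accuracy of each repetition of a configuration can be collected and averaged as follows:
```
# Sketch (not main_results.py itself): average the best dev accuracy across
# the repetitions of one configuration, assuming each run wrote a results.json
# with a per-epoch "dev_acc" list.
import glob
import json

accs = []
for path in sorted(glob.glob("out/cfg_r*_3000/results.json")):
    with open(path) as f:
        accs.append(max(json.load(f)["dev_acc"]))

if accs:
    print("best dev_acc per repeat: {}".format(accs))
    print("mean over repeats: {:.4f}".format(sum(accs) / len(accs)))
```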

In summary, the steps to replicate the results in the paper are:
- Create the Conda environment with the required packages.
- Download [CCGBank](https://catalog.ldc.upenn.edu/LDC2005T13) from LDC, uncompress it, and place it in `data/ccgbank_1_1`.
- Process raw CCGBank data by running `main_preprocessing.py`, which will create new files in a `data/supertagging` folder.
- Generate the JSON configuration files by running `main_experiments.py`, which will create new files in the `configs` folder.
- Run desired configuration files as described above, which will place the results in `out/$NAME_OF_CONFIG`.
- After all the desired experiments are finished, the results of the paper can be computed by running `main_results.py`, which assumes that the relevant log files are in the `out` folder.

Both the configs and the results can be generated by running the code as described above. For reference, the configs (generated by `main_experiments.py`) are also available [here](https://drive.google.com/file/d/1evju00TaDsINF3CSK9DJthGMKFmvZ5L4/view?usp=sharing), and the results (generated by `main_results.py` after running all the configs with `main.py`) are available [here](https://drive.google.com/file/d/19f2V2On30UlbvmjnSHaJkgEaKOYs-sHh/view?usp=sharing).

Citing this work
-----------------

If you use this code or build on the results of this paper, please consider citing:
```
@inproceedings{negrinho2020empirical,
  title={An Empirical Investigation of Beam-Aware Training in Supertagging},
  author={Negrinho, Renato and Gormley, Matthew and Gordon, Geoffrey},
  booktitle={EMNLP Findings},
  year={2020}
}

@inproceedings{negrinho2018learning,
  title={Learning beam search policies via imitation learning},
  author={Negrinho, Renato and Gormley, Matthew and Gordon, Geoffrey},
  booktitle={Advances in Neural Information Processing Systems},
  year={2018}
}
```

Acknowledgements
----------------

We gratefully acknowledge support from 3M | M*Modal.
This work used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).