https://github.com/zphang/lrqa
https://github.com/zphang/lrqa
Last synced: 12 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/zphang/lrqa
- Owner: zphang
- Created: 2021-05-24T23:07:02.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2023-03-14T06:06:23.000Z (about 3 years ago)
- Last Synced: 2025-04-02T10:51:23.266Z (about 1 year ago)
- Language: Python
- Size: 51.8 KB
- Stars: 7
- Watchers: 3
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# QuALITY training code (LRQA)
This LRQA repository contains code for running baselines on the QuALITY data.
## Update
- [Mar 2020]: Updated preprocessing scripts for QuALITY v1.0.1
## Overview
This repository contains code to support experiments on long-document QA models. Currently it just works as a multiple-choice experiment repository.
### Tasks
LRQA supports either existing multiple-choice datasets (e.g. Cosmos QA, HellaSwag), or custom task datasets for quick iteration. In all cases, the goal is to map different datasets to a standard format of:
```
{
"context": ...
"query": ...
"option_0" ...
..
"option_N" ...
"label": ...
}
```
Where the data will be formatted into inputs as:
```
[CLS] context [SEP] query option
```
The strings are concatenated with *no spaces*. Hence, it is recommended to prepend spaces to option and query values. (Formatting may differ based on different tokenizers.)
Converters to this standard format are available for a set of multiple-choice tasks (see: `lrqa/tasks.py`):
- Cosmos QA
For custom tasks, you can provide a path to a folder containing files such as
```
config.json
train.jsonl
validation.jsonl
test.jsonl
```
The individual phases are optional. Each phase is contained in a single `jsonl` file, with the keys as specified above. `config.json` specifies some configurations for the task, such as the number of choices. See `lrqa.tasks.CustomJSONLTask` for more details. See `./resources/example_jsonl_task` for an example.
### Models
Models can be broken down into 2 categories:
**Encoder-based** models are based on `transformers.AutoModelForMultipleChoice`. All auto-modals should be compatible. These need to be fine-tuned, though tuned models can be applied across tasks.
Note: Hugging Face's implementation runs `.forward` on all options at once, which can increases ram consumption. Adjust batch sizes and gradient accumulation steps accordingly.
**Generation-based** models are based on `transformers.AutoModelForCausalLM`. All auto-modals should be compatible. These can be run zero-shot, as the LM heads can be used directly to score continuations.
Encoder-Decoder support is coming soon!
## Usage setup
It is recommended to add the path to this repository to your `PYTHONPATH`, e.g.
```bash
export PYTHONPATH=/path/to/lrqa/:${PYTHONPATH}
```
Install requirements as necessary from `requirements.txt`. It is also recommended though not necessary to run code from within this folder.
## Sample usage
#### Training + Evaluating an Encoder
```bash
EXPDIR=/path/to/experiment
python lrqa/run_lrqa.py \
--model_name_or_path roberta-base \
--model_mode mc \
--task_name cosmosqa \
--output_dir ${EXPDIR} \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--evaluation_strategy steps \
--eval_steps 500 \
--save_strategy no \
--load_best_model_at_end \
--num_train_epochs 1
```
#### Evaluating a Generator
```bash
EXPDIR=/path/to/experiment
python lrqa/run_lrqa.py \
--model_name_or_path gpt2 \
--model_mode generation \
--task_name cosmosqa \
--output_dir ${EXPDIR} \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--do_eval \
--evaluation_strategy steps \
--eval_steps 500 \
--save_strategy no \
--load_best_model_at_end \
--num_train_epochs 1
```
#### Evaluating a Generator on a custom task
```bash
EXPDIR=/path/to/experiment
python lrqa/run_lrqa.py \
--model_name_or_path gpt2 \
--model_mode generation \
--task_name custom \
--task_base_path ./resources/example_jsonl_task \
--output_dir ${EXPDIR} \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--do_eval \
--evaluation_strategy steps \
--eval_steps 500 \
--save_strategy no \
--load_best_model_at_end \
--num_train_epochs 1
```
## Extractive Preprocessing
Use the HTML-cleaned version of the QuALITY data. You can run the preprocessing for generating the extractive baseline inputs as follows:
```bash
python lrqa/scripts/extraction.py \
--input_base_path /path/to/input/data \
--output_base_path /path/to/output/data \
--scorer rouge \
--query_type question
```
This will prepare the data in a format that is suitable for the above `run_lrqa` scripts.
To use fastText, you will first need to download fastText embeddings from [here](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip), and then run the following script:
```bash
python lrqa/scripts/save_fasttext_embeddings.py \
--fasttext_data_path /path/to/crawl-300d-2M.vec \
--output_path /path/to/fasttext_embeddings.p
```
## Applications
### Applying to BBQ
Here is some sample code for running evaluation on [BBQ](https://arxiv.org/abs/2110.08193).
First, we preprocess (just writing to standardized JSONL format) and fine-tune on RACE.
```bash
EXP_FOL=...
HF_MODEL_NAME=roberta-base
BATCH_SIZE=8
GRAD_ACCUM_STEPS=4
python lrqa/scripts/race_preproc.py \
--data_path ${EXP_FOL}/race
python lrqa/run_lrqa.py \
--model_name_or_path ${HF_MODEL_NAME} \
--model_mode mc \
--max_seq_length 512 \
--task_name custom \
--task_base_path ${EXP_FOL}/race \
--output_dir ${EXP_FOL}/race_run \
--learning_rate 1e-5 \
--num_train_epochs 3 \
--warmup_ratio 0.1 \
--eval_steps 1000 \
--save_steps 1000 \
--save_total_limit 5 \
--save_strategy steps \
--evaluation_strategy steps \
--load_best_model_at_end \
--per_device_train_batch_size ${BATCH_SIZE} \
--per_device_eval_batch_size ${BATCH_SIZE} \
--gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
--do_train --do_eval --do_predict --predict_phases validation
```
Next we preprocess all the BBQ data (cloned from https://github.com/nyu-mll/BBQ), and run evaluation on each of the categories using fine-tuned models.
```bash
BBQ_DATA=... # clone BBQ repo and point to `data` folder
python lrqa/scripts/bbq_preproc.py \
--input_data_path=${BBQ_DATA} \
--data_path ${EXP_FOL}/bbq
for CATEGORY in Age Disability_status Gender_identity Nationality Physical_appearance Race_ethnicity Race_x_SES Race_x_gender Religion SES Sexual_orientation; do
python lrqa/run_lrqa.py \
--model_name_or_path ${EXP_FOL}/race_run/checkpoint-last \
--model_mode mc \
--max_seq_length 512 \
--task_name custom \
--task_base_path ${EXP_FOL}/bbq/${CATEGORY}/ \
--output_dir ${EXP_FOL}/bbq_runs/${CATEGORY}/ \
--per_device_eval_batch_size ${BATCH_SIZE} \
--do_eval --do_predict --predict_phases validation
done
```
The predictions should be stored in a PyTorch pickle at `${EXP_FOL}/bbq_runs/${CATEGORY}/validation_predictions.p`.