Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/szhang42/Uneven_training_data
https://github.com/szhang42/Uneven_training_data
Last synced: 7 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/szhang42/Uneven_training_data
- Owner: szhang42
- License: apache-2.0
- Created: 2021-09-02T17:19:14.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-01-03T21:57:07.000Z (almost 3 years ago)
- Last Synced: 2024-07-10T09:52:57.958Z (4 months ago)
- Language: Python
- Size: 3.3 MB
- Stars: 9
- Watchers: 1
- Forks: 3
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Learning with Different Amounts of Annotation: From Zero to Many Labels
This is the official GitHub repository for the following paper:
> [Learning with Different Amounts of Annotation: From Zero to Many Labels.](https://arxiv.org/abs/2109.04408)
> Shujian Zhang, Chengyue Gong, and Eunsol Choi.
> _EMNLP_, 2021.If using, please cite:
```
@article{zhang2021learning,
title={Learning with different amounts of annotation: From zero to many labels},
author={Zhang, Shujian and Gong, Chengyue and Choi, Eunsol},
journal={arXiv preprint arXiv:2109.04408},
year={2021}
}```
### Pre-requisites
This repository is based on the [HuggingFace Transformers](https://github.com/huggingface/transformers) library.
### Setup
Install packages:```
pip install -r requirements.txt
```### Train GLUE-style model
To train a GLUE-style model using this repository:
```
TASK=snli
MODEL_OUTPUT_DIR=./model_output_snli_xxx/
Setting=xxx
Tra=xxx
Ft=xxxpython -m cartography.classification.run_glue \
-c configs/$TASK.jsonnet \
--do_train \
--do_eval \
--do_finetune \
--num_train_epochs $Tra \
--ft_num_train_epochs $Ft \
--label_propagation \
--setting $Setting \
--overwrite_output_dir \
--overwrite_cache \
-o $MODEL_OUTPUT_DIR```
The best configurations for our experiments for each of the `$TASK`s (SNLI and MNLI) are provided under [configs](./configs).### Data Selection
To select (different amounts of) data based on various metrics from training dynamics:
```
python -m cartography.selection.train_dy_filtering \
--filter \
--task_name $TASK \
--model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
--metric $METRIC \
--data_dir $PATH_TO_GLUE_DIR_WITH_ORIGINAL_DATA_IN_TSV_FORMAT
```Supported `$TASK`s include SNLI, QNLI, MNLI and WINOGRANDE, and `$METRIC`s include `confidence`, `variability`, `correctness`, `forgetfulness` and `threshold_closeness`; see [paper](https://arxiv.org/abs/2009.10795) for more details.
To select _hard-to-learn_ instances, set `$METRIC` as "confidence" and for _ambiguous_, set `$METRIC` as "variability". For _easy-to-learn_ instances: set `$METRIC` as "confidence" and use the flag `--worst`.