https://github.com/amazon-science/efficient-longdoc-classification
- Host: GitHub
- URL: https://github.com/amazon-science/efficient-longdoc-classification
- Owner: amazon-science
- License: apache-2.0
- Created: 2022-07-14T21:50:05.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-08-01T14:12:03.000Z (almost 4 years ago)
- Last Synced: 2025-05-03T11:35:53.406Z (about 1 year ago)
- Language: Python
- Size: 20.5 KB
- Stars: 44
- Watchers: 2
- Forks: 10
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
## Source code for "Efficient Classification of Long Documents Using Transformers"
Please refer to our paper for more details, and cite it if you find this repo useful:
```
@inproceedings{park-etal-2022-efficient,
title = "Efficient Classification of Long Documents Using Transformers",
author = "Park, Hyunji and
Vyas, Yogarshi and
Shah, Kashif",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-short.79",
doi = "10.18653/v1/2022.acl-short.79",
pages = "702--709",
}
```
## Instructions
### 1. Install required libraries
```
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
### 2. Prepare the datasets
#### Hyperpartisan News Detection
* Available at
* Download the datasets
```
mkdir -p data/hyperpartisan
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/articles-training-byarticle-20181122.zip
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/ground-truth-training-byarticle-20181122.zip
unzip data/hyperpartisan/articles-training-byarticle-20181122.zip -d data/hyperpartisan
unzip data/hyperpartisan/ground-truth-training-byarticle-20181122.zip -d data/hyperpartisan
rm data/hyperpartisan/*zip
```
* Prepare the datasets from the resulting XML files with this preprocessing script (following [Longformer](https://arxiv.org/abs/2004.05150)):
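
The preprocessing script referenced above is not reproduced here. As an independent, minimal sketch of what such a step has to do, the snippet below pairs each article with its label by matching `id` attributes across the two XML files. The element and attribute names (`article`, `id`, `hyperpartisan`) are assumptions about the downloaded files, and `join_articles_with_labels` is a hypothetical helper, not part of this repo.

```python
import xml.etree.ElementTree as ET


def join_articles_with_labels(articles_xml: str, truth_xml: str) -> list[dict]:
    """Pair each article's plain text with its label, matched on the id attribute."""
    # Assumed ground-truth layout: <article id="..." hyperpartisan="true|false" .../>
    labels = {
        node.get("id"): node.get("hyperpartisan") == "true"
        for node in ET.fromstring(truth_xml).iter("article")
    }
    examples = []
    for node in ET.fromstring(articles_xml).iter("article"):
        # itertext() flattens nested markup such as <p> and <a>;
        # split/join then collapses runs of whitespace into single spaces.
        text = " ".join("".join(node.itertext()).split())
        examples.append(
            {"id": node.get("id"), "text": text, "label": labels[node.get("id")]}
        )
    return examples
```

To try it on the real data, read the two unzipped XML files under `data/hyperpartisan/` into strings and pass them in; check the exact file names after unzipping rather than relying on this sketch.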
#### 20NewsGroups
* Originally available at
* Running `train.py` with the `--data 20news` flag will download and prepare the data available via `sklearn.datasets` (following [CogLTX](https://proceedings.neurips.cc/paper/2020/file/96671501524948bc3937b4b30d0e57b9-Paper.pdf)).
We adopt the train/dev/test split from [this ToBERT paper](https://ieeexplore.ieee.org/document/9003958).
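
For orientation, here is a rough sketch of loading the data through `sklearn.datasets.fetch_20newsgroups` and holding out a dev set. It does not reproduce the ToBERT split used by `train.py`: the `dev_fraction` and `seed` values, and both helper names, are illustrative assumptions.

```python
import random


def split_train_dev(texts, labels, dev_fraction=0.1, seed=0):
    """Shuffle (text, label) pairs deterministically and hold out a dev portion."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)
    n_dev = int(len(pairs) * dev_fraction)
    return pairs[n_dev:], pairs[:n_dev]


def load_20news(dev_fraction=0.1, seed=0):
    """Fetch 20 Newsgroups through scikit-learn and return train/dev/test pairs."""
    # Imported lazily so the splitting helper above works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train")  # downloads on first use
    test = fetch_20newsgroups(subset="test")
    train_pairs, dev_pairs = split_train_dev(
        train.data, train.target, dev_fraction, seed
    )
    return train_pairs, dev_pairs, list(zip(test.data, test.target))
```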
#### EURLEX-57K
* Available at
* Download the datasets
```
mkdir -p data/EURLEX57K
wget -O data/EURLEX57K/datasets.zip http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip
unzip data/EURLEX57K/datasets.zip -d data/EURLEX57K
rm data/EURLEX57K/datasets.zip
rm -rf data/EURLEX57K/__MACOSX
mv data/EURLEX57K/dataset/* data/EURLEX57K
rm -rf data/EURLEX57K/dataset
wget -O data/EURLEX57K/EURLEX57K.json http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/eurovoc_en.json
```
* Running `train.py` with the `--data eurlex` flag reads and prepares the data from `data/EURLEX57K/{train, dev, test}/*.json` files
* Running `train.py` with the `--data eurlex --inverted` flag creates Inverted EURLEX data by inverting the order of the sections
* `data/EURLEX57K/EURLEX57K.json` contains label information.
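
To make the section inversion concrete, the sketch below concatenates a document's sections in their original or reversed order. The field names (`header`, `recitals`, `main_body`, `attachments`) are assumptions about the JSON layout and should be checked against the downloaded files; `document_text` and `load_document` are hypothetical helpers, not the code in `train.py`.

```python
import json

# Assumed field names in each EURLEX57K document; verify against the files.
SECTIONS = ("header", "recitals", "main_body", "attachments")


def document_text(doc: dict, inverted: bool = False) -> str:
    """Concatenate a document's sections, optionally in reverse section order."""
    order = reversed(SECTIONS) if inverted else SECTIONS
    parts = []
    for name in order:
        value = doc.get(name, "")
        # A section may be stored as a list of strings (assumed for main_body).
        parts.append(" ".join(value) if isinstance(value, list) else value)
    # Empty sections are skipped; text inside each section keeps its order.
    return "\n".join(part for part in parts if part)


def load_document(path: str) -> dict:
    """Read one document file from data/EURLEX57K/{train, dev, test}/."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Only the order of the sections changes in this sketch. Whether the repo also reorders text within a section is not stated here.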
#### CMU Book Summary Dataset
* Available at
```
wget -P data/ http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
tar -xf data/booksummaries.tar.gz -C data
```
* Running `train.py` with the `--data books` flag reads and prepares the data from `data/booksummaries/booksummaries.txt`
* Running `train.py` with the `--data books --pairs` flag creates Paired Book Summary by combining pairs of summaries and their labels
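
As an illustration of the pairing idea, the sketch below parses one tab-separated record and merges two records into one example. The column order is an assumption about `booksummaries.txt`, and joining summaries with a newline and taking the union of genre labels is one plausible reading of "combining pairs", not necessarily what `train.py` does. Both helper names are hypothetical.

```python
import json


def parse_book_line(line: str) -> dict:
    """Parse one tab-separated record into title, genre labels, and summary."""
    fields = line.rstrip("\n").split("\t")
    # Assumed columns: wiki id, Freebase id, title, author, date, genres, summary.
    title, genres_json, summary = fields[2], fields[5], fields[6]
    # Genres are assumed to be a JSON object mapping ids to genre names.
    genres = sorted(json.loads(genres_json).values()) if genres_json else []
    return {"title": title, "labels": genres, "summary": summary}


def pair_books(first: dict, second: dict) -> dict:
    """Combine two records: concatenate summaries, take the union of labels."""
    return {
        "summary": first["summary"] + "\n" + second["summary"],
        "labels": sorted(set(first["labels"]) | set(second["labels"])),
    }
```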
### 3. Run the models
For example:
```
python train.py --model_name bertplusrandom --data books --pairs --batch_size 8 --epochs 20 --lr 3e-05
```
Note that we use the source code for the CogLTX model:
### Hyperparameters used
#### Hyperpartisan
| Parameter | BERT | BERT+TextRank | BERT+Random | Longformer | ToBERT |
|------------|-------|---------------|-------------|---------------------------------------------------|--------|
| Batch size | 8 | 8 | 8 | 16 | 8 |
| Epochs | 20 | 20 | 20 | 20 | 20 |
| LR | 3e-05 | 3e-05 | 5e-05 | 5e-05 | 5e-05 |
| Scheduler | NA | NA | NA | [warmup](https://arxiv.org/abs/2004.05150) | NA |
#### 20NewsGroups, Book Summary, Paired Book Summary
| Parameter | BERT | BERT+TextRank | BERT+Random | Longformer | ToBERT |
|------------|-------|---------------|-------------|---------------------------------------------------|--------|
| Batch size | 8 | 8 | 8 | 16 | 8 |
| Epochs | 20 | 20 | 20 | 20 | 20 |
| LR | 3e-05 | 3e-05 | 3e-05 | 0.005 | 3e-05 |
| Scheduler | NA | NA | NA | [warmup](https://arxiv.org/abs/2004.05150) | NA |
#### EURLEX, Inverted EURLEX
| Parameter | BERT | BERT+TextRank | BERT+Random | Longformer | ToBERT |
|------------|-------|---------------|-------------|---------------------------------------------------|--------|
| Batch size | 8 | 8 | 8 | 16 | 8 |
| Epochs | 20 | 20 | 20 | 20 | 20 |
| LR | 5e-05 | 5e-05 | 5e-05 | 0.005 | 5e-05 |
| Scheduler | NA | NA | NA | [warmup](https://arxiv.org/abs/2004.05150) | NA |
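
The tables above can be turned into `train.py` invocations programmatically. The sketch below transcribes the batch sizes, epochs, and learning rates and builds an argument list from them. The group keys (`hyperpartisan`, `20news_books`, `eurlex`) and `build_command` are hypothetical names. Only flags that appear in this README are used, and the caller must supply `--model_name` and `--data` values, since `bertplusrandom` is the only model name shown here.

```python
COLUMNS = ("BERT", "BERT+TextRank", "BERT+Random", "Longformer", "ToBERT")
BATCH_SIZES = (8, 8, 8, 16, 8)  # identical in all three tables
EPOCHS = 20  # identical for every model and dataset

# Learning rates per table, in column order; keys are illustrative group names.
LEARNING_RATES = {
    "hyperpartisan": ("3e-05", "3e-05", "5e-05", "5e-05", "5e-05"),
    "20news_books": ("3e-05", "3e-05", "3e-05", "0.005", "3e-05"),
    "eurlex": ("5e-05", "5e-05", "5e-05", "0.005", "5e-05"),
}


def build_command(model_name, data, group, column, extra_flags=()):
    """Build the argv list for train.py from the transcribed tables."""
    index = COLUMNS.index(column)
    return [
        "python", "train.py",
        "--model_name", model_name,
        "--data", data,
        *extra_flags,  # e.g. ("--pairs",) or ("--inverted",)
        "--batch_size", str(BATCH_SIZES[index]),
        "--epochs", str(EPOCHS),
        "--lr", LEARNING_RATES[group][index],
    ]
```

Calling `build_command("bertplusrandom", "books", "20news_books", "BERT+Random", ("--pairs",))` yields the arguments of the example command in "Run the models", and the resulting list can be passed to `subprocess.run`. The warmup scheduler listed for Longformer is not covered by this sketch.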