{"id":19932325,"url":"https://github.com/amazon-science/efficient-longdoc-classification","last_synced_at":"2025-07-21T08:32:04.070Z","repository":{"id":51820699,"uuid":"514057032","full_name":"amazon-science/efficient-longdoc-classification","owner":"amazon-science","description":null,"archived":false,"fork":false,"pushed_at":"2022-08-01T14:12:03.000Z","size":21,"stargazers_count":44,"open_issues_count":2,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-03T11:35:53.406Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-14T21:50:05.000Z","updated_at":"2025-01-19T16:59:41.000Z","dependencies_parsed_at":"2022-08-23T08:01:49.772Z","dependency_job_id":null,"html_url":"https://github.com/amazon-science/efficient-longdoc-classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/amazon-science/efficient-longdoc-classification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fefficient-longdoc-classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fefficient-longdoc-classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fefficient-longdoc-classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fefficient-longdoc-classificat
ion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/efficient-longdoc-classification/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fefficient-longdoc-classification/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266267212,"owners_count":23902314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T23:09:50.134Z","updated_at":"2025-07-21T08:32:04.052Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Source codes for ``Efficient Classification of Long Documents Using Transformers''\n\nPlease refer to our paper for more details and cite our paper if you find this repo useful:\n\n```\n@inproceedings{park-etal-2022-efficient,\n    title = \"Efficient Classification of Long Documents Using Transformers\",\n    author = \"Park, Hyunji  and\n      Vyas, Yogarshi  and\n      Shah, Kashif\",\n    booktitle = \"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)\",\n    month = may,\n    year = \"2022\",\n    address = \"Dublin, Ireland\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2022.acl-short.79\",\n    doi = \"10.18653/v1/2022.acl-short.79\",\n    pages = \"702--709\",\n}\n```\n\n## Instructions\n\n### 1. 
Install required libraries\n\n```\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n\n### 2. Prepare the datasets\n\n#### Hyperpartisan News Detection\n\n* Available at \u003chttps://zenodo.org/record/1489920#.YLferh1Olc8\u003e\n* Download the datasets\n\n```\nmkdir -p data/hyperpartisan\nwget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/articles-training-byarticle-20181122.zip\nwget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/ground-truth-training-byarticle-20181122.zip\nunzip data/hyperpartisan/articles-training-byarticle-20181122.zip -d data/hyperpartisan\nunzip data/hyperpartisan/ground-truth-training-byarticle-20181122.zip -d data/hyperpartisan\nrm data/hyperpartisan/*.zip\n```\n\n* Prepare the datasets with the resulting XML files and this preprocessing script (following [Longformer](https://arxiv.org/abs/2004.05150)): \u003chttps://github.com/allenai/longformer/blob/master/scripts/hp_preprocess.py\u003e\n\n#### 20NewsGroups\n\n* Originally available at \u003chttp://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz\u003e\n* Running `train.py` with the `--data 20news` flag will download and prepare the data available via `sklearn.datasets` (following [CogLTX](https://proceedings.neurips.cc/paper/2020/file/96671501524948bc3937b4b30d0e57b9-Paper.pdf)).\nWe adopt the train/dev/test split from [this ToBERT paper](https://ieeexplore.ieee.org/document/9003958).\n\n#### EURLEX-57K\n\n* Available at \u003chttps://github.com/iliaschalkidis/lmtc-emnlp2020\u003e\n* Download the datasets\n\n```\nmkdir -p data/EURLEX57K\nwget -O data/EURLEX57K/datasets.zip http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip\nunzip data/EURLEX57K/datasets.zip -d data/EURLEX57K\nrm data/EURLEX57K/datasets.zip\nrm -rf data/EURLEX57K/__MACOSX\nmv data/EURLEX57K/dataset/* data/EURLEX57K\nrm -rf data/EURLEX57K/dataset\nwget -O data/EURLEX57K/EURLEX57K.json 
http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/eurovoc_en.json\n```\n\n* Running `train.py` with the `--data eurlex` flag reads and prepares the data from `data/EURLEX57K/{train, dev, test}/*.json` files\n* Running `train.py` with the `--data eurlex --inverted` flags creates the Inverted EURLEX data by inverting the order of the sections\n* `data/EURLEX57K/EURLEX57K.json` contains label information.\n\n#### CMU Book Summary Dataset\n\n* Available at \u003chttp://www.cs.cmu.edu/~dbamman/booksummaries.html\u003e\n\n```\nwget -P data/ http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz\ntar -xf data/booksummaries.tar.gz -C data\n```\n\n* Running `train.py` with the `--data books` flag reads and prepares the data from `data/booksummaries/booksummaries.txt`\n* Running `train.py` with the `--data books --pairs` flags creates the Paired Book Summary data by combining pairs of summaries and their labels\n\n\n### 3. Run the models\n\nFor example:\n\n```\npython train.py --model_name bertplusrandom --data books --pairs --batch_size 8 --epochs 20 --lr 3e-05\n```\n\n
Note that we use the source code for the CogLTX model: \u003chttps://github.com/Sleepychord/CogLTX\u003e\n\n### Hyperparameters used\n\n#### Hyperpartisan\n\n| Parameter  | BERT  | BERT+TextRank | BERT+Random | Longformer                                        | ToBERT |\n|------------|-------|---------------|-------------|---------------------------------------------------|--------|\n| Batch size | 8     | 8             | 8           | 16                                                | 8      |\n| Epochs     | 20    | 20            | 20          | 20                                                | 20     |\n| LR         | 3e-05 | 3e-05         | 5e-05       | 5e-05                                             | 5e-05  |\n| Scheduler  | NA    | NA            | NA          | [warmup](https://arxiv.org/abs/2004.05150)  | NA     |\n\n#### 20NewsGroups, Book Summary, Paired Book Summary\n\n| Parameter  | BERT  | BERT+TextRank | BERT+Random | Longformer                                        | ToBERT |\n|------------|-------|---------------|-------------|---------------------------------------------------|--------|\n| Batch size | 8     | 8             | 8           | 16                                                | 8      |\n| Epochs     | 20    | 20            | 20          | 20                                                | 20     |\n| LR         | 3e-05 | 3e-05         | 3e-05       | 0.005                                             | 3e-05  |\n| Scheduler  | NA    | NA            | NA          | [warmup](https://arxiv.org/abs/2004.05150)  | NA     |\n\n#### EURLEX, Inverted EURLEX\n\n| Parameter  | BERT  | BERT+TextRank | BERT+Random | Longformer                                        | ToBERT |\n|------------|-------|---------------|-------------|---------------------------------------------------|--------|\n| Batch size | 8     | 8             | 8           | 16                                                | 8      |\n| Epochs     | 20    | 20            
| 20          | 20                                                | 20     |\n| LR         | 5e-05 | 5e-05         | 5e-05       | 0.005                                             | 5e-05  |\n| Scheduler  | NA    | NA            | NA          | [warmup](https://arxiv.org/abs/2004.05150)        | NA     |\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fefficient-longdoc-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fefficient-longdoc-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fefficient-longdoc-classification/lists"}