{"id":18401705,"url":"https://github.com/borealisai/keyphrase-generation","last_synced_at":"2026-02-25T23:01:54.510Z","repository":{"id":96864038,"uuid":"301744407","full_name":"BorealisAI/keyphrase-generation","owner":"BorealisAI","description":"PyTorch code of “Diverse Keyphrase Generation with Neural Unlikelihood Training” (COLING 2020)","archived":false,"fork":false,"pushed_at":"2020-10-16T04:10:12.000Z","size":52,"stargazers_count":5,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-07-13T15:49:07.897Z","etag":null,"topics":["generative-model","keyphrase-generation","machine-learning","nlp","pytorch","seq2seq"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BorealisAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-06T13:55:44.000Z","updated_at":"2024-04-17T10:02:53.000Z","dependencies_parsed_at":"2023-07-17T07:16:28.428Z","dependency_job_id":null,"html_url":"https://github.com/BorealisAI/keyphrase-generation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BorealisAI/keyphrase-generation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BorealisAI%2Fkeyphrase-generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BorealisAI%2Fkeyphrase-generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BorealisAI%2Fkeyphrase-generation/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BorealisAI%2Fkeyphrase-generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BorealisAI","download_url":"https://codeload.github.com/BorealisAI/keyphrase-generation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BorealisAI%2Fkeyphrase-generation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29844845,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-25T22:37:40.667Z","status":"ssl_error","status_checked_at":"2026-02-25T22:37:25.960Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generative-model","keyphrase-generation","machine-learning","nlp","pytorch","seq2seq"],"created_at":"2024-11-06T02:39:42.053Z","updated_at":"2026-02-25T23:01:54.506Z","avatar_url":"https://github.com/BorealisAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Diverse Keyphrase Generation with Neural Unlikelihood Training \n\n![](https://img.shields.io/badge/python-3.7-brightgreen.svg) ![](https://img.shields.io/badge/torch-1.3.1-orange.svg)\n\nThis is the official codebase for the following paper, implemented in PyTorch:\n\nHareesh Bahuleyan and Layla El Asri. **Diverse Keyphrase Generation with Neural Unlikelihood Training.** COLING  2020. 
https://arxiv.org/pdf/2010.07665.pdf\n\n## Setup Instructions\n\n1. Create and activate a Python 3.7.5 virtual environment using `conda`:\n    ```\n    conda create --name keygen python=3.7.5\n    source activate keygen\n    ```\n\n2. Install the necessary packages using pip:\n    ```\n    pip install -r requirements.txt\n\n    # Download spacy model\n    python -m spacy download en_core_web_sm\n    ```\n\n3. Sent2Vec Installation\nSent2Vec is used in the evaluation script. \nPlease install sent2vec from https://github.com/epfml/sent2vec, using the steps below:\n\n    - Clone the repository: `git clone https://github.com/epfml/sent2vec`\n    - Go to the sent2vec directory: `cd sent2vec/`\n    - `git checkout f827d014a473aa22b2fef28d9e29211d50808d48`\n    - Run `make`\n    - Run `pip install cython`\n    - Inside the src folder: `cd src/`\n        - `python setup.py build_ext`\n        - `pip install .`\n    - Download a [pre-trained sent2vec model](https://github.com/epfml/sent2vec#downloading-sent2vec-pre-trained-models). For example, we used `sent2vec_wiki_unigrams`. Finally, copy it to `data/sent2vec/wiki_unigrams.bin`\n\n4. 
Data Download \nDownload the pre-processed data files in JSON format from [this link](https://drive.google.com/drive/folders/1OZrLwW0_M5J-zUSYFZz2qxXNnayonox8?usp=sharing).\nUnzip the file and copy it to `data/`\n\n    The data folder should now have the following structure:\n    ```\n    data/\n    ├── kp20k_sorted/\n    ├── KPTimes/\n    │   └── kptimes_sorted/\n    ├── sample_testset/\n    ├── sent2vec/\n    │   └── wiki_unigrams.bin\n    └── stackexchange/\n        └── se_sorted/\n    ```\n\n## Training Instructions\n\nTo train a DivKGen model using one of the configurations provided under `configurations/`: \n\n```\n# Specify the dataset\nexport DATASET=kp20k\n\n# Specify the configuration name\nexport EXP=copy_seq2seq_attn_mle_greedy.tgt_15.0.copy_18.0\n\n# Run training script\nallennlp train configurations/$DATASET/$EXP.jsonnet -s output/$DATASET/$EXP/ -f --include-package keyphrase_generation -o '{ \"trainer\": {\"cuda_device\": 0} }'\n\n```\nThe outputs (training logs, model checkpoints, tensorboard logs) will be stored under: `output/$DATASET/$EXP`\n\n__Notes__:\n1. If your loss collapses to NaN during training, this could be due to numerical underflow. To fix this, edit the function `masked_log_softmax()` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/nn/utils.py` and change the line `vector = vector + (mask + 1e-45).log()` to `vector = vector + (mask + 1e-35).log()`.\n2. Similarly, replace all instances of `1e-45` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py` with `1e-35`.\n3. 
During validation after every epoch, if it throws a type mismatch error (`RuntimeError: \"argmax_cuda\" not implemented for 'Bool'`), this can be fixed with an explicit type cast: change the line `matches = (expanded_source_token_ids == expanded_target_token_ids)` to `matches = (expanded_source_token_ids == expanded_target_token_ids).int()` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py`\n\n## Evaluation Instructions\nFinally, the evaluation script can be run as follows:\n1. In `run_eval.sh`, set the `HOME_PATH` variable. This corresponds to the `absolute/path/to/keyphrase-generation/folder`\n2. Set the datasets. For instance, if we set both `EVALSET` and `DATASET` to `kp20k`, then we use the best model trained on `kp20k` to evaluate on `kp20k`. Keeping the two variables separate is useful when you would like to evaluate a model trained on Dataset A on Dataset B. \n3. Next, `bash run_eval.sh` will print the quality and diversity results and also save them to `output/$DATASET/$EXP`\n\n_Note_: In the paper, we present EditDist as a diversity evaluation metric, for which we initially used a different fuzzy string matcher. 
However, this codebase uses an alternative library, [rapidfuzz](https://github.com/maxbachmann/rapidfuzz), which offers similar functionality.\n\n## Citation\nIf you found this code useful in your research, please cite:\n```\n@inproceedings{divKeyGen2020,\n  title={Diverse Keyphrase Generation with Neural Unlikelihood Training},\n  author={Bahuleyan, Hareesh and El Asri, Layla},\n  booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},\n  year={2020}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fborealisai%2Fkeyphrase-generation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fborealisai%2Fkeyphrase-generation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fborealisai%2Fkeyphrase-generation/lists"}