{"id":21888611,"url":"https://github.com/swisscom/ai-research-keyphrase-extraction","last_synced_at":"2025-04-05T10:09:42.648Z","repository":{"id":45609666,"uuid":"99651442","full_name":"swisscom/ai-research-keyphrase-extraction","owner":"swisscom","description":"EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings (official implementation)","archived":false,"fork":false,"pushed_at":"2023-04-07T11:55:38.000Z","size":186,"stargazers_count":435,"open_issues_count":6,"forks_count":88,"subscribers_count":33,"default_branch":"master","last_synced_at":"2025-03-27T04:11:19.265Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/swisscom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-08T04:53:04.000Z","updated_at":"2025-03-05T11:27:56.000Z","dependencies_parsed_at":"2024-11-28T11:16:20.396Z","dependency_job_id":"f1d7ba48-ab1a-4b2f-b412-6646ce0028a7","html_url":"https://github.com/swisscom/ai-research-keyphrase-extraction","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/swisscom%2Fai-research-keyphrase-extraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/swisscom%2Fai-research-keyphrase-extraction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/swisscom%2Fai-research-keyphrase-extraction/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/swisscom%2Fai-research-keyphrase-extraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/swisscom","download_url":"https://codeload.github.com/swisscom/ai-research-keyphrase-extraction/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247318745,"owners_count":20919484,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-28T11:16:09.725Z","updated_at":"2025-04-05T10:09:42.626Z","avatar_url":"https://github.com/swisscom.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This is the implementation of the following paper: https://arxiv.org/abs/1801.04470\n\n# Installation\n\n## Local Installation\n\n1. Download full Stanford CoreNLP Tagger version 3.8.0\nhttp://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip\n\n2. Install sent2vec from \nhttps://github.com/epfml/sent2vec\n    * Clone/Download the directory\n    * go to sent2vec directory\n    * git checkout f827d014a473aa22b2fef28d9e29211d50808d48\n    * make\n    * pip install cython\n    * inside the src folder \n        * ``python setup.py build_ext``\n        * ``pip install . ``\n        * (In OSX) If the setup.py throws an **error** (ignore warnings), open setup.py and add '-stdlib=libc++' in the compile_opts list.        \n    * Download a pre-trained model (see readme of Sent2Vec repo) , for example wiki_bigrams.bin\n     \n3. 
Install requirements\n    \n    After cloning this repository, go to the root directory and run\n    ``pip install -r requirements.txt``\n\n4. Download NLTK data\n```\nimport nltk \nnltk.download('punkt')\n```\n\n5. Launch Stanford Core NLP tagger\n    * Open a new terminal\n    * Go to the stanford-core-nlp-full directory\n    * Run the server `java -mx4g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos -status_port 9000 -port 9000 -timeout 15000 \u0026 `\n\n\n6. Set the paths in config.ini.template\n    * You can leave [STANFORDTAGGER] parameters empty\n    * For [STANFORDCORENLPTAGGER] :\n        * set host to localhost\n        * set port to 9000\n    * For [SENT2VEC]:\n        * set your model_path to the pretrained model\n        your_path_to_model/wiki_bigrams.bin (if you chose wiki_bigrams.bin)\n    * rename config.ini.template to config.ini\n\n## Docker\n\nProbably the easiest way to get started is by using the provided Docker image.\nFrom the project's root directory, the image can be built like so:\n```\n$ docker build . 
-t keyphrase-extraction\n```\nThis can take a few minutes to finish.\nKeep in mind that pre-trained sent2vec models will not be downloaded, since each model is several GB in size.\nAlso remember to allocate enough memory to your Docker container, because models are loaded into RAM.\n\nTo launch the model in interactive mode, so that you can run your own code, run\n```\n$ docker run -v {path to wiki_bigrams.bin}:/sent2vec/pretrained_model.bin -it keyphrase-extraction\n# Run the corenlp server\n/app # cd /stanford-corenlp\n/stanford-corenlp # nohup java -mx4g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos -status_port 9000 -port 9000 -timeout 15000 \u0026\n# Press enter to get stdin back\n/stanford-corenlp # cd /app\n/app # python\n\u003e\u003e\u003e import launch\n```\nYou have to specify the path to your sent2vec model using the `-v` argument.\nIf, for example, you choose not to use the *wiki_bigrams.bin* model, adjust your path accordingly (and, of course, remember to remove the curly brackets).\n\n# Usage\n\nOnce the CoreNLP server is running:\n\n```\nimport launch\n\nembedding_distributor = launch.load_local_embedding_distributor()\npos_tagger = launch.load_local_corenlp_pos_tagger()\n\nkp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en')  # extract 10 keyphrases\nkp2 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text2, 10, 'en')\n...\n```\n\nThis returns, for each text, a tuple containing three lists:\n1) The top N candidates (strings), i.e. the keyphrases\n2) For each keyphrase, the associated relevance score\n3) For each keyphrase, a list of aliases (other candidates very similar to the one selected\nas keyphrase)\n\n# Method\n\nThis is the implementation of the following paper:\nhttps://arxiv.org/abs/1801.04470\n\n![embedrank](embedrank.gif)\n\nBy using sentence embeddings, EmbedRank embeds both the document and candidate phrases into the same embedding space.\n\nN candidates 
are selected as keyphrases using Maximal Marginal Relevance (MMR): the cosine similarity between each candidate and the\ndocument models informativeness, and the cosine\nsimilarity between candidates models diversity.\n\nA hyperparameter, beta (default=0.55), controls the trade-off between \ninformativeness and diversity when extracting keyphrases\n(beta = 1: only informativeness; beta = 0: only diversity).\nYou can change the beta hyperparameter value when calling extract_keyphrases:\n\n```\nkp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en', beta=0.8)  # extract 10 keyphrases with beta=0.8\n\n```\n\nTo replicate the results of the paper, set beta to 1 or 0.5 and turn off the alias feature by passing alias_threshold=1 to the extract_keyphrases method.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswisscom%2Fai-research-keyphrase-extraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fswisscom%2Fai-research-keyphrase-extraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswisscom%2Fai-research-keyphrase-extraction/lists"}