{"id":37605901,"url":"https://github.com/dperezrada/keywords2vec","last_synced_at":"2026-01-16T10:09:08.512Z","repository":{"id":40176877,"uuid":"157625885","full_name":"dperezrada/keywords2vec","owner":"dperezrada","description":null,"archived":false,"fork":false,"pushed_at":"2023-04-12T00:21:21.000Z","size":1207,"stargazers_count":123,"open_issues_count":3,"forks_count":15,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-08-21T09:44:36.316Z","etag":null,"topics":["keywords-extraction","multi-language","nlp","phrase-extraction","text-mining"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dperezrada.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-11-14T23:44:20.000Z","updated_at":"2025-01-19T16:56:27.000Z","dependencies_parsed_at":"2022-07-26T04:32:01.359Z","dependency_job_id":null,"html_url":"https://github.com/dperezrada/keywords2vec","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dperezrada/keywords2vec","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dperezrada%2Fkeywords2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dperezrada%2Fkeywords2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dperezrada%2Fkeywords2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dperezrada%2Fkeywords2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dperezrada","download_url":"https://codelo
ad.github.com/dperezrada/keywords2vec/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dperezrada%2Fkeywords2vec/sbom","scorecard":{"id":354312,"data":{"date":"2025-08-11","repo":{"name":"github.com/dperezrada/keywords2vec","commit":"1c067dabafce8ad590fc6b1c255132bcb55f4415"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.5,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 1/27 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive 
permissions","details":["Warn: no topLevel permission defined: .github/workflows/main.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/main.yml:7: update your workflow using https://app.stepsecurity.io/secureworkflow/dperezrada/keywords2vec/main.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/main.yml:8: update your workflow using https://app.stepsecurity.io/secureworkflow/dperezrada/keywords2vec/main.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/main.yml:14","Warn: pipCommand not pinned by hash: .github/workflows/main.yml:15","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   2 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 4 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":0,"reason":"34 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-j6gc-792m-qgm2","Warn: Project is vulnerable to: GHSA-pj73-v5mw-pm9j","Warn: Project is vulnerable to: GHSA-jxhc-q857-3j6g","Warn: Project is vulnerable to: GHSA-48wp-p9qv-4j64","Warn: Project is vulnerable to: GHSA-4qw4-jpp4-8gvp","Warn: Project is vulnerable to: GHSA-636f-xm5j-pj9m","Warn: Project is vulnerable to: GHSA-7vh7-fw88-wj87","Warn: Project is vulnerable to: GHSA-fmx4-26r3-wxpf","Warn: Project is vulnerable to: GHSA-52p9-v744-mwjj","Warn: Project is vulnerable to: GHSA-mqm2-cgpr-p4m6","Warn: Project is vulnerable to: GHSA-286v-pcf5-25rc","Warn: Project is vulnerable to: GHSA-2qc6-mcvw-92cw","Warn: Project is vulnerable to: GHSA-2rr5-8q37-2w7h","Warn: Project is vulnerable to: GHSA-353f-x4gh-cqq8","Warn: Project is vulnerable to: GHSA-59gp-qqm7-cw4j","Warn: Project is vulnerable to: GHSA-5w6v-399v-w3cc","Warn: Project is vulnerable to: GHSA-7rrm-v45f-jp64","Warn: Project is vulnerable to: GHSA-cgx6-hpwq-fhv5","Warn: Project is vulnerable to: GHSA-crjr-9rc5-ghw8","Warn: Project is vulnerable to: GHSA-fq42-c5rg-92c2","Warn: Project is vulnerable to: GHSA-gx8x-g87m-h5q6","Warn: Project is vulnerable to: GHSA-jc36-42cf-vqwj","Warn: Project is vulnerable to: GHSA-jw9f-hh49-cvp9","Warn: Project is vulnerable to: GHSA-mrxw-mxhj-p664","Warn: Project is vulnerable to: GHSA-pxvg-2qj5-37jq","Warn: Project is vulnerable to: GHSA-r95h-9x8f-r3f7","Warn: 
Project is vulnerable to: GHSA-v4f8-2847-rwm7","Warn: Project is vulnerable to: GHSA-v6gp-9mmm-c6p5","Warn: Project is vulnerable to: GHSA-vr8q-g5c7-m54m","Warn: Project is vulnerable to: GHSA-vvfq-8hwr-qm4m","Warn: Project is vulnerable to: GHSA-xc9x-jj77-9p9j","Warn: Project is vulnerable to: GHSA-xh29-r2w5-wx8m","Warn: Project is vulnerable to: GHSA-xxx9-3xcr-gjj3","Warn: Project is vulnerable to: GHSA-5cm2-9h8c-rvfx"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-18T09:10:08.573Z","repository_id":40176877,"created_at":"2025-08-18T09:10:08.573Z","updated_at":"2025-08-18T09:10:08.573Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["keywords-extraction","multi-language","nlp","phrase-extraction","text-mining"],"created_at":"2026-01-16T10:09:07.984Z","updated_at":"2026-01-16T10:09:08.501Z","avatar_url":"https://github.com/dperezrada.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# keywords2vec\n\u003e A 
simple and fast way to generate a word2vec model, with multi-word keywords instead of single words.\n\n\n## Example result\n\nFinding similar keywords for \"__obesity__\"\n\n| index | term                        |\n|-------|-----------------------------|\n| 0     | overweight                  |\n| 1     | obese                       |\n| 2     | physical inactivity         |\n| 3     | excess weight               |\n| 4     | high bmi                    |\n| 5     | obese adults                |\n| 6     | obese people                |\n| 7     | obesity-related outcomes    |\n| 8     | obesity among children      |\n| 9     | poor sleep quality          |\n| 10    | ssbs                        |\n| 11    | obese populations           |\n| 12    | cardiometabolic risk        |\n| 13    | abdominal obesity           |\n\n\n## Install\n\n`pip install keywords2vec`\n\n## How to use\n\nLet's download some example data\n\n```\ndata_filepath = \"epistemonikos_data_sample.tsv.gz\"\n\n!wget \"https://s3.amazonaws.com/episte-labs/epistemonikos_data_sample.tsv.gz\" -O \"{data_filepath}\"\n```\n\nImport\n\n```\nfrom keywords2vec.main import similars_tree, get_similars\n```\n\n\nWe create the model.\n\n```\nlabels, tree = similars_tree(data_filepath)\n```\n\nFor more info, take a look [here](30_main.ipynb)\n\n\n\nThen we can get the most similar keywords\n\n```\nget_similars(tree, labels, \"obesity\")\n```\n\n\n\n\n    ['obesity',\n     'overweight',\n     'obese',\n     'physical inactivity',\n     'excess weight',\n     'high bmi',\n     'obese adults',\n     'obese people',\n     'obesity-related outcomes',\n     'obesity among children',\n     'poor sleep quality',\n     'ssbs',\n     'obese populations',\n     'cardiometabolic risk',\n     'abdominal obesity']\n\n\n\n```\nget_similars(tree, labels, \"heart failure\")\n```\n\n\n\n\n    ['heart failure',\n     'hf',\n     'chf',\n     'chronic heart failure',\n     'reduced ejection 
fraction',\n     'unstable angina',\n     'peripheral vascular disease',\n     'peripheral arterial disease',\n     'angina',\n     'congestive heart failure',\n     'left ventricular systolic dysfunction',\n     'acute coronary syndrome',\n     'heart failure patients',\n     'acute myocardial infarction',\n     'left ventricular dysfunction']\n\n\n\n### Motivation\n\nThe idea started in the Epistemonikos database [www.epistemonikos.org](https://www.epistemonikos.org), a database of scientific articles for people making decisions concerning clinical or health-policy questions. In this context the scientific/health language used is complex. You can easily find keywords like:\n\n * asthma\n * heart failure\n * medial compartment knee osteoarthritis\n * preserved left ventricular systolic function\n * non-selective non-steroidal anti-inflammatory drugs\n \nWe tried several approaches to find those keywords, such as n-grams, n-grams + tf-idf, and entity identification, among others, but we didn't get really good results.\n\n\n### Our approach\n\nWe found that tokenizing on stopwords + non-word characters was really useful for \"finding\" the keywords. An example:\n\n* input: \"Timing of replacement therapy for acute renal failure after cardiac surgery\"\n* output: [\n\t\"timing\",\n\t\"replacement therapy\",\n\t\"acute renal failure\",\n\t\"cardiac surgery\"\n]\n\nSo we basically split the text when we find:\n * a stopword\n * a non-word character (/,!?. etc.), except for - and '\n\nThat's it.\n\nBut there were some problems with keywords that contain stopwords, like:\n * Vitamin A\n * Hepatitis A\n * Web of Science\n\nSo we decided to add another method (nltk with some grammar definition) to cover most of the cases. To use this, you need to add the parameter `keywords_w_stopwords=True`; this method is approx. 20x slower.\n\n### References\n\nIt seems to be an old idea (2004):\n\n*Mihalcea, Rada, and Paul Tarau. 
\"Textrank: Bringing order into text.\" Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.*\n\nReading an implementation of TextRank, I realized they used stopwords to separate and create the graph. Then I thought of using it as a tokenizer for word2vec.\n\nAs pointed out by @deliprao in this [Twitter thread](https://twitter.com/jeremyphoward/status/1094025901371621376), it's also used by RAKE (2010):\n\n*Rose, Stuart \u0026 Engel, Dave \u0026 Cramer, Nick \u0026 Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. 10.1002/9780470689646.ch1.*\n\nAs noted by @astent in the Twitter thread, this concept is called chinking (chunking by exclusion)\n[https://www.nltk.org/book/ch07.html#Chinking](https://www.nltk.org/book/ch07.html#Chinking)\n\n\n### Multi-lingual\nWe worked on an implementation that could be used in multiple languages. Of course, not all languages are suitable for this approach. We have tried it with good results in English, Spanish and Portuguese.\n\n\n## Try it online\n\nYou can try it [here](http://54.196.169.11/episte/) (takes time to load, lowercase only, doesn't work on mobile yet) MVP :)\n\nThese embeddings were created using 827,341 titles/abstracts from the @epistemonikos database,\nwith keywords that repeat at least 10 times. The total vocab is 349,080 keywords (really manageable number)\n\n## Vocab size\n\nOne of the main benefits of this method is the size of the vocabulary. 
\nFor example, using keywords that repeat at least 10 times, for the Epistemonikos dataset (827,341 titles/abstracts), we got the following vocab size:\n\n| ngrams             | keywords  | comp    |\n|--------------------|-----------|---------|\n| 1                  | 127,824   | 36%     |\n| 1-2                | 1,360,550 | 388%    |\n| 1-3                | 3,204,099 | 914%    |\n| 1-4                | 4,461,930 | 1,272%  |\n| 1-5                | 5,133,619 | 1,464%  |\n|                    |           |         |\n| stopword tokenizer | 350,529   | 100%    |\n\nFor more information regarding the comparison, take a look at the folder [analyze](analyze).\n\n\n## Credits\n\nThis project has been created using [nbdev](https://github.com/fastai/nbdev)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdperezrada%2Fkeywords2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdperezrada%2Fkeywords2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdperezrada%2Fkeywords2vec/lists"}