{"id":33058458,"url":"https://github.com/NIHOPA/NLPre","last_synced_at":"2025-11-28T13:01:32.922Z","repository":{"id":49060819,"uuid":"86833359","full_name":"NIHOPA/NLPre","owner":"NIHOPA","description":"Python library for Natural Language Preprocessing (NLPre)","archived":false,"fork":false,"pushed_at":"2023-07-31T07:23:02.000Z","size":53485,"stargazers_count":191,"open_issues_count":2,"forks_count":35,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-09-16T05:35:11.056Z","etag":null,"topics":["natural-language-processing","nlp","nlp-parsing","python","text-processing"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NIHOPA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-31T15:22:09.000Z","updated_at":"2025-03-31T18:57:09.000Z","dependencies_parsed_at":"2024-01-14T06:56:46.612Z","dependency_job_id":"1b911e24-3a90-4eb7-ac9d-f4b52f492f50","html_url":"https://github.com/NIHOPA/NLPre","commit_stats":{"total_commits":355,"total_committers":4,"mean_commits":88.75,"dds":0.3690140845070422,"last_synced_commit":"1f7c05734026b39467fe521adbea7e799b91037a"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/NIHOPA/NLPre","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIHOPA%2FNLPre","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIHOPA%2FNLPre/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIHOPA%2FNLPre/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/NIHOPA%2FNLPre/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NIHOPA","download_url":"https://codeload.github.com/NIHOPA/NLPre/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NIHOPA%2FNLPre/sbom","scorecard":{"id":98123,"data":{"date":"2025-08-11","repo":{"name":"github.com/NIHOPA/NLPre","commit":"1f7c05734026b39467fe521adbea7e799b91037a"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.5,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/26 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively 
maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations 
found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":9,"reason":"1 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 5 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-15T09:31:57.885Z","repository_id":49060819,"created_at":"2025-08-15T09:31:57.885Z","updated_at":"2025-08-15T09:31:57.885Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27308347,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-28T02:00:06.623Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","nlp-parsing","python","text-processing"],"created_at":"2025-11-14T05:00:28.835Z","updated_at":"2025-11-28T13:01:32.914Z","avatar_url":"https://github.com/NIHOPA.png","language":"Python","funding_links":[],"categories":["Language Processing and Information Extraction"],"sub_categories":[],"readme":"# Natural Language Preprocessing (NLPre)\n\n[![Build Status](https://travis-ci.org/NIHOPA/NLPre.svg?branch=master)](https://travis-ci.org/NIHOPA/NLPre)\n[![codecov](https://codecov.io/gh/NIHOPA/NLPre/branch/master/graph/badge.svg)](https://codecov.io/gh/NIHOPA/NLPre)\n[![PyPI](https://img.shields.io/pypi/v/nlpre.svg)](https://pypi.python.org/pypi/nlpre)\n[![PyVersion](https://img.shields.io/pypi/pyversions/nlpre.svg)](https://img.shields.io/pypi/pyversions/nlpre.svg)\n\n## Major version update! 
NLPre 2.0.0\n\n+ Backend NLP engine `pattern.en` has been replaced with `spaCy` v2.1.0. This is a major fix for some of the problems with `pattern.en`, including poor lemmatization (e.g. cytokine -\u003e cytocow).\n+ Support for Python 2 has been dropped\n+ Support for custom dictionaries in `replace_from_dictionary`\n+ Option for suffix to be used instead of prefix in `replace_from_dictionary`\n+ URL replacement can now remove emails\n+ `token_replacement` can remove symbols\n\nNLPre is a text (pre)-processing library that helps smooth some of the inconsistencies found in real-world data.\nCorrecting for issues like random capitalization patterns, strange hyphenations, and abbreviations is an essential part of wrangling textual data but is often left to the user.\n\nWhile this library was developed by the [Office of Portfolio Analysis](https://dpcpsi.nih.gov/opa/aboutus) at the [National Institutes of Health](https://www.nih.gov/) to correct for historical artifacts in our data, we envision this module encompassing a broad spectrum of problems encountered in the preprocessing step of natural language processing.\n\nNLPre is part of the [`word2vec-pipeline`](https://github.com/NIHOPA/word2vec_pipeline).\n\n### Installation\n\nFor the latest release, use\n\n    pip install nlpre\n\nIf installing the Python 3 version on Ubuntu, you may need to use\n\n    sudo apt-get install libmysqlclient-dev\n\n### Example\n\n```python\nfrom nlpre import titlecaps, dedash, identify_parenthetical_phrases\nfrom nlpre import replace_acronyms, replace_from_dictionary\n\ntext = (\"LYMPHOMA SURVIVORS IN KOREA. 
Describe the correlates of unmet needs \"\n        \"among non-Hodgkin lymphoma (NHL) surv- ivors in Korea and identify \"\n        \"NHL patients with an abnormal white blood cell count.\")\n\nABBR = identify_parenthetical_phrases()(text)\nparsers = [dedash(), titlecaps(), replace_acronyms(ABBR),\n           replace_from_dictionary(prefix=\"MeSH_\")]\n\nfor f in parsers:\n    text = f(text)\n\nprint(text)\n\n''' lymphoma survivors in korea .\n    Describe the correlates of unmet needs among non_Hodgkin_lymphoma\n    ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma\n    patients with an abnormal MeSH_Leukocyte_Count . '''\n```\n\nA longer example highlighting a \"pipeline\" of changes can be found [here](long_example.md).\n\nTo see a detailed log of the changes made, set the level to `logging.INFO` or `logging.DEBUG`,\n\n```python\nimport nlpre, logging\nnlpre.logger.setLevel(logging.INFO)\n```\n\n### What's included?\n\n| Function | Description |\n| --- | --- |\n| [**replace_from_dictionary**](nlpre/replace_from_dictionary.py) | Replace phrases from an input dictionary. The replacement is done without regard to case, but punctuation is handled correctly. The [MeSH ](https://www.nlm.nih.gov/mesh/) (Medical Subject Headings) dictionary is built-in. \u003cbr\u003e `(11-Dimethylethyl)-4-methoxyphenol is great` \u003cbr\u003e `MeSH_Butylated_Hydroxyanisole is great` |\n| [**replace_acronyms**](nlpre/replace_acronyms.py) | Replaces acronyms and abbreviations found in a document with their corresponding phrase. If an acronym is explicitly identified with a phrase in a document, then  all instances of that acronym in the document will be replaced with the given phrase. If there is no explicit indication what the phrase is within the document, then the most common phrase associated with the acronym in the given counter is used. 
\u003cbr\u003e `The EPA protects trees` \u003cbr\u003e `The Environmental_Protection_Agency protects trees` |\n| [**identify_parenthetical_phrases**](nlpre/identify_parenthetical_phrases.py) | Identifies abbreviations of phrases found in parentheses. Returns a counter that can be passed directly into [`replace_acronyms`](nlpre/replace_acronyms.py). \u003cbr\u003e `Environmental Protection Agency (EPA)` \u003cbr\u003e `Counter((('Environmental', 'Protection', 'Agency'), 'EPA'):1)` |\n| [**separated_parenthesis**](nlpre/separated_parenthesis.py) | Separates parenthetical content into new sentences. This is useful when creating word embeddings, as associations should only be made within the same sentence. A terminal period is added to parenthetical sentences if necessary. \u003cbr\u003e `Hello (it is a beautiful day) world.` \u003cbr\u003e`Hello world. it is a beautiful day .` |\n| [**pos_tokenizer**](nlpre/pos_tokenizer.py) | Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the [`spaCy`](https://spacy.io/) module. \u003cbr\u003e `The boy threw the ball into the yard` \u003cbr\u003e `boy ball yard` |\n| [**unidecoder**](nlpre/unidecoder.py) | Converts Unicode phrases into their ASCII equivalents. \u003cbr\u003e `α-Helix β-sheet` \u003cbr\u003e `a-Helix b-sheet` |\n| [**dedash**](nlpre/dedash.py) | Hyphenations are sometimes erroneously inserted when text is passed through a word processor. This module attempts to correct the hyphenation pattern by joining the word fragments if the joined word appears in an English word list. \u003cbr\u003e `How is the treat- ment going` \u003cbr\u003e `How is the treatment going` |\n| [**decaps_text**](nlpre/decaps_text.py) | We presume that case is important, but only when it differs from title case. This class normalizes capitalization patterns. 
\u003cbr\u003e `James and Sally had a fMRI` \u003cbr\u003e `james and sally had a fMRI` |\n| [**titlecaps**](nlpre/titlecaps.py) | Documents sometimes have sentences that are entirely in uppercase (commonly found in titles and abstracts of older documents). This parser identifies sentences where every word is uppercase, and returns the document with these sentences converted to lowercase. \u003cbr\u003e `ON THE STRUCTURE OF WATER.` \u003cbr\u003e `On the structure of water .` |\n| [**token_replacement**](nlpre/token_replacement.py) | Simple token replacement. \u003cbr\u003e `Observed \u003e 20%` \u003cbr\u003e `Observed greater-than 20 percent` |\n| [**separate_reference**](nlpre/separate_reference.py) | Separates and optionally removes references that have been concatenated onto words. \u003cbr\u003e `Key feature of interleukin-1 in Drosophila3-5 and elegans(7).`\u003cbr\u003e`Key feature of interleukin-1 in Drosophila and elegans .` |\n| [**url_replacement**](nlpre/url_replacement.py) | Removes or replaces URLs \u003cbr\u003e `The source code is [here](www.github.com/NIHOPA/NLPre/).`\u003cbr\u003e`The source code is [here](LINK).` |\n\n\n## Citations and Acknowledgments\n\n+ He, Jian and Chaomei Chen. [Predictive Effects of Novelty Measured by Temporal Embeddings on the Growth of Scientific Literature.](https://www.frontiersin.org/articles/10.3389/frma.2018.00009/full) Frontiers in Research Metrics and Analytics, 3, 9. (2018).\n\n+ He, Jian and Chaomei Chen. [Temporal Representations of Citations for Understanding the Changing Roles of Scientific Publications.](https://www.frontiersin.org/articles/10.3389/frma.2018.00027) Front. Res. Metr. Anal. (2018).\n\n+ Galea, Dieter et al. 
[Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization.](http://www.aclweb.org/anthology/W18-2307) BioNLP (2018).\n\n## Contributors\n\n+ [Travis Hoppe](https://github.com/thoppe)\n+ [Harry Baker](https://github.com/HarryBaker)\n\n## License\n\nThis project is in the public domain within the United States, and\ncopyright and related rights in the work worldwide are waived through\nthe [CC0 1.0 Universal public domain dedication](https://creativecommons.org/publicdomain/zero/1.0/).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNIHOPA%2FNLPre","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNIHOPA%2FNLPre","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNIHOPA%2FNLPre/lists"}