{"id":19031467,"url":"https://github.com/plandes/clj-nlp-parse","last_synced_at":"2025-04-23T16:40:41.752Z","repository":{"id":80107493,"uuid":"63271149","full_name":"plandes/clj-nlp-parse","owner":"plandes","description":"Natural Language Parsing and Feature Generation","archived":false,"fork":false,"pushed_at":"2024-12-01T19:23:52.000Z","size":6127,"stargazers_count":38,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-18T01:47:59.127Z","etag":null,"topics":["clojure","natural-language-processing","parsing","semantic-role-labeling","stanford-corenlp"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/plandes.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-07-13T18:45:27.000Z","updated_at":"2024-12-01T19:23:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"a1511a64-2184-49e7-990b-ee094d1a1907","html_url":"https://github.com/plandes/clj-nlp-parse","commit_stats":null,"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-parse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-parse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-parse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-parse/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/plandes","download_url":"https://codeload.github.com/plandes/clj-nlp-parse/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250471915,"owners_count":21436044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","natural-language-processing","parsing","semantic-role-labeling","stanford-corenlp"],"created_at":"2024-11-08T21:23:46.190Z","updated_at":"2025-04-23T16:40:41.732Z","avatar_url":"https://github.com/plandes.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Natural Language Parse and Feature Generation\n\nA Clojure language library to parse natural language text into features useful\nfor machine learning models.\n\nFeatures include:\n\n* Wraps several Java natural language parsing libraries.\n* Gives access to the data structures rendered by the parsers.\n* Provides utility functions to create features.\n\nThis framework combines the results of the following frameworks:\n* [Stanford CoreNLP 3.8.0](https://github.com/stanfordnlp/CoreNLP)\n* [ClearNLP 2.0.2](https://github.com/emorynlp/nlp4j)\n* [Stop Word Annotator](https://github.com/plandes/stopword-annotator)\n\n\u003c!-- markdown-toc start - Don't edit this section. 
Run M-x markdown-toc-refresh-toc --\u003e\n## Table of Contents\n\n- [Features](#features)\n- [Obtaining](#obtaining)\n- [Documentation](#documentation)\n    - [API Documentation](#api-documentation)\n    - [Annotation Definitions](#annotation-definitions)\n- [Example Parse](#example-parse)\n- [Setup](#setup)\n    - [Download and Install POS Tagger Model Manually](#download-and-install-pos-tagger-model-manually)\n    - [REPL](#repl)\n- [Usage](#usage)\n    - [Usage Example](#usage-example)\n    - [Parsing an Utterance](#parsing-an-utterance)\n    - [Utility Functions](#utility-functions)\n    - [Feature Creation](#feature-creation)\n    - [Stopword Filtering](#stopword-filtering)\n    - [Dictionary Utility](#dictionary-utility)\n    - [Pipeline Configuration](#pipeline-configuration)\n        - [Pipeline Usage](#pipeline-usage)\n        - [Convenience Namespace](#convenience-namespace)\n    - [Command Line Usage](#command-line-usage)\n- [Building](#building)\n- [Changelog](#changelog)\n- [Citation](#citation)\n- [References](#references)\n- [License](#license)\n\n\u003c!-- markdown-toc end --\u003e\n\n\n## Features\n\n* [Callable](https://dzone.com/articles/java-clojure-interop-calling) from Java\n* [Callable](https://github.com/plandes/clj-nlp-serv) from REST\n* Callable from REST in a [Docker Image](https://hub.docker.com/r/plandes/nlpservice/)\n* Completely customizable.\n* Easily extendable.\n* Combines all annotations as pure Clojure data structures.\n* Provides feature creation libraries:\n  - [Character](https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.char.html)\n  - [Dictionary, Word Lists](https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.word.html)\n  - [Language (SRL, POS, etc)](https://plandes.github.io/clj-nlp-parse/codox/zensols.nlparse.feature.lang.html)\n  - [Word Counts](https://plandes.github.io/clj-nlp-parse/codox/zensols.nlparse.feature.word-count.html)\n* Stitches multiple frameworks to provide the 
following features:\n  - [Tokenizing](https://en.wikipedia.org/wiki/Lexical_analysis#Token)\n  - Grouping Tokens into Sentences\n  - [Lemmatisation](https://en.wikipedia.org/wiki/Lemmatisation)\n  - [Part of Speech Tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)\n  - [Stop Words](https://en.wikipedia.org/wiki/Stop_words) (both word and\n    lemma)\n  - [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)\n  - [Syntactic Parse Tree](https://en.wikipedia.org/wiki/Parse_tree)\n  - [Fast Shift Reduce Parse Tree](https://en.wikipedia.org/wiki/Shift-reduce_parser)\n  - [Dependency Tree](https://en.wikipedia.org/wiki/Dependency_grammar)\n  - [Co-reference Graph](https://en.wikipedia.org/wiki/Coreference)\n  - [Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)\n  - [Semantic Role Labeler](https://en.wikipedia.org/wiki/Semantic_role_labeling)\n* Seamless integration with other feature creation libraries:\n  * [General NLP feature creation]\n  * [Word vector feature creation]\n\n\n## Obtaining\n\nIn your `project.clj` file, add:\n\n[![Clojars Project](https://clojars.org/com.zensols.nlp/parse/latest-version.svg)](https://clojars.org/com.zensols.nlp/parse/)\n\n\n## Documentation\n\n### API Documentation\n\n* [Clojure](https://plandes.github.io/clj-nlp-parse/codox/index.html)\n* [Java](https://plandes.github.io/clj-nlp-parse/apidocs/index.html)\n\n\n### Annotation Definitions\n\nThe utterance parse annotation tree\ndefinitions are [given here](doc/annotation-definitions.md).\n\n\n## Example Parse\n\nAn example of a full annotation parse is [given here](doc/example-parse.md).\n\n## Setup\n\nThe NER model is included in the Stanford CoreNLP dependencies, but you still\nhave to download the POS model.  To download (or create a symbolic link if\nyou've set the `ZMODEL` environment variable):\n```bash\n$ make model\n```\n\nIf this doesn't work, follow\nthe [manual](#download-and-install-pos-tagger-model-manually) steps. 
 Otherwise\nyou can optionally move the model to a shared location on the file system and\nskip to [configuring the REPL](#repl).\n\n\n### Download and Install POS Tagger Model Manually\n\nIf the [normal setup](#setup) failed, you'll have to manually download the POS\ntagger model.\n\nThe library can be configured to use any POS model (or NER for that matter),\nbut by default it expects\nthe\n[english-left3words-distsim.tagger model](http://nlp.stanford.edu/software/pos-tagger-faq.shtml).\n\n1. Create a directory for the model\n   ```bash\n   $ mkdir -p path-to-model/stanford/pos\n   ```\n\n2. Download the [english-left3words-distsim.tagger model](http://nlp.stanford.edu/software/stanford-postagger-2015-12-09.zip)\n   or a [similar](http://nlp.stanford.edu/software/tagger.shtml#Download) model.\n\n3. Install the model file:\n   ```bash\n   $ unzip stanford-postagger-2015-12-09.zip\n   $ mv stanford-postagger-2015-12-09/models/english-left3words-distsim.tagger path-to-model/stanford/pos\n   ```\n\n### REPL\n\nIf you download the model to any location other than the current start\ndirectory (see [setup](#setup)) you will have to tell the REPL where the model\nis kept on the file system.\n\nStart the REPL and configure:\n   ```clojure\n   user\u003e (System/setProperty \"zensols.model\" \"path-to-model\")\n   ```\n\nNote that system properties can be passed via `lein` to avoid having to repeat\nthis for each REPL instance.\n\n\n## Usage\n\nThis package supports:\n* [Parsing an Utterance](#parsing-an-utterance)\n* [Utility Functions](#utility-functions)\n* [Dictionary Utility](#dictionary-utility)\n* [Stopword Filtering](#stopword-filtering)\n* [Command Line Usage](#command-line-usage)\n\n\n### Usage Example\n\nSee the [example repo](https://github.com/plandes/clj-example-nlp-ml) that\nillustrates how to use this library and contains the code from where these\nexamples originate.  
It's highly recommended to clone it and follow along as\nyou peruse this README.\n\n\n### Parsing an Utterance\n```clojure\nuser\u003e (require '[zensols.nlparse.parse :refer (parse)])\nuser\u003e (clojure.pprint/pprint (parse \"I am Paul Landes.\"))\n=\u003e {:text \"I am Paul Landes.\",\n :mentions\n ({:entity-type \"PERSON\",\n   :token-range [2 4],\n   :ner-tag \"PERSON\",\n   :sent-index 0,\n   :char-range [5 16],\n   :text \"Paul Landes\"}),\n :sents\n ({:text \"I am Paul Landes.\",\n   :sent-index 0,\n   :parse-tree\n   {:label \"ROOT\",\n    :child\n    ({:label \"S\",\n      :child\n      ({:label \"NP\",\n        :child ({:label \"PRP\", :child ({:label \"I\", :token-index 1})})}\n...\n   :dependency-parse-tree\n   ({:token-index 4,\n     :text \"Landes\",\n     :child\n     ({:dep \"nsubj\", :token-index 1, :text \"I\"}\n      {:dep \"cop\", :token-index 2, :text \"am\"}\n      {:dep \"compound\", :token-index 3, :text \"Paul\"}\n      {:dep \"punct\", :token-index 5, :text \".\"})}),\n...\n   :tokens\n   ({:token-range [0 1],\n     :ner-tag \"O\",\n     :pos-tag \"PRP\",\n     :lemma \"I\",\n     :token-index 1,\n     :sent-index 0,\n     :char-range [0 1],\n     :text \"I\",\n     :srl\n     {:id 1,\n      :propbank nil,\n      :head-id 2,\n      :dependency-label \"root\",\n      :heads ({:function-tag \"PPT\", :dependency-label \"A1\"})}}\n...\n```\n\n\n### Utility Functions\n\nThere are utility functions to help with getting around the parsed data, as it can\nbe pretty large.  For example, to find the head of the dependency tree:\n```clojure\nuser\u003e (require '[zensols.nlparse.parse :as p])\nuser\u003e (def panon (parse \"I am Paul Landes.\"))\n=\u003e {:text...\nuser\u003e (-\u003e\u003e panon :sents first p/root-dependency :text)\n=\u003e \"Landes\"\n```\n\nIn this case, the last name is the head of the tree and happens to be a named\nentity as detected by the Stanford CoreNLP NER system.  
Named entities are\nannotated at the token level, but also included in the *mentions* top level\nwith the entire set of concatenated tokens (for cases where a named entity contains\nmore than one token like in this case).  To get the full mention text:\n```clojure\nuser\u003e (-\u003e\u003e panon :sents first p/root-dependency\n                (p/mention-for-token panon)\n                first :text)\n=\u003e \"Paul Landes\"\n```\n\n### Feature Creation\n\nThis library was written to generate features for machine learning\nalgorithms.  There are some utility functions for doing this.\n\nOther feature libraries that integrate with this library:\n* [General NLP feature creation]\n* [Word vector feature creation]\n\nBelow are examples of feature creation with just this library.\n\nGet the first PropBank label parsed from the SRL:\n```clojure\nuser\u003e (-\u003e\u003e panon f/first-propbank-label)\n=\u003e \"be.01\"\n```\n\nGet stats on features:\n```clojure\nuser\u003e (-\u003e\u003e panon p/tokens (f/token-features panon))\n=\u003e {:utterance-length 17,\n    :mention-count 1,\n    :sent-count 1,\n    :token-count 5,\n    :token-average-length 14/5,\n    :is-question false}\n```\n\nEach function `X` has an analog function `X-feature-keys` that describes the\nfeatures generated and their types, which can be used directly as Weka\nattributes:\n```clojure\nuser\u003e (clojure.pprint/pprint (f/token-feature-metas))\n=\u003e [[:utterance-length numeric]\n    [:mention-count numeric]\n    [:sent-count numeric]\n    [:token-count numeric]\n    [:token-average-length numeric]\n    [:is-question boolean]]\n```\n\nGet in/out-of-vocabulary ratio:\n```clojure\nuser\u003e (-\u003e\u003e panon p/tokens f/dictionary-features)\n=\u003e {:in-dict-ratio 4/5}\n```\n\nWord count features provide distributions over word counts.\nSee the [unit test](test/zensols/nlparse/word_count_test.clj).\n\n\n### Stopword Filtering\n\nFilter stop words from a parsed utterance:\n```clojure\nuser\u003e (require '[zensols.nlparse.parse :as p])\nuser\u003e (require 
'[zensols.nlparse.stopword :as st])\nuser\u003e (-\u003e\u003e (p/parse \"This is a test.  This will filter 5 semantically significant words.\")\n           p/tokens\n           st/go-word-forms)\n=\u003e (\"test\" \"filter\" \"semantically\" \"significant\" \"words\")\n```\n\nSee the [unit test](test/zensols/nlparse/stopword_test.clj).\n\n\n### Dictionary Utility\n\nSee the [NLP feature library](https://github.com/plandes/clj-nlp-feature) for\nmore information on dictionary specifics.\n\n\n### Pipeline Configuration\n\nYou can not only configure the natural language processing pipeline and which\nspecific components to use, but you can also define and add your own plugin\nlibrary.  See the\n[config namespace](https://plandes.github.io/clj-nlp-parse/codox/zensols.nlparse.config.html)\nfor more information.\n\n\n#### Pipeline Usage\n\nFor example, if all you need is tokenization and sentence chunking, create a\ncontext with those specific components and parse with it using the\n`with-context` macro:\n```clojure\n(require '[zensols.nlparse.config :as conf :refer (with-context)]\n         '[zensols.nlparse.parse :refer (parse)])\n\n(let [ctx (-\u003e\u003e (conf/create-parse-config\n                :pipeline [(conf/tokenize)\n                           (conf/sentence)])\n               conf/create-context)]\n  (with-context ctx\n    (parse \"I love Clojure.  I enjoy it.\")))\n```\n\nYou can also specify the configuration in the form of a string:\n```clojure\n(let [ctx (conf/create-context \"tokenize,sentence,part-of-speech\")]\n  (with-context ctx\n    (parse \"I love Clojure.  I enjoy it.\")))\n```\n\nThe configuration string can also take parameters (e.g. the `en` parameter to the\ntokenizer specifying English as the natural language):\n```clojure\n(let [ctx (conf/create-context \"tokenize(en),sentence,part-of-speech\")]\n  (with-context ctx\n    (parse \"I love Clojure.  
I enjoy it.\")))\n```\n\nFor an example on how to configure the pipeline, see\n[this test case](https://github.com/plandes/clj-nlp-parse/blob/master/test/zensols/nlparse/ner_test.clj#L12-L20).\nFor more information on the DSL itself see the\n[DSL parser](https://github.com/plandes/clj-nlp-parse/blob/master/src/clojure/zensols/nlparse/config_parse.clj).\n\n\n#### Convenience Namespace\n\nIf you use a particular configuration that doesn't change often, consider creating your\nown utility parse namespace:\n\n```clojure\n(ns example.nlp.parse\n  (:require [clojure.string]\n            [zensols.nlparse.parse :as p]\n            [zensols.nlparse.config :as conf :refer (with-context)]))\n\n(defonce ^:private parse-context-inst (atom nil))\n\n(defn- create-context []\n  (-\u003e\u003e [\"tokenize\"\n        \"sentence\"\n        \"part-of-speech\"\n        \"morphology\"\n        \"named-entity-recognizer\"\n        \"parse-tree\"]\n       (clojure.string/join \",\")\n       conf/create-context))\n\n(defn- context []\n  (swap! parse-context-inst #(or % (create-context))))\n\n(defn parse [utterance]\n  (with-context (context)\n    (p/parse utterance)))\n```\n\nNow in your application namespace:\n\n```clojure\n(ns example.nlp.core\n  (:require [example.nlp.parse :as p]))\n\n(defn somefn []\n  (p/parse \"an utterance\"))\n```\n\n\n### Command Line Usage\n\nThe command line usage of this project has moved to\nthe [NLP server](https://github.com/plandes/clj-nlp-serv#comand-line-usage).\n\n\n## Building\n\nTo build from source, do the following:\n\n- Install [Leiningen](http://leiningen.org) (this is just a script)\n- Install [GNU make](https://www.gnu.org/software/make/)\n- Install [Git](https://git-scm.com)\n- Download the source: `git clone --recurse-submodules https://github.com/plandes/clj-nlp-parse \u0026\u0026 cd clj-nlp-parse`\n- Build the software: `make jar`\n- Build the distribution binaries: `make dist`\n\nNote that you can also build a single jar file with all the dependencies with: `make uber`\n\n\n## 
Changelog\n\nAn extensive changelog is available [here](CHANGELOG.md).\n\n\n## Citation\n\nIf you use this software in your research, please cite with the following\nBibTeX:\n\n```jflex\n@misc{plandes-clj-nlp-parse,\n  author = {Paul Landes},\n  title = {Natural Language Parse and Feature Generation},\n  year = {2018},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/plandes/clj-nlp-parse}}\n}\n```\n\n\n## References\n\nSee the [General NLP feature creation] library for additional references.\n\n```jflex\n@phdthesis{choi2014optimization,\n  title = {Optimization of natural language processing components for robustness and scalability},\n  author = {Choi, Jinho D},\n  year = {2014},\n  school = {University of Colorado Boulder}\n}\n\n@InProceedings{manning-EtAl:2014:P14-5,\n  author = {Manning, Christopher D. and  Surdeanu, Mihai  and  Bauer, John  and  Finkel, Jenny  and  Bethard, Steven J. and  McClosky, David},\n  title = {The {Stanford} {CoreNLP} Natural Language Processing Toolkit},\n  booktitle = {Association for Computational Linguistics (ACL) System Demonstrations},\n  year = {2014},\n  pages = {55--60},\n  url = {http://www.aclweb.org/anthology/P/P14/P14-5010}\n}\n```\n\n\n## License\n\nCopyright (c) 2016 - 2024 Paul Landes\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of\nthis software and associated documentation files (the \"Software\"), to deal in\nthe Software without restriction, including without limitation the rights to\nuse, copy, modify, merge, publish, distribute, sublicense, and/or sell copies\nof the Software, and to permit persons to whom the Software is furnished to do\nso, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO 
THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplandes%2Fclj-nlp-parse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplandes%2Fclj-nlp-parse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplandes%2Fclj-nlp-parse/lists"}