{"id":13482638,"url":"https://github.com/dakrone/clojure-opennlp","last_synced_at":"2025-10-06T23:45:01.446Z","repository":{"id":41948336,"uuid":"521253","full_name":"dakrone/clojure-opennlp","owner":"dakrone","description":"Natural Language Processing in Clojure (opennlp)","archived":false,"fork":false,"pushed_at":"2018-11-27T14:37:14.000Z","size":33612,"stargazers_count":757,"open_issues_count":4,"forks_count":82,"subscribers_count":64,"default_branch":"master","last_synced_at":"2025-05-10T09:15:03.211Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"epl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dakrone.png","metadata":{"files":{"readme":"README.markdown","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-02-17T01:47:08.000Z","updated_at":"2025-04-21T18:26:28.000Z","dependencies_parsed_at":"2022-09-21T08:53:59.070Z","dependency_job_id":null,"html_url":"https://github.com/dakrone/clojure-opennlp","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fclojure-opennlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fclojure-opennlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fclojure-opennlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dakrone%2Fclojure-opennlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dakrone","download_url":"https://codeload.github.com/dakrone/clojure-opennlp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://git
hub.com","kind":"github","repositories_count":254355328,"owners_count":22057353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T17:01:03.985Z","updated_at":"2025-10-06T23:44:56.412Z","avatar_url":"https://github.com/dakrone.png","language":"Clojure","funding_links":[],"categories":["Clojure","Text Processing","Packages","函式庫"],"sub_categories":["Tools","[Tools](#tools-1)","Speech Recognition","Libraries","書籍"],"readme":"Clojure library interface to OpenNLP - https://opennlp.apache.org/\n============================================================\n\nA library to interface with the OpenNLP (Open Natural Language Processing)\nlibrary of functions. Not all functions are implemented yet.\n\nAdditional information/documentation:\n\n- [Natural Language Processing in Clojure with clojure-opennlp](http://writequit.org/blog/index.html%3Fp=365.html)\n- [Context searching using Clojure-OpenNLP](http://writequit.org/blog/index.html%3Fp=351.html)\n\nRead the source from Marginalia\n\n- http://dakrone.github.com/clojure-opennlp/\n\n[![Continuous Integration status](https://secure.travis-ci.org/dakrone/clojure-opennlp.png)](http://travis-ci.org/dakrone/clojure-opennlp)\n\nKnown Issues\n------------\n- When using the treebank-chunker on a sentence, please ensure you\nhave a period at the end of the sentence, if you do not have a period,\nthe chunker gets confused and drops the last word. 
Besides, your\nsentences should all be grammatically correct anyway, right?\n\n\nUsage from Leiningen:\n--------------------\n\n```clojure\n[clojure-opennlp \"0.5.0\"] ;; uses Opennlp 1.9.0\n```\n\nclojure-opennlp works with clojure 1.5+\n\nBasic Example usage (from a REPL):\n----------------------------------\n\n```clojure\n(use 'clojure.pprint) ; just for this documentation\n(use 'opennlp.nlp)\n(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here\n```\n\nYou will need to make the processing functions using the model files. These\nassume you're running from the root project directory. You can also download\nthe model files from the opennlp project at [http://opennlp.sourceforge.net/models-1.5](http://opennlp.sourceforge.net/models-1.5)\n\n```clojure\n(def get-sentences (make-sentence-detector \"models/en-sent.bin\"))\n(def tokenize (make-tokenizer \"models/en-token.bin\"))\n(def detokenize (make-detokenizer \"models/english-detokenizer.xml\"))\n(def pos-tag (make-pos-tagger \"models/en-pos-maxent.bin\"))\n(def name-find (make-name-finder \"models/namefind/en-ner-person.bin\"))\n(def chunker (make-treebank-chunker \"models/en-chunker.bin\"))\n```\n\nThe tool-creators are multimethods, so you can also create any of the\ntools using a model instead of a filename (you can create a model with\nthe training tools in src/opennlp/tools/train.clj):\n\n```clojure\n(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc\n```\n\nThen, use the functions you've created to perform operations on text:\n\nDetecting sentences:\n\n```clojure\n(pprint (get-sentences \"First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea...\"))\n[\"First sentence. \", \"Second sentence? \", \"Here is another one. \",\n \"And so on and so forth - you get the idea...\"]\n```\n\nTokenizing:\n\n```clojure\n(pprint (tokenize \"Mr. 
Smith gave a car to his son on Friday\"))\n[\"Mr.\", \"Smith\", \"gave\", \"a\", \"car\", \"to\", \"his\", \"son\", \"on\",\n \"Friday\"]\n```\n\nDetokenizing:\n\n```clojure\n(detokenize [\"Mr.\", \"Smith\", \"gave\", \"a\", \"car\", \"to\", \"his\", \"son\", \"on\", \"Friday\"])\n\"Mr. Smith gave a car to his son on Friday.\"\n```\n\nIdeally, s == (detokenize (tokenize s)), the detokenization model XML\nfile is a work in progress, please let me know if you run into\nsomething that doesn't detokenize correctly in English.\n\n\nPart-of-speech tagging:\n\n```clojure\n(pprint (pos-tag (tokenize \"Mr. Smith gave a car to his son on Friday.\")))\n([\"Mr.\" \"NNP\"]\n [\"Smith\" \"NNP\"]\n [\"gave\" \"VBD\"]\n [\"a\" \"DT\"]\n [\"car\" \"NN\"]\n [\"to\" \"TO\"]\n [\"his\" \"PRP$\"]\n [\"son\" \"NN\"]\n [\"on\" \"IN\"]\n [\"Friday.\" \"NNP\"])\n```\n\nName finding:\n\n```clojure\n(name-find (tokenize \"My name is Lee, not John.\"))\n(\"Lee\" \"John\")\n```\n\nTreebank-chunking splits and tags phrases from a pos-tagged sentence.\nA notable difference is that it returns a list of structs with the\n:phrase and :tag keys, as seen below:\n\n```clojure\n(pprint (chunker (pos-tag (tokenize \"The override system is meant to deactivate the accelerator when the brake pedal is pressed.\"))))\n({:phrase [\"The\" \"override\" \"system\"], :tag \"NP\"}\n {:phrase [\"is\" \"meant\" \"to\" \"deactivate\"], :tag \"VP\"}\n {:phrase [\"the\" \"accelerator\"], :tag \"NP\"}\n {:phrase [\"when\"], :tag \"ADVP\"}\n {:phrase [\"the\" \"brake\" \"pedal\"], :tag \"NP\"}\n {:phrase [\"is\" \"pressed\"], :tag \"VP\"})\n```\n\nFor just the phrases:\n\n```clojure\n(phrases (chunker (pos-tag (tokenize \"The override system is meant to deactivate the accelerator when the brake pedal is pressed.\"))))\n([\"The\" \"override\" \"system\"] [\"is\" \"meant\" \"to\" \"deactivate\"] [\"the\" \"accelerator\"] [\"when\"] [\"the\" \"brake\" \"pedal\"] [\"is\" \"pressed\"])\n```\n\nAnd with just 
strings:\n\n```clojure\n(phrase-strings (chunker (pos-tag (tokenize \"The override system is meant to deactivate the accelerator when the brake pedal is pressed.\"))))\n(\"The override system\" \"is meant to deactivate\" \"the accelerator\" \"when\" \"the brake pedal\" \"is pressed\")\n```\n\nDocument Categorization:\n\nSee opennlp.test.tools.train for better usage examples.\n\n```clojure\n(def doccat (make-document-categorizer \"my-doccat-model\"))\n\n(doccat \"This is some good text\")\n\"Happy\"\n```\n\nProbabilities of confidence\n---------------------------\n\nThe probabilities OpenNLP supplies for a given operation are available\nas metadata on the result, where applicable:\n\n```clojure\n(meta (get-sentences \"This is a sentence. This is also one.\"))\n{:probabilities (0.9999054310803004 0.9941126097177366)}\n\n(meta (tokenize \"This is a sentence.\"))\n{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}\n\n(meta (pos-tag [\"This\" \"is\" \"a\" \"sentence\" \".\"]))\n{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}\n\n(meta (chunker (pos-tag [\"This\" \"is\" \"a\" \"sentence\" \".\"])))\n{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}\n\n(meta (name-find [\"My\" \"name\" \"is\" \"John\"]))\n{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}\n```\n\n\n\nBeam Size\n---------\n\nYou can rebind ```opennlp.nlp/*beam-size*``` (the default is 3) for\nthe pos-tagger and treebank-parser with:\n\n```clojure\n(binding [*beam-size* 1]\n  (def pos-tag (make-pos-tagger \"models/en-pos-maxent.bin\")))\n```\n\n\nAdvance Percentage\n---------\n\nYou can rebind ```opennlp.treebank/*advance-percentage*``` (the default is 0.95) for\nthe treebank-parser with:\n\n```clojure\n(binding [*advance-percentage* 0.80]\n  (def parser (make-treebank-parser 
\"parser-model/en-parser-chunking.bin\")))\n```\n\n\nTreebank-parsing\n----------------\n\n\u003cb\u003eNote: Treebank parsing is very memory intensive, make sure your JVM has\na sufficient amount of memory available (using something like -Xmx512m)\nor you will run out of heap space when using a treebank parser.\u003c/b\u003e\n\nTreebank parsing gets its own section due to how complex it is.\n\nNote none of the treebank-parser model is included in the git repo, you will\nhave to download it separately from the opennlp project.\n\nCreating it:\n\n```clojure\n(def treebank-parser (make-treebank-parser \"parser-model/en-parser-chunking.bin\"))\n```\n\nTo use the treebank-parser, pass an array of sentences with their tokens\nseparated by whitespace (preferably using tokenize)\n\n```clojure\n(treebank-parser [\"This is a sentence .\"])\n[\"(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))\"]\n```\n\nIn order to transform the treebank-parser string into something a little easier\nfor Clojure to perform on, use the (make-tree ...) 
function:\n\n```clojure\n(make-tree (first (treebank-parser [\"This is a sentence .\"])))\n{:chunk {:chunk ({:chunk {:chunk \"This\", :tag DT}, :tag NP} {:chunk ({:chunk \"is\", :tag VBZ} {:chunk ({:chunk \"a\", :tag DT} {:chunk \"sentence\", :tag NN}), :tag NP}), :tag VP} {:chunk \".\", :tag .}), :tag S}, :tag TOP}\n```\n\nHere's the datastructure split into a little more readable format:\n\n```clojure\n{:tag TOP\n :chunk {:tag S\n         :chunk ({:tag NP\n                  :chunk {:tag DT\n                          :chunk \"This\"}}\n                 {:tag VP\n                  :chunk ({:tag VBZ\n                           :chunk \"is\"}\n                          {:tag NP\n                           :chunk ({:tag DT\n                                    :chunk \"a\"}\n                                   {:tag NN\n                                    :chunk \"sentence\"})})}\n                 {:tag .\n                  :chunk \".\"})}}\n```\n\nHopefully that makes it a little bit clearer: it is a nested map. If anyone else has\nany suggestions for better ways to represent this information, feel free to\nsend me an email or a patch.\n\nTreebank parsing is considered beta at this point.\n\n\nFilters\n=======\n\nFiltering pos-tagged sequences\n------------------------------\n\n```clojure\n(use 'opennlp.tools.filters)\n\n(pprint (nouns (pos-tag (tokenize \"Mr. Smith gave a car to his son on Friday.\"))))\n([\"Mr.\" \"NNP\"]\n [\"Smith\" \"NNP\"]\n [\"car\" \"NN\"]\n [\"son\" \"NN\"]\n [\"Friday\" \"NNP\"])\n\n(pprint (verbs (pos-tag (tokenize \"Mr. 
Smith gave a car to his son on Friday.\"))))\n([\"gave\" \"VBD\"])\n```\n\nFiltering treebank-chunks\n-------------------------\n\n```clojure\n(use 'opennlp.tools.filters)\n\n(pprint (noun-phrases (chunker (pos-tag (tokenize \"The override system is meant to deactivate the accelerator when the brake pedal is pressed\")))))\n({:phrase [\"The\" \"override\" \"system\"], :tag \"NP\"}\n {:phrase [\"the\" \"accelerator\"], :tag \"NP\"}\n {:phrase [\"the\" \"brake\" \"pedal\"], :tag \"NP\"})\n```\n\nCreating your own filters:\n--------------------------\n\n```clojure\n(pos-filter determiners #\"^DT\")\n#'user/determiners\n(doc determiners)\n-------------------------\nuser/determiners\n([elements__52__auto__])\n  Given a list of pos-tagged elements, return only the determiners in a list.\n\n(pprint (determiners (pos-tag (tokenize \"Mr. Smith gave a car to his son on Friday.\"))))\n([\"a\" \"DT\"])\n```\n\nYou can also create treebank-chunk filters using (chunk-filter ...)\n\n```clojure\n(chunk-filter fragments #\"^FRAG$\")\n\n(doc fragments)\n-------------------------\nopennlp.nlp/fragments\n([elements__178__auto__])\n  Given a list of treebank-chunked elements, return only the fragments in a list.\n```\n\n\nBeing Lazy\n==========\n\nThere are some functions to help you be lazy when processing text. Depending on the operation desired,\nuse the corresponding function:\n\n    #'opennlp.tools.lazy/lazy-get-sentences\n    #'opennlp.tools.lazy/lazy-tokenize\n    #'opennlp.tools.lazy/lazy-tag\n    #'opennlp.tools.lazy/lazy-chunk\n    #'opennlp.tools.lazy/sentence-seq\n\nHere's how to use them:\n\n```clojure\n(use 'opennlp.nlp)\n(use 'opennlp.treebank)\n(use 'opennlp.tools.lazy)\n\n(def get-sentences (make-sentence-detector \"models/en-sent.bin\"))\n(def tokenize (make-tokenizer \"models/en-token.bin\"))\n(def pos-tag (make-pos-tagger \"models/en-pos-maxent.bin\"))\n(def chunker (make-treebank-chunker \"models/en-chunker.bin\"))\n\n(lazy-get-sentences [\"This body of text has three 
sentences. This is the first. This is the third.\" \"This body has only two. Here's the last one.\"] get-sentences)\n;; will lazily return:\n([\"This body of text has three sentences. \" \"This is the first. \" \"This is the third.\"] [\"This body has only two. \" \"Here's the last one.\"])\n\n(lazy-tokenize [\"This is a sentence.\" \"This is another sentence.\" \"This is the third.\"] tokenize)\n;; will lazily return:\n([\"This\" \"is\" \"a\" \"sentence\" \".\"] [\"This\" \"is\" \"another\" \"sentence\" \".\"] [\"This\" \"is\" \"the\" \"third\" \".\"])\n\n(lazy-tag [\"This is a sentence.\" \"This is another sentence.\"] tokenize pos-tag)\n;; will lazily return:\n(([\"This\" \"DT\"] [\"is\" \"VBZ\"] [\"a\" \"DT\"] [\"sentence\" \"NN\"] [\".\" \".\"]) ([\"This\" \"DT\"] [\"is\" \"VBZ\"] [\"another\" \"DT\"] [\"sentence\" \"NN\"] [\".\" \".\"]))\n\n(lazy-chunk [\"This is a sentence.\" \"This is another sentence.\"] tokenize pos-tag chunker)\n;; will lazily return:\n(({:phrase [\"This\"], :tag \"NP\"} {:phrase [\"is\"], :tag \"VP\"} {:phrase [\"a\" \"sentence\"], :tag \"NP\"}) ({:phrase [\"This\"], :tag \"NP\"} {:phrase [\"is\"], :tag \"VP\"} {:phrase [\"another\" \"sentence\"], :tag \"NP\"}))\n```\n\nFeel free to use the lazy functions, but I'm still not 100% set on the\nlayout, so they may change in the future. (Maybe chaining them so\ninstead of a sequence of sentences it looks like (lazy-chunk (lazy-tag\n(lazy-tokenize (lazy-get-sentences ...))))).\n\n\u003cb\u003eGenerating a lazy sequence of sentences from a file using\nopennlp.tools.lazy/sentence-seq:\u003c/b\u003e\n\n```clojure\n(with-open [rdr (clojure.java.io/reader \"/tmp/bigfile\")]\n  (let [sentences (sentence-seq rdr get-sentences)]\n    ;; process your lazy seq of sentences however you desire\n    (println \"first 5 sentences:\")\n    (clojure.pprint/pprint (take 5 sentences))))\n```\n\n\nTraining\n--------\nThere is code to allow for training models for each of the\ntools. 
Please see the documentation in TRAINING.markdown\n\n\nLicense\n-------\nCopyright (C) 2010 Matthew Lee Hinman\n\nDistributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.\n\n\nContributors\n------------\n- Rob Zinkov - zaxtax\n- Alexandre Patry - apatry\n\n\nTODO\n----\n- \u003cdel\u003eadd method to generate lazy sequence of sentences from a file\u003c/del\u003e (done!)\n- \u003cdel\u003eDetokenizer\u003c/del\u003e (still more work to do, but it works for now)\n- Do something with parse-num for treebank parsing\n- \u003cdel\u003eSplit up treebank stuff into its own namespace\u003c/del\u003e (done!)\n- \u003cdel\u003eTreebank chunker\u003c/del\u003e (done!)\n- \u003cdel\u003eTreebank parser\u003c/del\u003e (done!)\n- \u003cdel\u003eLaziness \u003c/del\u003e (done! for now.)\n- Treebank linker (WIP)\n- \u003cdel\u003ePhrase helpers for chunker\u003c/del\u003e (done!)\n- \u003cdel\u003eFigure out what license to use.\u003c/del\u003e (done!)\n- Filters for treebank-parser\n- Return multiple probability results for treebank-parser\n- \u003cdel\u003eExplore including probability numbers\u003c/del\u003e (probability numbers added as metadata)\n- \u003cdel\u003eModel training/trainer\u003c/del\u003e (done!)\n- Revisit datastructure format for tagged sentences\n- \u003cdel\u003eDocument *beam-size* functionality\u003c/del\u003e\n- \u003cdel\u003eDocument *advance-percentage* functionality\u003c/del\u003e\n- Build a full test suite:\n-- \u003cdel\u003ecore tools\u003c/del\u003e (done)\n-- \u003cdel\u003efilters\u003c/del\u003e (done)\n-- \u003cdel\u003elaziness\u003c/del\u003e (done)\n-- training (pretty much done except for tagging)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdakrone%2Fclojure-opennlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdakrone%2Fclojure-opennlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdakrone%2Fclojure-opennlp/lists"}