{"id":19031479,"url":"https://github.com/plandes/clj-nlp-feature","last_synced_at":"2026-05-03T07:30:19.797Z","repository":{"id":80107487,"uuid":"69512774","full_name":"plandes/clj-nlp-feature","owner":"plandes","description":"Natural Language Feature Creation","archived":false,"fork":false,"pushed_at":"2018-06-20T01:44:40.000Z","size":5950,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-02T04:14:38.567Z","etag":null,"topics":["clojure","feature-engineering","machine-learning","natural-language","natural-language-processing"],"latest_commit_sha":null,"homepage":null,"language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/plandes.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-09-28T23:46:09.000Z","updated_at":"2018-08-24T05:59:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"695e104b-d3f9-4472-bde3-5706f8760901","html_url":"https://github.com/plandes/clj-nlp-feature","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-feature","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-feature/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-feature/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plandes%2Fclj-nlp-feature/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/plandes","download_url":"https://codeload.github.com/plandes/clj-nlp-feature/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240080780,"owners_count":19744927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","feature-engineering","machine-learning","natural-language","natural-language-processing"],"created_at":"2024-11-08T21:23:46.807Z","updated_at":"2026-05-03T07:30:19.757Z","avatar_url":"https://github.com/plandes.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Natural Language Feature Creation\n\n[![Travis CI Build Status][travis-badge]][travis-link]\n\n  [travis-link]: https://travis-ci.org/plandes/clj-nlp-feature\n  [travis-badge]: https://travis-ci.org/plandes/clj-nlp-feature.svg?branch=master\n\nThis library provides simple character- and token-based feature creation\nfunctions.  For additional feature libraries and examples of how to use this\nlibrary, see the [NLP parse library](https://github.com/plandes/clj-nlp-parse).\n\nFeatures (creation):\n* [WordNet]\n  * [Dictionary features](https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.word.html#var-dictionary-features)\n* [Token statistics]\n  * Average character length\n  * Mention count\n  * Sentence count\n  * Stopword count\n  * Interrogative indication\n* [Character statistics]\n  * Capital tokens\n  * Punctuation\n  * Unicode\n  * Repeating characters\n  * Latin vs. 
Non-latin character sets\n* [Feature utilities]\n  * End/begin of sentence\n  * Ratio functions\n\n\u003c!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc --\u003e\n## Table of Contents\n\n- [Obtaining](#obtaining)\n- [Documentation](#documentation)\n- [Usage](#usage)\n- [Citation](#citation)\n- [Building](#building)\n- [Changelog](#changelog)\n- [Citation](#citation-1)\n- [References](#references)\n- [License](#license)\n\n\u003c!-- markdown-toc end --\u003e\n\n\n## Obtaining\n\nIn your `project.clj` file, add:\n\n[![Clojars Project](https://clojars.org/com.zensols.nlp/feature/latest-version.svg)](https://clojars.org/com.zensols.nlp/feature/)\n\n## Documentation\n\n\nAPI documentation:\n* [Clojure](https://plandes.github.io/clj-nlp-feature/codox/index.html)\n* [Java](https://plandes.github.io/clj-nlp-feature/apidocs/index.html)\n\n\n## Usage\n\nThe following illustrates how to create character- and token-based features:\n```clojure\n;; s and log are assumed to alias clojure.string and clojure.tools.logging\n(:require [clojure.string :as s]\n          [clojure.tools.logging :as log]\n          [zensols.nlparse.feature.char :as cf]\n          [zensols.nlparse.feature.word :as w])\n\n(defn- tokenize [utterance]\n  (-\u003e\u003e (s/split utterance #\"\\s+\")\n       (map #(hash-map :text %))))\n\n(defn calc-feature-1 [tokens]\n  (log/debugf \"calculating features for \u003c%s\u003e\" (pr-str tokens))\n  (merge (cf/capital-features tokens)\n         (cf/unicode-features tokens 1)))\n\n;; in a different namespace to calculate features for a different model...\n(defn calc-feature-2 [tokens]\n  (log/debugf \"calculating features for \u003c%s\u003e\" (pr-str tokens))\n  (merge (cf/capital-features tokens)\n         (w/dictionary-features tokens)))\n\n(let [tokens (-\u003e\u003e \"My name is Paul\" tokenize)\n      f1-features (calc-feature-1 tokens)\n      f2-features (calc-feature-2 tokens)]\n  (clojure.pprint/pprint {:f1 f1-features\n                          :f2 f2-features}))\n```\n\nIn this example, we're creating features for two different models in the\n`calc-feature-*` 
functions.  This is common when models share some\nfeatures.  However, we're recalculating the capital case\nfeatures with `cf/capital-features`.  We have to do this in cases where our\nfeature generation is in different namespaces or even different libraries/jars.\n\nFortunately, this library provides a way to avoid recreating these features, as\nshown below:\n```clojure\n(defn calc-feature-1 [tokens]\n  (log/debugf \"calculating features for \u003c%s\u003e\" (pr-str tokens))\n  (c/combine-features (cf/capital-features tokens)\n                      (cf/unicode-features tokens 1)))\n\n;; in a different namespace to calculate features for a different model...\n(defn calc-feature-2 [tokens]\n  (log/debugf \"calculating features for \u003c%s\u003e\" (pr-str tokens))\n  (c/combine-features (cf/capital-features tokens)\n                      (w/dictionary-features tokens)))\n\n(let [tokens (-\u003e\u003e \"My name is Paul\" tokenize)\n      f1-features (calc-feature-1 tokens)\n      f2-features (calc-feature-2 tokens)]\n  (clojure.pprint/pprint {:f1 f1-features\n                          :f2 f2-features}))\n```\nWe replace `merge` with `c/combine-features`, which adds these features to a\nmap held in an atom.  
For features that have already been created, namely those from\n`cf/capital-features`, the function is not invoked a second time; the value\nalready stored in the atom's map is used instead.\n\n\n## Citation\n\nThere are two utilities for looking up words:\n* [WordNet]: wraps [this library](http://extjwnl.sourceforge.net)\n* Word lists: English word lists taken from [this repo](https://github.com/dwyl/english-words)\n\nThese libraries are available as features through the\n`dictionary-features` function found [here](https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.word.html#var-dictionary-features).\n\nAll other [word lists](ftp://ftp.gnu.org/gnu/aspell/dict/0index.html) come from\nthe [GNU Aspell](http://aspell.net) dictionaries.\n\n\n## Building\n\nTo build from source, do the following:\n\n- Install [Leiningen](http://leiningen.org) (this is just a script)\n- Install [GNU make](https://www.gnu.org/software/make/)\n- Install [Git](https://git-scm.com)\n- Download the source: `git clone https://github.com/plandes/clj-nlp-feature \u0026\u0026 cd clj-nlp-feature`\n- Download the make include files:\n```bash\nmkdir ../clj-zenbuild \u0026\u0026 wget -O - https://api.github.com/repos/plandes/clj-zenbuild/tarball | tar zxfv - -C ../clj-zenbuild --strip-components 1\n```\n- Compile: `make compile` to compile or `make install` to install in your local\n  Maven repo.\n\n\n## Changelog\n\nAn extensive changelog is available [here](CHANGELOG.md).\n\n\n## Citation\n\nIf you use this software in your research, please cite with the following\nBibTeX:\n\n```jflex\n@misc{plandes-clj-nlp-feature,\n  author = {Paul Landes},\n  title = {Natural Language Feature Creation},\n  year = {2018},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/plandes/clj-nlp-feature}}\n}\n```\n\n\n## References\n\n```jflex\n@Book{wordnet1998,\n  title = {WordNet: An Electronic Lexical Database},\n  author = {Christiane Fellbaum},\n  year = {1998},\n  publisher 
= {Bradford Books},\n}\n```\n\n\n## License\n\nCopyright © 2016, 2017, 2018 Paul Landes\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n\n\n\u003c!-- links --\u003e\n[WordNet]: https://wordnet.princeton.edu\n[Token statistics]: https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.word.html\n[Character statistics]: https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.char.html\n[Feature utilities]: https://plandes.github.io/clj-nlp-feature/codox/zensols.nlparse.feature.util.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplandes%2Fclj-nlp-feature","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplandes%2Fclj-nlp-feature","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplandes%2Fclj-nlp-feature/lists"}