{"id":13412006,"url":"https://github.com/abitdodgy/words_counted","last_synced_at":"2025-12-24T06:14:04.794Z","repository":{"id":16545629,"uuid":"19299241","full_name":"abitdodgy/words_counted","owner":"abitdodgy","description":"A Ruby natural language processor.","archived":false,"fork":false,"pushed_at":"2021-10-28T12:40:38.000Z","size":103,"stargazers_count":159,"open_issues_count":7,"forks_count":29,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-10-04T19:16:20.129Z","etag":null,"topics":["natural-language-processing","nlp","ruby","rubynlp","word-counter","wordcount","wordscounter"],"latest_commit_sha":null,"homepage":"http://rubywordcount.com","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abitdodgy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-04-30T03:07:05.000Z","updated_at":"2024-04-15T22:06:42.000Z","dependencies_parsed_at":"2022-09-09T18:00:19.145Z","dependency_job_id":null,"html_url":"https://github.com/abitdodgy/words_counted","commit_stats":null,"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fwords_counted","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fwords_counted/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fwords_counted/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitdodgy%2Fwords_counted/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abitdodgy","download_url":"https://c
odeload.github.com/abitdodgy/words_counted/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243618717,"owners_count":20320282,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","ruby","rubynlp","word-counter","wordcount","wordscounter"],"created_at":"2024-07-30T20:01:20.063Z","updated_at":"2025-12-24T06:14:04.787Z","avatar_url":"https://github.com/abitdodgy.png","language":"Ruby","funding_links":[],"categories":["Ruby","Natural Language Processing","NLP Pipeline Subtasks"],"sub_categories":["Lexical Processing"],"readme":"# WordsCounted\n\n\u003e We are all in the gutter, but some of us are looking at the stars.\n\u003e\n\u003e -- Oscar Wilde\n\nWordsCounted is a Ruby NLP (natural language processor). 
WordsCounted lets you implement powerful tokenisation strategies with a very flexible tokeniser class.\n\n**Are you using WordsCounted to do something interesting?** Please [tell me about it][8].\n\n\u003ca href=\"http://badge.fury.io/rb/words_counted\"\u003e\n  \u003cimg src=\"https://badge.fury.io/rb/words_counted@2x.png\" alt=\"Gem Version\" height=\"18\"\u003e\n\u003c/a\u003e\n\n[RubyDoc documentation][7].\n\n### Demo\n\nVisit [this website][4] for one example of what you can do with WordsCounted.\n\n### Features\n\n* Out of the box, get the following data from any string or readable file, or URL:\n    * Token count and unique token count\n    * Token densities, frequencies, and lengths\n    * Char count and average chars per token\n    * The longest tokens and their lengths\n    * The most frequent tokens and their frequencies.\n* A flexible way to exclude tokens from the tokeniser. You can pass a **string**, **regexp**, **symbol**, **lambda**, or an **array** of any combination of those types for powerful tokenisation strategies.\n* Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (UTF and Unicode characters): *Bayrūt* is treated as `[\"Bayrūt\"]` and not `[\"Bayr\", \"ū\", \"t\"]`, for example.\n* Opens and reads files. 
Pass in a file path or a URL instead of a string.\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n    gem 'words_counted'\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install words_counted\n\n## Usage\n\nPass in a string or a file path, and an optional filter and/or regexp.\n\n```ruby\ncounter = WordsCounted.count(\n  \"We are all in the gutter, but some of us are looking at the stars.\"\n)\n\n# Using a file\ncounter = WordsCounted.from_file(\"path/or/url/to/my/file.txt\")\n```\n\n`.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `WordsCounted::Counter` initialized with the tokens. The `WordsCounted::Tokeniser` and `WordsCounted::Counter` classes can be used alone, however.\n\n## API\n\n### WordsCounted\n\n**`WordsCounted.count(input, options = {})`**\n\nTokenises input and initializes a `WordsCounted::Counter` object with the resulting tokens.\n\n```ruby\ncounter = WordsCounted.count(\"Hello Beirut!\")\n```\n\nAccepts two options: `exclude` and `pattern`. See [Excluding tokens from the tokeniser][5] and [Passing in a custom regexp][6] respectively.\n\n**`WordsCounted.from_file(path, options = {})`**\n\nReads and tokenises a file, and initializes a `WordsCounted::Counter` object with the resulting tokens.\n\n```ruby\ncounter = WordsCounted.from_file(\"hello_beirut.txt\")\n```\n\nAccepts the same options as `.count`.\n\n### Tokeniser\n\nThe tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules as long as they can boil down into a lambda.\n\nOut of the box the tokeniser includes only alpha chars. 
Hyphenated tokens and tokens with apostrophes are considered a single token.\n\n**`#tokenise([pattern: TOKEN_REGEXP, exclude: nil])`**\n\n```ruby\ntokeniser = WordsCounted::Tokeniser.new(\"Hello Beirut!\").tokenise\n\n# With `exclude`\ntokeniser = WordsCounted::Tokeniser.new(\"Hello Beirut!\").tokenise(exclude: \"hello\")\n\n# With `pattern`\ntokeniser = WordsCounted::Tokeniser.new(\"I \u003c3 Beirut!\").tokenise(pattern: /[a-z]/i)\n```\n\nSee [Excluding tokens from the tokeniser][5] and [Passing in a custom regexp][6] for more information.\n\n### Counter\n\nThe `WordsCounted::Counter` class allows you to collect various statistics from an array of tokens.\n\n**`#token_count`**\n\nReturns the token count of a given string.\n\n```ruby\ncounter.token_count #=\u003e 15\n```\n\n**`#token_frequency`**\n\nReturns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.\n\n```ruby\ncounter.token_frequency\n\n[\n  [\"the\", 2],\n  [\"are\", 2],\n  [\"we\",  1],\n  # ...\n  [\"all\", 1]\n]\n```\n\n**`#most_frequent_tokens`**\n\nReturns a hash where each key-value pair is a token and its frequency.\n\n```ruby\ncounter.most_frequent_tokens\n\n{ \"are\" =\u003e 2, \"the\" =\u003e 2 }\n```\n\n**`#token_lengths`**\n\nReturns a sorted (unstable) two-dimensional array where each element contains a token and its length. The array is sorted by length in descending order.\n\n```ruby\ncounter.token_lengths\n\n[\n  [\"looking\", 7],\n  [\"gutter\",  6],\n  [\"stars\",   5],\n  # ...\n  [\"in\",      2]\n]\n```\n\n**`#longest_tokens`**\n\nReturns a hash where each key-value pair is a token and its length.\n\n\n```ruby\ncounter.longest_tokens\n\n{ \"looking\" =\u003e 7 }\n```\n\n**`#token_density([ precision: 2 ])`**\n\nReturns a sorted (unstable) two-dimensional array where each element contains a token and its density as a float, rounded to a precision of two. 
The array is sorted by density in descending order. It accepts a `precision` argument, which must be a float.\n\n```ruby\ncounter.token_density\n\n[\n  [\"are\",     0.13],\n  [\"the\",     0.13],\n  [\"but\",     0.07 ],\n  # ...\n  [\"we\",      0.07 ]\n]\n```\n\n**`#char_count`**\n\nReturns the char count of tokens.\n\n```ruby\ncounter.char_count #=\u003e 76\n```\n\n**`#average_chars_per_token([ precision: 2 ])`**\n\nReturns the average char count per token rounded to two decimal places. Accepts a precision argument which defaults to two. Precision must be a float.\n\n```ruby\ncounter.average_chars_per_token #=\u003e 4\n```\n\n**`#uniq_token_count`**\n\nReturns the number of unique tokens.\n\n```ruby\ncounter.uniq_token_count #=\u003e 13\n```\n\n## Excluding tokens from the tokeniser\n\nYou can exclude anything you want from the input by passing the `exclude` option. The exclude option accepts a variety of filters and is extremely flexible.\n\n1. A *space-delimited* string. The filter will normalise the string.\n2. A regular expression.\n3. A lambda.\n4. A symbol that names a predicate method.  For example `:odd?`.\n5. An array of any combination of the above.\n\n```ruby\ntokeniser =\n  WordsCounted::Tokeniser.new(\n    \"Magnificent! That was magnificent, Trevor.\"\n  )\n\n# Using a string\ntokeniser.tokenise(exclude: \"was magnificent\")\n# =\u003e [\"that\", \"trevor\"]\n\n# Using a regular expression\ntokeniser.tokenise(exclude: /trevor/)\n# =\u003e [\"magnificent\", \"that\", \"was\", \"magnificent\"]\n\n# Using a lambda\ntokeniser.tokenise(exclude: -\u003e(t) { t.length \u003c 4 })\n# =\u003e [\"magnificent\", \"that\", \"magnificent\", \"trevor\"]\n\n# Using symbol\ntokeniser = WordsCounted::Tokeniser.new(\"Hello! محمد\")\ntokeniser.tokenise(exclude: :ascii_only?)\n# =\u003e [\"محمد\"]\n\n# Using an array\ntokeniser = WordsCounted::Tokeniser.new(\n  \"Hello! 
اسماءنا هي محمد، كارولينا، سامي، وداني\"\n)\ntokeniser.tokenise(\n  exclude: [:ascii_only?, /محمد/, -\u003e(t) { t.length \u003e 6}, \"و\"]\n)\n# =\u003e [\"هي\", \"سامي\", \"وداني\"]\n```\n\n## Passing in a custom regexp\n\nThe default regexp accounts for letters, hyphenated tokens, and apostrophes. This means *twenty-one* is treated as one token. So is *Mohamad's*.\n\n```ruby\n/[\\p{Alpha}\\-']+/\n```\n\nYou can pass your own criteria as a Ruby regular expression to split your string as desired.\n\nFor example, if you wanted to include numbers, you can override the regular expression:\n\n```ruby\ncounter = WordsCounted.count(\"Numbers 1, 2, and 3\", pattern: /[\\p{Alnum}\\-']+/)\ncounter.tokens\n#=\u003e [\"numbers\", \"1\", \"2\", \"and\", \"3\"]\n```\n\n## Opening and reading files\n\nUse the `from_file` method to open files. `from_file` accepts the same options as `.count`. The file path can be a URL.\n\n```ruby\ncounter = WordsCounted.from_file(\"url/or/path/to/file.text\")\n```\n\n## Gotchas\n\nA hyphen used in lieu of an *em* or *en* dash will form part of the token. This affects the tokeniser algorithm.\n\n```ruby\ncounter = WordsCounted.count(\"How do you do?-you are well, I see.\")\ncounter.token_frequency\n\n[\n  [\"do\",   2],\n  [\"how\",  1],\n  [\"you\",  1],\n  [\"-you\", 1], # WTF, mate!\n  [\"are\",  1],\n  # ...\n]\n```\n\nIn this example `-you` and `you` are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.\n\n### A note on case sensitivity\n\nThe program will normalise (downcase) all incoming strings for consistency and filters.\n\n## Roadmap\n\n### Ability to open URLs\n\n```ruby\ndef self.from_url\n  # open url and send string here after removing html\nend\n```\n\n## Contributors\n\nSee [contributors][3].\n\n## Contributing\n\n1. Fork it\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. 
Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. Create new Pull Request\n\n  [2]: http://www.rubydoc.info/gems/words_counted\n  [3]: https://github.com/abitdodgy/words_counted/graphs/contributors\n  [4]: http://rubywordcount.com\n  [5]: https://github.com/abitdodgy/words_counted#excluding-tokens-from-the-analyser\n  [6]: https://github.com/abitdodgy/words_counted#passing-in-a-custom-regexp\n  [7]: http://www.rubydoc.info/gems/words_counted/\n  [8]: https://github.com/abitdodgy/words_counted/issues/new\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabitdodgy%2Fwords_counted","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabitdodgy%2Fwords_counted","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabitdodgy%2Fwords_counted/lists"}