{"id":18717276,"url":"https://github.com/thisiscetin/textoken","last_synced_at":"2025-07-09T23:33:22.857Z","repository":{"id":62558845,"uuid":"43002956","full_name":"thisiscetin/textoken","owner":"thisiscetin","description":"Simple and customizable text tokenization gem.","archived":false,"fork":false,"pushed_at":"2021-09-28T16:08:31.000Z","size":97,"stargazers_count":31,"open_issues_count":0,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-05T03:45:41.173Z","etag":null,"topics":["nlp","ruby","rubynlp","tokenization"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thisiscetin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-23T13:34:29.000Z","updated_at":"2023-02-22T12:47:28.000Z","dependencies_parsed_at":"2022-11-03T11:15:19.817Z","dependency_job_id":null,"html_url":"https://github.com/thisiscetin/textoken","commit_stats":null,"previous_names":["manorie/textoken","c7n0/textoken"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/thisiscetin/textoken","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisiscetin%2Ftextoken","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisiscetin%2Ftextoken/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisiscetin%2Ftextoken/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisiscetin%2Ftextoken/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thisiscetin","download_url":"https://codeload.github.com/thi
siscetin/textoken/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisiscetin%2Ftextoken/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264505262,"owners_count":23618911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","ruby","rubynlp","tokenization"],"created_at":"2024-11-07T13:15:37.942Z","updated_at":"2025-07-09T23:33:22.834Z","avatar_url":"https://github.com/thisiscetin.png","language":"Ruby","readme":"# [Textoken](//github.com/manorie/textoken)\n\n[![Build Status](https://travis-ci.org/manorie/textoken.svg?branch=development)](https://travis-ci.org/manorie/textoken?branch=development)\n[![Coverage Status](https://coveralls.io/repos/manorie/textoken/badge.svg?branch=development\u0026service=github)](https://coveralls.io/github/manorie/textoken?branch=development)\n[![Code Climate](https://codeclimate.com/github/manorie/textoken/badges/gpa.svg)](https://codeclimate.com/github/manorie/textoken)\n[![Gem Version](https://badge.fury.io/rb/textoken.svg)](http://badge.fury.io/rb/textoken)\n\nTextoken is a Ruby library for text tokenization. This gem extracts words from text with many customizations. It can be used in many fields like Web Crawling and Natural Language Processing.\n\n## Basic Usage\n\n```ruby\nrequire 'textoken'\n\nTextoken('Software is like sex: it\\'s better when it\\'s free. 
\\'Linus Tolvards\\'').tokens\n# =\u003e [\"Software\", \"is\", \"like\", \"sex\", \":\", \"it\", \"'\", \"s\", \"better\", \"when\", \"it\", \"'\", \"s\", \"free\", \".\", \"'\", \"Linus\", \"Tolvards\", \"'\"]\n\nTextoken('Oh, no! Alfa is at home.').tokens\n# =\u003e [\"Oh\", \",\", \"no\", \"!\", \"Alfa\", \"is\", \"at\", \"home\", \".\"]\n\nTextoken('Oh, no! Alfa is at home.').words\n# =\u003e [\"Oh,\", \"no!\", \"Alfa\", \"is\", \"at\", \"home.\"]\n```\n\n## Customization\n\n```ruby\nrequire 'textoken'\n\nTextoken('Oh, no! Alfa is at home.', only: 'punctuations').tokens\n# =\u003e [\"Oh\", \",\", \"no\", \"!\", \"home\", \".\"]\n\nTextoken('Oh, no! Alfa is at home.', exclude: 'punctuations', more_than: 3).tokens\n# =\u003e [\"Alfa\"]\n\nTextoken('Oh, no! Alfa is at 01/01/2000 with $1000.', only: 'dates, numerics').words\n# =\u003e [\"01/01/2000\", \"$1000.\"]\n\nTextoken('Oh, no! Alfa 2000 is at home.', only_regexp: '^[0-9]*$').tokens\n# =\u003e [\"2000\"]\n```\n\nYou can combine all options. 
'Only' and 'Exclude' options support multiple option values like **only: 'punctuations, dates, numerics'**\n\nThe public interface of Textoken exposes two methods, **tokens** and **words**:\n\n```ruby\nTextoken('Alfa.').tokens\n# =\u003e [\"Alfa\", \".\"]\n# =\u003e splits punctuations by default whereas,\n\nTextoken('Alfa.').words\n# =\u003e [\"Alfa.\"]\n# =\u003e does not split punctuations.\n```\n\n## Current Options\n\n- **only:** Accepts any regexp defined in [option_values.yml](//github.com/manorie/textoken/blob/development/lib/textoken/regexps/option_values.yml)\n\n- **only_regexp:** Accepts any regexp, but only one regexp can be given.\n\n- **exclude:** Accepts any regexp defined in [option_values.yml](https://github.com/manorie/textoken/blob/development/lib/textoken/regexps/option_values.yml)\n\n- **exclude_regexp:** Accepts any regexp, but only one regexp can be given.\n\n- **less_than:** Accepts any integer greater than 1.\n\n- **more_than:** Accepts any positive integer.\n\n## Option Meanings\n\n- **only:** If a word in the text matches the given regexp or regexps, the only option includes it in the result.\n\n- **only_regexp:** If a word in the text matches the user-given regexp, the only_regexp option includes it in the result.\n\n- **exclude:** If a word in text does not have a regexp at some part, exclude option excludes it from result. Opposite of only.\n\n- **exclude_regexp:** If a word in text does not have user given regexp at some part, exclude_regexp option excludes it from result. 
Opposite of only_regexp.\n\n- **less_than:** Filters the result to words whose length is less than the given option value.\n\n- **more_than:** Filters the result to words whose length is greater than the given option value.\n\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n    gem 'textoken'\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install textoken\n\n\n## Supported Ruby Versions\n\nThis library aims to support and is tested against the following Ruby\nimplementations:\n\n* Ruby 2.0.0\n* Ruby 2.1\n* Ruby 2.2.5\n* Ruby 2.3.1\n* Ruby 2.4.6\n* Ruby 2.5.5\n* Ruby 2.6.3\n* Ruby ruby-head\n\n* [JRuby](http://jruby.org/)\n\nIf something doesn't work on one of these versions, it's a bug.\nThis library may also work (or seem to work) on other Ruby versions or implementations; however, support will only be provided for the implementations listed above.\n\n## Contributing\n\nFeel free to add any regexp to lib/regexps/option_values.yml, but please add a simple test to the 'single options' part of textoken_spec.rb\n\n1. Fork it\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. Create new Pull Request\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthisiscetin%2Ftextoken","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthisiscetin%2Ftextoken","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthisiscetin%2Ftextoken/lists"}