{"id":21815580,"url":"https://github.com/arbox/tokenizer","last_synced_at":"2026-03-03T10:31:19.984Z","repository":{"id":56916244,"uuid":"2256010","full_name":"arbox/tokenizer","owner":"arbox","description":"A simple tokenizer in Ruby for NLP tasks.","archived":false,"fork":false,"pushed_at":"2017-04-03T05:46:02.000Z","size":97,"stargazers_count":46,"open_issues_count":5,"forks_count":11,"subscribers_count":3,"default_branch":"master","last_synced_at":"2026-02-23T04:31:44.699Z","etag":null,"topics":["natural-language-processing","nlp","ruby","rubynlp","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arbox.png","metadata":{"files":{"readme":"README.rdoc","changelog":"CHANGELOG.rdoc","contributing":null,"funding":null,"license":"LICENSE.rdoc","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-08-23T15:38:14.000Z","updated_at":"2025-11-19T19:28:59.000Z","dependencies_parsed_at":"2022-08-21T03:50:47.317Z","dependency_job_id":null,"html_url":"https://github.com/arbox/tokenizer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/arbox/tokenizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arbox%2Ftokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arbox%2Ftokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arbox%2Ftokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arbox%2Ftokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arbox","download_url":"https://codeload.github.com/arbox/tokenizer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arbox%2Ftokenizer/sbom","scorecard":{"id":205324,"data":{"date":"2025-08-11","repo":{"name":"github.com/arbox/tokenizer","commit":"ad98badc6260aef08e65f1ecac43c712301336b6"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev 
branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE.rdoc:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release 
artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-16T23:36:03.838Z","repository_id":56916244,"created_at":"2025-08-16T23:36:03.838Z","updated_at":"2025-08-16T23:36:03.838Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29976279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T16:35:47.903Z","status":"ssl_error","status_checked_at":"2026-03-01T16:35:44.899Z","response_time":124,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","ruby","rubynlp","tokenizer"],"created_at":"2024-11-27T15:20:02.070Z","updated_at":"2026-03-03T10:31:19.958Z","avatar_url":"https://github.com/arbox.png","language":"Ruby","readme":"= Tokenizer\n\n{RubyGems}[http://rubygems.org/gems/tokenizer] |\n{Homepage}[http://bu.chsta.be/projects/tokenizer] |\n{Source Code}[https://github.com/arbox/tokenizer] |\n{Bug Tracker}[https://github.com/arbox/tokenizer/issues]\n\n{\u003cimg src=\"https://img.shields.io/gem/v/tokenizer.svg\" alt=\"Gem Version\" /\u003e}[https://rubygems.org/gems/tokenizer]\n{\u003cimg src=\"https://img.shields.io/travis/arbox/tokenizer.svg\" alt=\"Build Status\" /\u003e}[https://travis-ci.org/arbox/tokenizer]\n{\u003cimg src=\"https://img.shields.io/codeclimate/github/arbox/tokenizer.svg\" alt=\"Code Climate\" /\u003e}[https://codeclimate.com/github/arbox/tokenizer]\n{\u003cimg src=\"https://img.shields.io/gemnasium/arbox/tokenizer.svg\" alt=\"Dependency Status\" /\u003e}[https://gemnasium.com/arbox/tokenizer]\n\n== DESCRIPTION\nA simple multilingual tokenizer -- a linguistic tool intended to split a written text\ninto tokens for NLP tasks. This tool provides a CLI and a library for\nlinguistic tokenization which is an anavoidable step for many HLT (Human\nLanguage Technology) tasks in the preprocessing phase for further syntactic,\nsemantic  and other higher level processing goals.\n\nTokenization task involves Sentence Segmentation, Word Segmentation and Boundary\nDisambiguation for the both tasks.\n\nUse it for tokenization of German, English and Dutch texts.\n\n=== Implemented Algorithms\nto be ...\n\n== INSTALLATION\n+Tokenizer+ is provided as a .gem package. 
Simply install it via\n{RubyGems}[http://rubygems.org/gems/tokenizer].\n\nTo install +tokenizer+, issue the following command:\n  $ gem install tokenizer\n\nIf you want to do a system-wide installation, do this as root\n(possibly using +sudo+).\n\nAlternatively, use your Gemfile for dependency management.\n\n== SYNOPSIS\n\nYou can use +Tokenizer+ in two ways.\n* As a command line tool:\n    $ echo 'Hi, ich gehe in die Schule!' | tokenize\n\n* As a library for embedded tokenization:\n    \u003e require 'tokenizer'\n    \u003e de_tokenizer = Tokenizer::WhitespaceTokenizer.new\n    \u003e de_tokenizer.tokenize('Ich gehe in die Schule!')\n    \u003e =\u003e [\"Ich\", \"gehe\", \"in\", \"die\", \"Schule\", \"!\"]\n\n* Customizable PRE and POST lists\n    \u003e require 'tokenizer'\n    \u003e de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de, { post: Tokenizer::Tokenizer::POST + ['|'] })\n    \u003e de_tokenizer.tokenize('Ich gehe|in die Schule!')\n    \u003e =\u003e [\"Ich\", \"gehe\", \"|in\", \"die\", \"Schule\", \"!\"]\n\nSee documentation in the Tokenizer::WhitespaceTokenizer class for details\non particular methods.\n\n== SUPPORT\n\nIf you have questions, bug reports or any suggestions, please drop me an email :)\nAny help is deeply appreciated!\n\n== CHANGELOG\nFor details on future plans and work in progress see CHANGELOG.rdoc.\n\n== CAUTION\nThis library is \u003cb\u003ework in progress\u003c/b\u003e! Though the interface is mostly complete,\nyou might encounter some features that are not yet implemented.\n\nPlease contact me with your suggestions, bug reports and feature requests.\n\n== LICENSE\n\n+Tokenizer+ is copyrighted software by Andrei Beliankou, 2011-\n\nYou may use, redistribute and change it under the terms provided\nin the LICENSE.rdoc file.\n","funding_links":[],"categories":["Language Parsing Tools","NLP Pipeline Subtasks"],"sub_categories":["NLP / NLU","Segmentation"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farbox%2Ftokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farbox%2Ftokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farbox%2Ftokenizer/lists"}