Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/arbox/tokenizer
A simple tokenizer in Ruby for NLP tasks.
- Host: GitHub
- URL: https://github.com/arbox/tokenizer
- Owner: arbox
- License: other
- Created: 2011-08-23T15:38:14.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2017-04-03T05:46:02.000Z (almost 8 years ago)
- Last Synced: 2024-05-01T23:12:42.342Z (10 months ago)
- Topics: natural-language-processing, nlp, ruby, rubynlp, tokenizer
- Language: Ruby
- Homepage:
- Size: 94.7 KB
- Stars: 45
- Watchers: 5
- Forks: 11
- Open Issues: 5
Metadata Files:
- Readme: README.rdoc
- Changelog: CHANGELOG.rdoc
- License: LICENSE.rdoc
README
= Tokenizer
{RubyGems}[http://rubygems.org/gems/tokenizer] |
{Homepage}[http://bu.chsta.be/projects/tokenizer] |
{Source Code}[https://github.com/arbox/tokenizer] |
{Bug Tracker}[https://github.com/arbox/tokenizer/issues]

== DESCRIPTION
A simple multilingual tokenizer -- a linguistic tool intended to split written text
into tokens for NLP tasks. This tool provides a CLI and a library for
linguistic tokenization, which is an unavoidable preprocessing step for many HLT
(Human Language Technology) tasks and a prerequisite for further syntactic,
semantic and other higher-level processing.

Tokenization involves sentence segmentation, word segmentation and boundary
disambiguation for both tasks.

Use it for tokenization of German, English and Dutch texts.
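The word-segmentation step described above can be sketched in a few lines of plain Ruby (an illustration of the idea of splitting off leading and trailing punctuation, not this gem's actual implementation; the PRE and POST character lists here are made up for the example):

```ruby
# Minimal whitespace tokenizer sketch: split on whitespace, then peel
# leading (PRE) and trailing (POST) punctuation off every token.
# Illustrative character lists -- not the gem's own.
PRE  = ['(', '[', '"'].freeze
POST = ['.', ',', ';', ':', '!', '?', ')', ']', '"'].freeze

def tokenize(text)
  text.split.flat_map do |chunk|
    pre, post = [], []
    # Detach prefix characters from the front of the token.
    pre << chunk.slice!(0) while chunk.length > 1 && PRE.include?(chunk[0])
    # Detach suffix characters from the end, preserving their order.
    post.unshift(chunk.slice!(-1)) while chunk.length > 1 && POST.include?(chunk[-1])
    pre + [chunk] + post
  end
end

tokenize('Ich gehe in die Schule!')
# => ["Ich", "gehe", "in", "die", "Schule", "!"]
```

A real tokenizer additionally has to handle abbreviations, numbers and sentence boundaries; this sketch only shows the PRE/POST splitting step.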
=== Implemented Algorithms
to be ...

== INSTALLATION
+Tokenizer+ is provided as a .gem package. Simply install it via
{RubyGems}[http://rubygems.org/gems/tokenizer].

To install +tokenizer+ issue the following command:

  $ gem install tokenizer

If you want to do a system-wide installation, do this as root
(possibly using +sudo+).

Alternatively, use your Gemfile for dependency management.
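With Bundler, the dependency line is as simple as (no version constraint shown, since the README does not pin one):

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'tokenizer'
```

Then run `bundle install` to resolve and install the gem.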
== SYNOPSIS
You can use +Tokenizer+ in two ways.
* As a command line tool:

    $ echo 'Hi, ich gehe in die Schule!' | tokenize

* As a library for embedded tokenization:

    > require 'tokenizer'
    > de_tokenizer = Tokenizer::WhitespaceTokenizer.new
    > de_tokenizer.tokenize('Ich gehe in die Schule!')
    > => ["Ich", "gehe", "in", "die", "Schule", "!"]

* Customizable PRE and POST lists:

    > require 'tokenizer'
    > de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de, post: Tokenizer::Tokenizer::POST + ['|'])
    > de_tokenizer.tokenize('Ich gehe|in die Schule!')
    > => ["Ich", "gehe", "|in", "die", "Schule", "!"]

See documentation in the Tokenizer::WhitespaceTokenizer class for details
on particular methods.

== SUPPORT
If you have questions, bug reports or any suggestions, please drop me an email :)
Any help is deeply appreciated!

== CHANGELOG

For details on future plans and work in progress see CHANGELOG.rdoc.

== CAUTION
This library is a work in progress! Though the interface is mostly complete,
you might encounter some unimplemented features.

Please contact me with your suggestions, bug reports and feature requests.
== LICENSE
+Tokenizer+ is a copyrighted software by Andrei Beliankou, 2011-
You may use, redistribute and change it under the terms provided
in the LICENSE.rdoc file.