Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/zencephalon/tactful_tokenizer

Accurate Bayesian sentence tokenizer in Ruby.
https://github.com/zencephalon/tactful_tokenizer

nlp ruby rubynlp

Last synced: 1 day ago
JSON representation

Accurate Bayesian sentence tokenizer in Ruby.

Awesome Lists containing this project

README

        

= TactfulTokenizer

{Gem Version}[http://badge.fury.io/rb/tactful_tokenizer]
{Build Status}[https://travis-ci.org/zencephalon/Tactful_Tokenizer]
{}[https://codeclimate.com/github/zencephalon/Tactful_Tokenizer]
{Coverage Status}[https://coveralls.io/r/zencephalon/Tactful_Tokenizer?branch=release]

TactfulTokenizer is a Ruby library for high quality sentence
tokenization. It uses a Naive Bayesian statistical model, and
is based on Splitta[http://code.google.com/p/splitta/], but
has support for '?' and '!' as well as primitive handling of
XHTML markup. Better support for XHTML parsing is coming shortly.

Additionally supports unicode text tokenization.

== Usage

require "tactful_tokenizer"
m = TactfulTokenizer::Model.new
m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? Yes. Maybe!")
#=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "Yes.", "Maybe!"]

The input text is expected to consist of paragraphs delimited
by line breaks.

== Installation
gem install tactful_tokenizer

== Author

Copyright (c) 2010 Matthew Bunday. All rights reserved.
Released under the {GNU GPL v3}[http://www.gnu.org/licenses/gpl.html].