https://github.com/louismullie/scalpel

A fast and accurate rule-based sentence segmentation tool for Ruby.

[![Build Status](https://secure.travis-ci.org/louismullie/scalpel.png)](http://travis-ci.org/#!/louismullie/scalpel)

**About**

Scalpel is the result of my inability to find a simple and elegant solution to sentence segmentation in Ruby. Machine learning approaches, both unsupervised ([punkt-segmenter](https://github.com/lfcipriani/punkt-segmenter)) and supervised ([tactful_tokenizer](https://github.com/SlyShy/Tactful_Tokenizer)), depend on proper domain-specific training to work well. Stanford's tokenize-first, group-later method ([stanford-core-nlp](https://github.com/louismullie/stanford-core-nlp)) does not cope well with ill-formatted content. Finally, extensive rule-based methods ([srx-english](https://github.com/apohllo/srx-english)) are very accurate but run slowly.

Scalpel is based on a simple principle that reduces the complexity of sentence segmentation: it is simpler and more efficient to find the periods that do __not__ end a sentence than to find those that do. These periods are temporarily replaced by "placeholder" characters, the text is split into sentences, and the placeholders are then replaced by the original characters.
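
To make the idea concrete, here is a minimal sketch of the mask, split, and restore steps. It is not Scalpel's actual code or rule set: the `PlaceholderSplitter` module, the placeholder character, and the tiny abbreviation list are all assumptions chosen for illustration.

```ruby
# Illustrative sketch only; Scalpel's real rules are more extensive.
module PlaceholderSplitter
  # Assumed placeholder: a private-use code point unlikely to occur in real text.
  PLACEHOLDER = "\u{E000}"

  # Hypothetical, deliberately short list of abbreviations whose
  # trailing period does not end a sentence.
  ABBREVIATIONS = %w[Dr Mr Mrs Ms Prof etc].freeze

  # A period directly after one of the abbreviations above.
  ABBREVIATION_PERIOD = /\b(#{ABBREVIATIONS.join('|')})\./
  # A period between two digits, as in "3.50".
  DECIMAL_PERIOD = /(?<=\d)\.(?=\d)/

  def self.cut(text)
    # 1. Mask the periods that do not end a sentence.
    masked = text
             .gsub(ABBREVIATION_PERIOD) { "#{Regexp.last_match(1)}#{PLACEHOLDER}" }
             .gsub(DECIMAL_PERIOD, PLACEHOLDER)

    # 2. Split on whitespace that follows the remaining end punctuation.
    sentences = masked.split(/(?<=[.!?])\s+/)

    # 3. Put the original periods back.
    sentences.map { |sentence| sentence.gsub(PLACEHOLDER, '.') }
  end
end

PlaceholderSplitter.cut("Dr. Smith paid 3.50 dollars. He left early.")
# => ["Dr. Smith paid 3.50 dollars.", "He left early."]
```

Because the non-terminal periods are masked first, the split itself can stay a single simple pattern. The sketch has an obvious limitation: an abbreviation that really does end a sentence (for example a trailing "etc.") is masked too, so a practical rule set needs more context than this.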

**Usage**

Install the gem with `gem install scalpel`, then require it and pass your text to `Scalpel.cut`:

```ruby
require 'scalpel'
Scalpel.cut("some text")
```

**Contributing**

Feel free to fork the project and send me a pull request!