https://github.com/tomdionysus/stringtree-ruby
StringTree is a fast forward-only matching tokeniser for ruby
- Host: GitHub
- URL: https://github.com/tomdionysus/stringtree-ruby
- Owner: tomdionysus
- License: mit
- Created: 2012-10-22T23:02:25.000Z (about 12 years ago)
- Default Branch: master
- Last Pushed: 2015-08-25T21:35:41.000Z (over 9 years ago)
- Last Synced: 2024-11-10T16:53:47.353Z (about 2 months ago)
- Language: Ruby
- Size: 5.27 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
# stringtree-ruby
[![Build Status](https://travis-ci.org/tomdionysus/stringtree-ruby.svg?branch=master)](https://travis-ci.org/tomdionysus/stringtree-ruby)
[![Coverage Status](https://coveralls.io/repos/tomdionysus/stringtree-ruby/badge.svg?branch=master&service=github)](https://coveralls.io/github/tomdionysus/stringtree-ruby?branch=master)
[![Gem Version](https://badge.fury.io/rb/stringtree.svg)](http://badge.fury.io/rb/stringtree)
[![Dependency Status](https://gemnasium.com/tomdionysus/stringtree.svg)](https://gemnasium.com/tomdionysus/stringtree)
[![Gem Downloads](http://ruby-gem-downloads-badge.herokuapp.com/stringtree?color=brightgreen)](http://ruby-gem-downloads-badge.herokuapp.com/stringtree?color=brightgreen)

StringTree is a fast forward-only tokeniser and partial string matcher. That is, it can:
* Load a dictionary of arbitrary size and record count (e.g. an actual dictionary or an English word list), where each record is associated with a key, e.g. a numeric identifier for the word.
* Parse an arbitrary data string in a single pass, finding each instance of every dictionary item and storing its offset and associated key.
* Host a set of strings in such a way as to efficiently match partial input strings against the dictionary.

This has become my 'hello world' over the years with any new language. I use it to get to know a language, as implementing it correctly involves many of the usual concepts needed to get started coding from the hip (syntax, grammar, classes, public/private instance vars, statics, pass-by-value/pass-by-reference, etc.), not to mention the usual code support skills, like how to set up unit tests for this language and environment.
## Installation
```bash
gem install stringtree
```
## Implementation
StringTree is based on multidimensional binary trees. Each node in the tree has the usual left/right references to its children, plus an optional 'down' reference to the root of the binary tree representing the next character in the string from this point. This version also has an 'up' reference on each node, which can be faster for iterating backward through a set of trees.
From a CS point of view, Tree is a specific implementation of an n-dimensional trie.
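The node layout described above can be illustrated with a minimal, self-contained sketch. This is not the gem's source or API: `SketchNode`, `SketchTree`, `add`, `find` and `partials` are hypothetical names chosen for illustration, and the sketch assumes that 'up' points at the node matching the previous character.

```ruby
# Hypothetical sketch: class and method names are illustrative, not the gem's API.
class SketchNode
  attr_accessor :left, :right, :down, :value
  attr_reader :char, :up

  def initialize(char, up)
    @char = char # the character this node matches
    @up = up     # assumed here: the node matching the previous character
  end
end

class SketchTree
  def initialize
    @root = nil
  end

  # Store value under key (a non-empty String).
  def add(key, value)
    raise ArgumentError, 'key must not be empty' if key.empty?

    @root = insert(@root, key.chars, 0, nil, value)
    self
  end

  # Return the value stored under key, or nil.
  def find(key)
    node = locate(key)
    node && node.value
  end

  # Return every stored key that starts with the (non-empty) prefix.
  def partials(prefix)
    node = locate(prefix)
    return [] unless node

    keys = []
    keys << prefix unless node.value.nil?
    collect(node.down, keys)
    keys
  end

  private

  def insert(node, chars, index, up, value)
    char = chars[index]
    node ||= SketchNode.new(char, up)
    if char < node.char
      node.left = insert(node.left, chars, index, up, value)
    elsif char > node.char
      node.right = insert(node.right, chars, index, up, value)
    elsif index == chars.length - 1
      node.value = value
    else
      node.down = insert(node.down, chars, index + 1, node, value)
    end
    node
  end

  # Walk left/right within one character's tree, and down to the next character.
  def locate(key)
    chars = key.chars
    node = @root
    index = 0
    while node && index < chars.length
      char = chars[index]
      if char < node.char
        node = node.left
      elsif char > node.char
        node = node.right
      elsif index == chars.length - 1
        return node
      else
        node = node.down
        index += 1
      end
    end
    nil
  end

  # In-order traversal: left subtree, this node, longer keys below it, right subtree.
  def collect(node, keys)
    return unless node

    collect(node.left, keys)
    keys << key_of(node) unless node.value.nil?
    collect(node.down, keys)
    collect(node.right, keys)
  end

  # Rebuild a key by following 'up' references back to the first character.
  def key_of(node)
    chars = []
    while node
      chars.unshift(node.char)
      node = node.up
    end
    chars.join
  end
end

tree = SketchTree.new
tree.add('car', 1).add('cart', 2).add('cat', 3).add('dog', 4)
tree.find('cart')   # => 2
tree.partials('ca') # => ["car", "cart", "cat"]
```

In this sketch the 'up' reference is what lets `key_of` recover a whole key from a single node, so a prefix search does not need to carry the accumulated string along during traversal.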
## Limitations
* Tree is not intended to be, and should not be used as, a key/value store. There are much faster algorithms for storing such data.
* Tree should not be used in place of regular expressions. Tree is good when the total number and/or size of the tokens to be found is either unknown or dynamic.
* Tree is essentially a byte tokeniser and is case-sensitive. It would be trivial to extend it to search in a case-insensitive manner, however.
## Applications
Tree has the following applications:
* Spellchecking
* Virus scanning
* Partial string matching, e.g. autocomplete
* Tokenizing, e.g. matching bank transaction data to businesses from reference/particular/code fields, etc.
## Environment
The demo is tested and runs under Ruby 1.9.3. It will probably work under earlier and later versions, but this is not guaranteed. The unit tests also require rspec:
```bash
bundle install
```
## Specs
Specs are rspec, as follows:
```bash
rspec
```
## Documentation
Regenerate the documentation with `rake`:
```bash
rake rdoc
```
## Demo
```bash
bundle exec ruby examples/demo.rb
```
This will generate two files:
* hamlet.tokens.txt
* warandpeace.tokens.txt

Each of these contains a set of lines such as:
28: platform 41003 (8)
Where 28 is the offset where the token was found, 'platform' is the token itself, 41003 is the id of the token (in this case the line number of the dictionary file), and (8) is the length of the token.
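To make the line format concrete, here is a minimal sketch that scans a string forward and prints matches in the same shape. It is not the demo's code: it uses nested Hashes rather than the gem's tree, `build_trie` and `each_match` are hypothetical names, and ids are assumed to be 1-based positions in the word list. It also restarts the walk at every offset and reports overlapping tokens, which is simpler than a true single-pass matcher and may differ from what the gem reports.

```ruby
# Hypothetical sketch: a Hash-based trie, not the gem's tree structure.
VALUE = :value # marker key holding the id stored at the end of a token

# Build a nested-Hash trie; each word's id is its 1-based position in the list.
def build_trie(words)
  trie = {}
  words.each_with_index do |word, i|
    node = word.each_char.reduce(trie) { |n, ch| n[ch] ||= {} }
    node[VALUE] = i + 1
  end
  trie
end

# Yield [offset, token, id] for every dictionary token found in data.
def each_match(trie, data)
  return enum_for(:each_match, trie, data) unless block_given?

  chars = data.chars
  chars.each_index do |offset|
    node = trie
    pos = offset
    # Walk forward while the next character continues some token.
    while pos < chars.length && (node = node[chars[pos]])
      pos += 1
      yield offset, chars[offset...pos].join, node[VALUE] if node.key?(VALUE)
    end
  end
end

trie = build_trie(%w[plat platform form])
lines = each_match(trie, 'a platform').map do |offset, token, id|
  "#{offset}: #{token} #{id} (#{token.length})"
end
puts lines
# 2: plat 1 (4)
# 2: platform 2 (8)
# 6: form 3 (4)
```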
The demo code then goes into a console, which will do partial searches within the dictionary. Type a partial word and press enter, and the demo will show all words in the dictionary that start with the partial entered.
Type 'exit' to finish the demo.
## Code of Conduct
The StringTree project is committed to the [Contributor Covenant](http://contributor-covenant.org). Please read [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) before making any contributions or comments.
## References
* http://www.ruby-doc.org
* http://en.wikipedia.org/wiki/Trie