https://github.com/molybdenum-99/mormor
Morfologik dictionaries client in pure Ruby: POS tagging & spellcheck
- Host: GitHub
- URL: https://github.com/molybdenum-99/mormor
- Owner: molybdenum-99
- License: bsd-3-clause
- Created: 2019-06-21T16:13:41.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-01-21T13:22:34.000Z (over 3 years ago)
- Last Synced: 2024-04-25T04:02:44.500Z (about 2 years ago)
- Topics: morphology, part-of-speech-tagger, ruby
- Language: Ruby
- Size: 18.5 MB
- Stars: 6
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: Changelog.md
- License: LICENSE.txt
# MorMor
**MorMor** is a pure Ruby [morfologik](https://github.com/morfologik/morfologik-stemming) dictionary client that can be used for POS (part of speech) tagging and simplistic spellchecking. The distinguishing feature of the _Morfologik_ format is that it is the primary dictionary format for [LanguageTool](https://github.com/languagetool-org/languagetool), so a lot of ready-made, high-quality dictionaries exist.
## Features/Problems
* **No dependencies¹, pure Ruby**
* **Fast**: I don't have any detailed numbers, but a naive test on my laptop shows 3 million lookups/second on a very large dictionary (Polish, several million word forms).
* Relatively **memory-efficient**: A typical dictionary file is 1-3 MB; mormor just loads it into memory as bytes (i.e. each byte becomes a Ruby Integer), and that is all the memory it needs.
* **Dictionaries** for a lot of languages already exist: unlike with your typical POS tagger, the usage instructions do not start with "First, take your corpora and train the tagger as you please" (see the "Dictionaries" section).
* At the moment, it is just a **naive** port of the original Morfologik Java code, but it works with all the dictionaries I could find:
  * Of the possible dictionary formats, only FSA5 and CFSA2 are implemented (not CFSA);
  * Of the possible dictionary "encoders", only "SUFFIX" and "PREFIX" are implemented;
* No tests/specs, but it works (and has been checked thoroughly against existing dictionaries); TBH, the original Morfologik doesn't have many, either;
* Morfologik's spellchecker suggestions/candidates are **not** ported, so mormor can be used only for "sanity" spellchecking ("this word is/is not in the dictionary").

¹The only runtime dependency is [backports](https://github.com/marcandre/backports), and that's only because I am too fond of modern Ruby features to sacrifice them to the "no-dependencies" god.
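
To illustrate the memory-efficiency point above, here is a minimal sketch of reading a binary file into an array of byte values using only core Ruby. `load_bytes` is a hypothetical helper written for this illustration; it is not mormor's API and not necessarily how mormor does it internally.

```ruby
# Hypothetical helper (not part of mormor): read a binary file and
# return its contents as an Array of Integers, one Integer per byte.
def load_bytes(path)
  # File.binread reads the whole file in binary mode, without any
  # encoding conversion; String#bytes turns it into Integers (0..255).
  File.binread(path).bytes
end

# Usage, assuming a dictionary file exists at this path:
#   bytes = load_bytes('path/to/polish.dict')
#   bytes.size # number of bytes in the file
```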
## Usage
0. Install the `mormor` gem (via bundler or just `[sudo] gem install mormor`)
1. Get a dictionary for your language (see the "Dictionaries" section below)
2. Now...
```ruby
require 'mormor'
dictionary = MorMor::Dictionary.new('path/to/english')
dictionary.lookup('meowing')
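# => [#<struct ...>] (one entry: base form plus tags)
dictionary.lookup('barks')
# => [#<struct ...>,
#     #<struct ...>] (two entries: the verb and the noun)
dictionary.lookup('borogoves')
# => nil
dictionary = MorMor::Dictionary.new('path/to/ukrainian')
dictionary.lookup("солов'їна")
# => [#<struct ...>,
#     #<struct ...>]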
```
`Dictionary#lookup` returns an array of structs that describe all possible base forms together with their part-of-speech/word-form tags. (For example, "barks" could be the third-person form of the verb "to bark", or the plural form of the noun "bark".) For a word that is not in the dictionary, it returns `nil`.
Tags depend on the particular dictionary used and are typically documented in free form alongside the dictionaries.
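
The "sanity" spellchecking mentioned in the features list can be sketched on top of `lookup`. The helpers below are hypothetical (they are not part of mormor) and assume only what the example above shows: `lookup` returns an array for a known word and `nil` for an unknown one. An empty array is also treated as "unknown", as a precaution.

```ruby
# Hypothetical helpers (not part of mormor). They work with any object
# that responds to #lookup(word) the way the example above does.

# True if the dictionary has at least one entry for the word.
def known_word?(dictionary, word)
  result = dictionary.lookup(word)
  !(result.nil? || result.empty?)
end

# Returns the words from the list that the dictionary does not know.
def unknown_words(dictionary, words)
  words.reject { |word| known_word?(dictionary, word) }
end

# Usage, assuming `dictionary` is a MorMor::Dictionary:
#   unknown_words(dictionary, %w[meowing barks borogoves])
```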
## Dictionaries
A lot of dictionaries in Morfologik format can be found in [LanguageTool's repo](https://github.com/languagetool-org/languagetool). For example, the Polish dictionary is in [`languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/`](https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl).
You need two files from there:
* `polish.dict` is the dictionary itself (a binary finite-state automaton)
* `polish.info` is the dictionary metadata

To use the Polish dictionary with mormor, place both files in the same folder, and then:
```ruby
pl = MorMor::Dictionary.new('path/to/that/folder/polish') # without extension
pl.lookup('świetnie')
```
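
Since both the `.dict` and the `.info` file have to sit next to each other, it may help to check for them before creating the dictionary. The sketch below uses only core Ruby; `missing_dictionary_files` and `DICTIONARY_EXTENSIONS` are hypothetical names introduced for this illustration, not part of mormor.

```ruby
# Hypothetical helper (not part of mormor): given the extension-less
# base path that would be passed to MorMor::Dictionary.new, return
# the required files that are missing.
DICTIONARY_EXTENSIONS = %w[.dict .info].freeze

def missing_dictionary_files(base_path)
  DICTIONARY_EXTENSIONS
    .map { |ext| "#{base_path}#{ext}" }  # e.g. ".../polish.dict"
    .reject { |file| File.file?(file) }  # keep only the missing ones
end

# Usage:
#   missing = missing_dictionary_files('path/to/that/folder/polish')
#   warn "Missing: #{missing.join(', ')}" unless missing.empty?
```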
You may also be interested in the `tagset.txt` file in the same folder, which explains all POS/form tags in natural language (in Polish, in this case).
Sometimes (for example, for German and Ukrainian), the LanguageTool repo contains not the dictionary itself, but a link to another repo/site where it can be downloaded.
Please **carefully consider** dictionary licenses when using them!
> **Note:** the mormor repo contains copies of dictionary files from LanguageTool and the projects it refers to, but they are **not** part of the gem distribution and are used only for testing parser/lookup correctness and for demonstration purposes.
## License and credits
Most of the credit for the algorithms and the original code belongs to the original [Morfologik](https://github.com/morfologik/morfologik-stemming) authors, and to the authors of the papers they based their work on.
Ruby version is done by [Victor Shepelev](https://zverok.github.io).
The license is BSD, the same as the original Morfologik.