https://github.com/matiasinsaurralde/language
Experiments with letter frequencies, n-grams and language detection.
https://github.com/matiasinsaurralde/language
Last synced: 3 months ago
JSON representation
Experiments with letter frequencies, n-grams and language detection.
- Host: GitHub
- URL: https://github.com/matiasinsaurralde/language
- Owner: matiasinsaurralde
- License: mit
- Created: 2013-03-30T23:50:30.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2016-03-13T10:00:55.000Z (over 9 years ago)
- Last Synced: 2025-04-10T19:08:31.984Z (6 months ago)
- Language: Ruby
- Homepage: http://rlanguages.herokuapp.com/
- Size: 25.4 KB
- Stars: 23
- Watchers: 2
- Forks: 3
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# LanguageThe basic idea of this library is to detect languages by computing cosine similarity (unigrams, bigrams, etc.) between models and given texts.
This is a very popular approach based on Salton & McGill model. Also you may take a look at "Foundations of statistical natural language processing" by Schutze.## Models
This library ships with models for some common languages (currently english, spanish, italian, french and guarani). These models were generated from 200 books and 2000 Wikipedia articles for each language. You may generate your own models with the scripts (...look at the 'scripts' folder).
## IRB example
```ruby
irb(main):001:0> require './language'
irb(main):002:0> include Language
irb(main):003:0> example = Text.new( 'this is a sample sentence' )
irb(main):004:0> example.language_detection().first
=> [:english, 54.26]irb(main):005:0> another_example = Text.new( 'peteĩ tapiti opopo tapepe' )
irb(main):006:0> another_example.language_detection().first
=> [:guarani, 59.22]
```## Demo
http://rlanguages.herokuapp.com/
(built with [sinatra] (http://www.sinatrarb.com/), [jquery] (http://jquery.com/) and [text-effects] (http://www.jsplugins.com/Scripts/Plugins/View/Jquery-Text-Effects/))
## TODO
* Benchmarks (with different n-gram depths).
* Support for more languages.
* Multilingual processing (for spanglish, portuñol and jopará texts).## License
[MIT](https://github.com/matiasinsaurralde/language/blob/master/LICENSE)