https://github.com/abitdodgy/gibran
Gibran is an Elixir natural language processor, and a port of WordsCounted.
- Host: GitHub
- URL: https://github.com/abitdodgy/gibran
- Owner: abitdodgy
- Created: 2015-09-30T00:22:48.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-04-23T19:38:21.000Z (over 7 years ago)
- Last Synced: 2024-08-09T13:43:30.290Z (4 months ago)
- Topics: elixir-lang, natural-language-processing, nlp
- Language: Elixir
- Homepage: http://hexdocs.pm/gibran
- Size: 47.9 KB
- Stars: 65
- Watchers: 5
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- freaking_awesome_elixir - Elixir - Gibran is an Elixir port of [WordsCounted](https://github.com/abitdodgy/words_counted), a natural language processor that extracts useful statistics from text. (Natural Language Processing (NLP))
- fucking-awesome-elixir - gibran - Gibran is an Elixir port of [WordsCounted](https://github.com/abitdodgy/words_counted), a natural language processor that extracts useful statistics from text. (Natural Language Processing (NLP))
- awesome-elixir - gibran - Gibran is an Elixir port of [WordsCounted](https://github.com/abitdodgy/words_counted), a natural language processor that extracts useful statistics from text. (Natural Language Processing (NLP))
README
Gibran
======

> Yesterday is but today's memory, and tomorrow is today's dream.
![Gibran](http://d.gr-assets.com/authors/1353732571p5/6466154.jpg)
[Gibran][2] is an Elixir natural language processor. Lofty goals for Gibran include:
- Metaphone phonetic coding system
- Soundex algorithm
- Porter Stemming algorithm
- String similarity as [described by Simon White](http://www.catalysoft.com/articles/StrikeAMatch.html)

Currently, Gibran ships with the following features:
- Token count, unique token count, and character count
- Average characters per token
- `HashDict`s of tokens and their frequencies, lengths, and densities
- The longest token(s) and their length
- The most frequent token(s) and their frequency
- Unique tokens
- Levenshtein distance algorithm

## Usage
Let's start with something simple.
```elixir
alias Gibran.Tokeniser
alias Gibran.Counter

str = "Yesterday is but today's memory, and tomorrow is today's dream."
Tokeniser.tokenise(str)
# => ["yesterday", "is", "but", "today's", "memory", "and", "tomorrow", "is", "today's", "dream"]

Tokeniser.tokenise(str) |> Counter.uniq_token_count
# => 8
```

By default, Gibran uses the following regular expression to tokenise strings: `~r/[^\p{L}'-]/u`. You can provide your own regular expression through the `pattern` option. You can combine `pattern` with `exclude` to create sophisticated tokenisation strategies.
```elixir
# Exclude tokens shorter than four characters (`str` is defined above).
Tokeniser.tokenise(str, exclude: &(String.length(&1) < 4)) |> Counter.token_count
# => 6
```

The `exclude` option accepts a string, a function, a regular expression, or a list combining any of those types.
```elixir
# Using `exclude` with a function.
Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
# => ["imagination"]

# Using `exclude` with a regular expression.
Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
# => ["foam"]

# Using `exclude` with a string.
Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
# => ["the", "prophet"]

# Using `exclude` with a list of a combination of types.
Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", &(String.ends_with?(&1, "he")), ~r/of/])
# => ["prophet"]
```

Gibran provides a shortcut for working with strings directly (instead of running them through the tokeniser first).
```elixir
Gibran.from_string(str, :token_count, opts: [exclude: &String.length(&1) < 4])
# => 6
```

To avoid inconsistencies that arise from character-casing, Gibran normalises input before applying transformations.
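The tokenisation behaviour described in this section can be approximated with the standard library alone. The sketch below is not Gibran's source code: `TokeniserSketch` is a hypothetical module, and the way it treats a string filter (splitting it on whitespace and excluding each word) is an assumption inferred from the examples above. It lowercases the input, splits on the default pattern, and then rejects tokens that match any `exclude` filter.

```elixir
defmodule TokeniserSketch do
  # Hypothetical illustration only; not part of Gibran.

  # The default pattern quoted in this README: anything that is not a
  # letter, an apostrophe, or a hyphen acts as a separator.
  defp default_pattern, do: ~r/[^\p{L}'-]/u

  def tokenise(input, opts \\ []) do
    pattern = Keyword.get(opts, :pattern, default_pattern())
    exclude = Keyword.get(opts, :exclude, [])

    input
    |> String.downcase()                    # normalise casing first
    |> String.split(pattern, trim: true)    # drop empty strings between separators
    |> Enum.reject(&excluded?(&1, exclude))
  end

  # A list of filters excludes a token if any single filter matches.
  defp excluded?(token, filters) when is_list(filters),
    do: Enum.any?(filters, &excluded?(token, &1))

  defp excluded?(token, filter) when is_function(filter, 1), do: filter.(token)
  defp excluded?(token, %Regex{} = filter), do: Regex.match?(filter, token)

  # Assumption: a string filter is a whitespace-separated list of words.
  defp excluded?(token, filter) when is_binary(filter),
    do: token in String.split(String.downcase(filter))
end
```

With this sketch, `TokeniserSketch.tokenise("Eye of The Prophet", exclude: "eye of")` returns `["the", "prophet"]`, matching the result shown above. Counting is then a matter of `length/1` for the token count and `Enum.uniq/1` followed by `length/1` for the unique token count.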
### Levenshtein distance
Ordinary use:
```elixir
iex(1)> Gibran.Levenshtein.distance("kitten", "sitting")
3
```

The Levenshtein distance for the same string is 0.
```elixir
iex(2)> Gibran.Levenshtein.distance("snail", "snail")
0
```

The Levenshtein distance is case-sensitive.
```elixir
iex(3)> Gibran.Levenshtein.distance("HOUSEBOAT", "houseboat")
9
```

The function can accept charlists as well as strings.
```elixir
iex(4)> Gibran.Levenshtein.distance('jogging', 'logger')
4
```

The `doctests` contain extensive usage examples. Please take a look there for more details.
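For readers curious how the distances in the Levenshtein examples are derived, here is a minimal sketch of the classic row-by-row dynamic programming approach. It is an illustration under stated assumptions, not Gibran's implementation: `LevenshteinSketch` is a hypothetical module, and comparing strings grapheme by grapheme is a choice made here rather than something the README specifies.

```elixir
defmodule LevenshteinSketch do
  # Hypothetical illustration only; not Gibran's own implementation.

  # Strings are compared grapheme by grapheme (an assumption of this sketch).
  def distance(a, b) when is_binary(a) and is_binary(b),
    do: distance(String.graphemes(a), String.graphemes(b))

  # Charlists (and any other lists) are compared element by element.
  def distance(a, b) when is_list(a) and is_list(b) do
    # Row 0: turning the empty prefix of `a` into each prefix of `b`
    # costs one insertion per element.
    first_row = Enum.to_list(0..length(b))

    a
    |> Enum.with_index(1)
    |> Enum.reduce(first_row, fn {x, i}, prev_row -> next_row(x, i, b, prev_row) end)
    |> List.last()
  end

  # Builds row `i` from the previous row. `diag` is the cell up and to the
  # left, `above` the cell directly above, `left` the cell just computed.
  defp next_row(x, i, b, [diag | rest_prev]) do
    {row_reversed, _diag, _left} =
      b
      |> Enum.zip(rest_prev)
      |> Enum.reduce({[i], diag, i}, fn {y, above}, {acc, diag, left} ->
        cost = if x == y, do: 0, else: 1
        value = Enum.min([above + 1, left + 1, diag + cost])
        {[value | acc], above, value}
      end)

    Enum.reverse(row_reversed)
  end
end
```

Because elements are compared with `==`, uppercase and lowercase letters count as different, which is why `"HOUSEBOAT"` and `"houseboat"` are nine substitutions apart. Passing one string and one charlist raises a `FunctionClauseError` in this sketch.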
[1]: https://github.com/abitdodgy/words_counted
[2]: https://en.wikipedia.org/wiki/Kahlil_Gibran