Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/OCannings/tf-idf
tf-idf elixir
- Host: GitHub
- URL: https://github.com/OCannings/tf-idf
- Owner: OCannings
- Created: 2015-08-28T07:47:59.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2020-03-03T16:48:13.000Z (almost 5 years ago)
- Last Synced: 2024-10-06T08:07:39.918Z (4 months ago)
- Language: Elixir
- Size: 155 KB
- Stars: 17
- Watchers: 1
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- fucking-awesome-elixir - tfidf - An Elixir implementation of term frequency–inverse document frequency. (Algorithms and Data structures)
- awesome-elixir - tfidf - An Elixir implementation of term frequency–inverse document frequency. (Algorithms and Data structures)
- freaking_awesome_elixir - Elixir - An Elixir implementation of term frequency–inverse document frequency. (Algorithms and Data structures)
README
![Travis CI Build Status](https://travis-ci.org/OCannings/tf-idf.svg?branch=master)
# Tfidf
An Elixir implementation of tf-idf. [Based on the blog post by Steven Loria](http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/).
## What is tf-idf?
> tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. [tf-idf on Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
## Installation
```elixir
defp deps do
[{:tfidf, "~> 0.1.0"}]
end
```
## Usage
### Tfidf.calculate(word, text, corpus, tokenize_fn \\\ &tokenize(&1))
Calculates the tf-idf for a given word within a text and a corpus (List) of
texts.
```elixir
iex> Tfidf.calculate("dog", "nice dog dog", ["dog hat", "dog", "cat mat", "duck"])
0.19178804830118723
```
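The README does not spell out the formula. Assuming the library follows the definitions in the Steven Loria post it cites (term frequency as the word's share of the text's tokens, and inverse document frequency as the natural log of the corpus size over one plus the number of corpus texts containing the word), the score above can be reproduced by hand. The symbols below are introduced here for illustration only:

```latex
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{|d|},
\qquad
\mathrm{idf}(t, D) = \ln\frac{|D|}{1 + |\{\, d' \in D : t \in d' \,\}|}

% t = "dog", d = "nice dog dog", D = {"dog hat", "dog", "cat mat", "duck"}
\mathrm{tf} = \frac{2}{3},
\qquad
\mathrm{idf} = \ln\frac{4}{1 + 2} = \ln\frac{4}{3} \approx 0.287682

\mathrm{tf} \cdot \mathrm{idf} \approx \frac{2}{3} \times 0.287682 \approx 0.191788
```

The same assumed formula also matches the `calculate_all` results further down: "nice" appears in no corpus text, so its score is (1/2) · ln(4/1) ≈ 0.693147.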
An optional tokenizer function can be passed as the last argument to replace the default tokenizer:
```elixir
iex> Tfidf.calculate("dog", "nice,dog,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
0.19178804830118723
```
### Tfidf.calculate(word, tokenized_text, corpus)
Calculates the tf-idf for a given word within a pre-tokenized text (a list of
tokens) and a corpus made up of pre-tokenized texts.
```elixir
iex> Tfidf.calculate("dog", ["nice", "dog", "dog"], [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]])
0.19178804830118723
```
### Tfidf.calculate_all(text, corpus, tokenize_fn \\\ &tokenize(&1))
Calculates the tf-idf for all words in a given text, returns a list
of `{word, score}` tuples.
```elixir
iex> Tfidf.calculate_all("nice dog", ["dog hat", "dog", "cat mat", "duck"])
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
```
As with `Tfidf.calculate/4`, an optional tokenizer function can be passed
as the last argument. This will be used in place of the default tokenizer.
```elixir
iex> Tfidf.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
```
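To make the calls above easier to follow, here is a minimal standalone sketch of how such functions could be written. It is not the library's source: the module name `TfidfSketch`, the helper names `tf/2` and `idf/2`, the whitespace default tokenizer, and the formula `tf * ln(N / (1 + df))` are all assumptions, chosen because they reproduce the example scores shown in this README.

```elixir
defmodule TfidfSketch do
  # Hypothetical illustration only; names and formula are assumptions,
  # not taken from the Tfidf library's actual implementation.

  # Term frequency: occurrences of `word` divided by the number of tokens.
  # Raises ArithmeticError if `tokens` is empty.
  def tf(word, tokens) do
    Enum.count(tokens, &(&1 == word)) / length(tokens)
  end

  # Inverse document frequency, assumed to be ln(N / (1 + df)), where df is
  # the number of corpus texts that contain `word`.
  def idf(word, tokenized_corpus) do
    containing = Enum.count(tokenized_corpus, &(word in &1))
    :math.log(length(tokenized_corpus) / (1 + containing))
  end

  # Score a single word; `tokenize_fn` turns a string into a list of tokens.
  def calculate(word, text, corpus, tokenize_fn \\ &String.split/1) do
    tokens = tokenize_fn.(text)
    tokenized_corpus = Enum.map(corpus, tokenize_fn)
    tf(word, tokens) * idf(word, tokenized_corpus)
  end

  # Score every distinct word in `text`, in order of first appearance.
  def calculate_all(text, corpus, tokenize_fn \\ &String.split/1) do
    tokens = tokenize_fn.(text)
    tokenized_corpus = Enum.map(corpus, tokenize_fn)

    tokens
    |> Enum.uniq()
    |> Enum.map(fn word -> {word, tf(word, tokens) * idf(word, tokenized_corpus)} end)
  end
end

corpus = ["dog hat", "dog", "cat mat", "duck"]

# Roughly 0.1918, in line with the Tfidf.calculate example above.
IO.inspect(TfidfSketch.calculate("dog", "nice dog dog", corpus))

# A comma tokenizer, passed the same way as in the README examples.
IO.inspect(TfidfSketch.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ",")))
```

The sketch tokenizes the text and corpus once per call and reuses the result for every word, which keeps `calculate_all` from re-tokenizing the corpus for each token.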