https://github.com/OCannings/tf-idf

tf-idf elixir

Host: GitHub
URL: https://github.com/OCannings/tf-idf
Owner: OCannings
Created: 2015-08-28T07:47:59.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2020-03-03T16:48:13.000Z (over 5 years ago)
Last Synced: 2024-10-06T08:07:39.918Z (about 1 year ago)
Language: Elixir
Size: 155 KB
Stars: 17
Watchers: 1
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

fucking-awesome-elixir - tfidf - An Elixir implementation of term frequency–inverse document frequency. (Algorithms and Data structures)
freaking_awesome_elixir - Elixir - An Elixir implementation of term frequency–inverse document frequency. (Algorithms and Data structures)
awesome-elixir - tfidf - An Elixir implementation of term frequency–inverse document frequency. (Algorithms and Data structures)

README

![Travis CI Build Status](https://travis-ci.org/OCannings/tf-idf.svg?branch=master)

# Tfidf

An Elixir implementation of tf-idf

[Based on the blog post by Steven Loria](http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/)

## What is tf-idf?

> tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

[tf-idf on Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
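
As a rough illustration of the statistic, the sketch below computes a score by hand. It is not the package's source code: the module name `TfidfSketch` is hypothetical, and the idf form `log(N / (1 + n))` is an assumption based on the approach in the linked blog post. It is consistent with the sample outputs shown under Usage.

```elixir
defmodule TfidfSketch do
  # Illustrative only: NOT the Tfidf package's implementation.

  # Term frequency: occurrences of `word` divided by the number of tokens.
  def tf(word, tokens) do
    Enum.count(tokens, &(&1 == word)) / length(tokens)
  end

  # Inverse document frequency (assumed form): log(N / (1 + n)), where N is
  # the number of documents and n the number of documents containing `word`.
  def idf(word, corpus) do
    containing = Enum.count(corpus, &(word in &1))
    :math.log(length(corpus) / (1 + containing))
  end

  # tf-idf is the product of the two.
  def tfidf(word, tokens, corpus), do: tf(word, tokens) * idf(word, corpus)
end

corpus = [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
score = TfidfSketch.tfidf("dog", ["nice", "dog", "dog"], corpus)
```

With these inputs, tf is 2/3 and idf is log(4/3), so `score` comes out at roughly 0.1918.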

## Installation

```elixir
# mix.exs
defp deps do
  [{:tfidf, "~> 0.1.0"}]
end
```

## Usage

### Tfidf.calculate(word, text, corpus, tokenize_fn \\\ &tokenize(&1))

Calculates the tf-idf for a given word within a text and a corpus (list) of texts.

```elixir
iex> Tfidf.calculate("dog", "nice dog dog", ["dog hat", "dog", "cat mat", "duck"])
0.19178804830118723
```

An optional tokenizer function can be passed as the last argument to replace the default tokenizer:

```elixir
iex> Tfidf.calculate("dog", "nice,dog,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
0.19178804830118723
```
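
A tokenizer is just a one-argument function from a string to a list of tokens, so a named function can be passed as well. The sketch below shows one possible tokenizer that lowercases the text and strips punctuation. `MyTokenizer` is a hypothetical helper, not part of the package.

```elixir
defmodule MyTokenizer do
  # Hypothetical helper, not part of the Tfidf package.
  # Lowercases the text, then splits on any run of characters that are
  # neither letters nor digits, dropping empty tokens.
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end
end

tokens = MyTokenizer.tokenize("Nice dog, DOG!")
# tokens is ["nice", "dog", "dog"]
```

Judging by the example above, it could then be passed as `&MyTokenizer.tokenize/1` in the last argument of `Tfidf.calculate/4`.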

=====

### Tfidf.calculate(word, tokenized_text, corpus)

Calculates the tf-idf for a given word within a pre-tokenized list and a corpus comprised of pre-tokenized lists.

```elixir
iex> Tfidf.calculate("dog", ["nice", "dog", "dog"], [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]])
0.19178804830118723
```
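
When many words are scored against the same corpus, it may be convenient to tokenize the corpus once and reuse the resulting lists with this variant. A minimal sketch, assuming whitespace splitting via `String.split/1` as a stand-in (the package's default tokenizer may behave differently):

```elixir
# Raw corpus of strings, as in the earlier examples.
corpus = ["dog hat", "dog", "cat mat", "duck"]

# Tokenize each document once. String.split/1 splits on whitespace and is
# only an assumed stand-in for whatever tokenizer suits your data.
tokenized_corpus = Enum.map(corpus, &String.split/1)
# tokenized_corpus is [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
```

The resulting list of lists has the same shape as the corpus argument in the example above.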

=====

### Tfidf.calculate_all(text, corpus, tokenize_fn \\\ &tokenize(&1)) 

Calculates the tf-idf for all words in a given text and returns a list of `{word, score}` tuples.

```elixir
iex> Tfidf.calculate_all("nice dog", ["dog hat", "dog", "cat mat", "duck"])
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
```

As with `Tfidf.calculate/4`, an optional tokenizer function can be passed as the last argument. It is used in place of the default tokenizer.

```elixir
iex> Tfidf.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
```
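
Because the result is a plain list of tuples, standard `Enum` functions can rank the words. A minimal sketch, using the sample output above as literal data (no call into the package is made here):

```elixir
# Sample output copied from the calculate_all example.
scores = [{"dog", 0.14384103622589042}, {"nice", 0.6931471805599453}]

# Sort by score, highest first, and keep the single most important word.
top =
  scores
  |> Enum.sort_by(fn {_word, score} -> score end, &>=/2)
  |> Enum.take(1)
# top is [{"nice", 0.6931471805599453}]
```

"nice" ranks above "dog" here because it appears in none of the corpus documents, while "dog" appears in two of the four.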
