https://github.com/liufuyang/rust-nb
- Host: GitHub
- URL: https://github.com/liufuyang/rust-nb
- Owner: liufuyang
- License: mit
- Created: 2018-11-03T10:03:19.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-03-25T13:24:04.000Z (about 5 years ago)
- Last Synced: 2024-12-25T06:54:07.120Z (4 months ago)
- Language: Rust
- Size: 9.47 MB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
A simple Naive Bayes Model in Rust.
The basic idea is from [blayze](https://github.com/Tradeshift/blayze), but `rust-nb` is a Rust implementation and uses a simple Rust `HashMap` to keep the model in memory. Later it should be fairly simple to enhance it with an on-disk key-value store so the model can persist.
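Purely as an illustration of that idea (the trait and type names below are hypothetical, not `rust_nb`'s actual API), the in-memory `HashMap` counts could sit behind a small storage trait, so that a disk-backed key-value store can later be dropped in without changing the model logic:

```rust
use std::collections::HashMap;

// Hypothetical storage abstraction, only to illustrate the idea above;
// rust_nb's real internals may look different.
trait CountStore {
    fn get(&self, key: &str) -> f64;
    fn add(&mut self, key: &str, delta: f64);
}

// The simple in-memory variant this README describes: counts kept in a HashMap.
struct HashMapStore {
    counts: HashMap<String, f64>,
}

impl CountStore for HashMapStore {
    fn get(&self, key: &str) -> f64 {
        self.counts.get(key).copied().unwrap_or(0.0)
    }
    fn add(&mut self, key: &str, delta: f64) {
        *self.counts.entry(key.to_owned()).or_insert(0.0) += delta;
    }
}

// A persistent variant could implement the same trait on top of an
// on-disk key-value store, so the model would persist across restarts.
```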
## How To Use
Simply check out this repo and run some of the examples locally:
```
git clone git@github.com:liufuyang/rust-nb.git
cd rust-nb

cargo run --example spam
# or run a more complex example, use --release to speed up train/test process
cargo run --example 20newsgroup_stopwords --release
```

You can then modify the examples in the `examples` folder and perhaps build your own models from there.

Or you can use this package in your own application by adding it to your `Cargo.toml`:
```
[dependencies]
...
rust_nb = "0.1.0"
```

Then make a `main` function like the one below. It shows how a simple email spam model might look when you train it and predict with it.
```rust
extern crate rust_nb;

use rust_nb::{Feature, FeatureType, Model};

fn main() {
let mut model = Model::new();

let input_train = vec![
(
"spam".to_owned(),
vec![
Feature {
feature_type: FeatureType::Text,
name: "email.body".to_owned(),
value: "Good day dear beneficiary. This is Secretary to president of Benin republic is writing this email ... heritage, tax, dollars, money, credit card...".to_owned(),
},
Feature {
feature_type: FeatureType::Category,
name: "email.domain".to_owned(),
value: "evil.com".to_owned(),
},
Feature {
feature_type: FeatureType::Gaussian,
name: "email.n_words".to_owned(),
value: "482".to_owned(),
},
],
),
(
"not spam".to_owned(),
vec![
Feature {
feature_type: FeatureType::Text,
name: "email.body".to_owned(),
value: "Hey bro, how's work these days, wanna join me for hotpot next week?".to_owned(),
},
Feature {
feature_type: FeatureType::Category,
name: "email.domain".to_owned(),
value: "gmail.com".to_owned(),
},
Feature {
feature_type: FeatureType::Gaussian,
name: "email.n_words".to_owned(),
value: "42".to_owned(),
},
],
),
];

model.train("Spam checker", &input_train);
// test example 1
let result = model.predict(
"Spam checker",
&vec![
Feature {
feature_type: FeatureType::Text,
name: "email.body".to_owned(),
value: "Hey bro, This is Secretary to president want to give you some money. Please give me your credit card number ..."
.to_owned(),
},
Feature {
feature_type: FeatureType::Category,
name: "email.domain".to_owned(),
value: "example.com".to_owned(),
},
Feature {
feature_type: FeatureType::Gaussian,
name: "email.n_words".to_owned(),
value: "288".to_owned(),
},
],
);

println!("{:?}\n", result);
assert!(result.get("spam").unwrap().abs() > 0.9);
// result will be:
// {"not spam": 0.04228956359881729, "spam": 0.9577104364011828}// test example 2
let result = model.predict(
"Spam checker",
&vec![
Feature {
feature_type: FeatureType::Text,
name: "email.body".to_owned(),
value: "Hey bro, hotpot again?".to_owned(),
},
Feature {
feature_type: FeatureType::Category,
name: "email.domain".to_owned(),
value: "gmail.com".to_owned(),
},
Feature {
feature_type: FeatureType::Gaussian,
name: "email.n_words".to_owned(),
value: "10".to_owned(),
},
],
);

println!("{:?}\n", result);
assert!(result.get("not spam").unwrap().abs() > 0.9);
// result will be:
// {"spam": 0.03786816269284711, "not spam": 0.9621318373071529}
}
```

## About Naive Bayes Model (and how to understand the code)
First, let's take a look at the Bayes equations for only two classes and a single feature `x`:
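$$
p(c_1 \mid x) = \frac{p(x \mid c_1)\,p(c_1)}{p(x)}
\qquad
p(c_2 \mid x) = \frac{p(x \mid c_2)\,p(c_2)}{p(x)}
$$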
As we can see, the denominator is the same for the probabilities of both class 1 and class 2 given input x (and the two probabilities sum to 1). Thus we can focus on calculating only the numerator part for each class, and then normalize them all at the end to get the probability of each class prediction. This also generalizes to any number of classes, so for class `c_n` we write:
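$$
p(c_n \mid x) \;\Leftarrow\; p(x \mid c_n)\, p(c_n)
$$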
Note that here we use the `<=` notation, meaning we can infer `p(c_n | x)` from the value on the right later on, after we have that value for every class index `n`. Now expand this to the situation where we have multiple features, with the feature index noted as `i`, and let `X = x_1, x_2, ..., x_i`; then we have:
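$$
p(c_n \mid X) \;\Leftarrow\; p(X \mid c_n)\, p(c_n)
$$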
In the "Naive" way of thinking, each feature appearance `x_i` is independent, so we can write:
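$$
p(c_n \mid X) \;\Leftarrow\; p(c_n)\,\prod_i p(x_i \mid c_n)
$$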
## Feature types: Multinomial and Categorical
Currently we support two feature types:
* Categorical: each `x_i` in the equations above is a distinct value.
* Multinomial (a.k.a. Text feature): the same `x_i` in the equations above can appear more than once. For example, when counting words to predict a document's class, a word `apple` as `x_i` can appear multiple times; let's denote its count as `t_i` (in the code it is called `inputFeatureCounts`).

One can also think of a `Categorical` feature as a `Multinomial` feature where every `t_i` is 1.
So for now we just look at the equations for `Multinomial`. Treating each `x_i` as a unique word with count `t_i`, the equation becomes:
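$$
p(c_n \mid X) \;\Leftarrow\; p(c_n)\,\prod_i p(x_i \mid c_n)^{t_i}
$$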
This involves many multiplications of values less than one. To prevent the number from getting too small to be represented as a double in the computer, we can calculate the log value of each side instead:
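$$
\log p(c_n \mid X) \;\Leftarrow\; \log\Bigl(p(c_n)\,\prod_i p(x_i \mid c_n)^{t_i}\Bigr)
$$

or

$$
\log p(c_n \mid X) \;\Leftarrow\; \log p(c_n) + \sum_i t_i \,\log p(x_i \mid c_n)
$$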
To calculate the priors `p(c_n)` and the conditional probabilities `p(x_i | c_n)`, we use the counts and the pseudocount `epsilon` defined below:
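$$
p(c_n) = \frac{N_{c_n}}{N}
\qquad
p(x_i \mid c_n) = \frac{\mathrm{count}(x_i, c_n) + \epsilon}{\mathrm{count}(c_n) + \epsilon\,|V|}
$$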
So in the end these are the quantities we need to calculate during training and predicting (a small sketch combining them follows the list):

* During training and predicting, save or access these parameters:
    * `N_cn`: prior count of class `c_n`. Calculated via the function `logPrior` in the code.
    * `N`: sum of the prior counts `N_cn` over all classes `c_n`. Also calculated via the function `logPrior` in the code.
    * `count(x_i, c_n)`: count of word/feature `i`'s appearances in class `c_n`. In the code it is called `countFeatureAppearsInOutcome`.
    * `count(c_n)`: total count of all words/features appearing in class `c_n`. In the code it is called `totalFeatureCountInOutcome`.
    * `|V|`: count of the unique words/features (the vocabulary) across all classes. In the code it is called `numOfUniqueFeaturesSeen`.
* Only during predicting, also calculate:
    * `t_i`: number of times word/feature `i` appears in the incoming data for prediction. In the code it is called `inputFeatureCounts`.
* Constant:
    * `epsilon`: a pseudocount, so that no probability is ever exactly zero. By default we set it to 1. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one.
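To tie the list together, here is a minimal, self-contained sketch of how these quantities combine into a smoothed log score for one class. It is only an illustration of the formulas above, not `rust_nb`'s actual internal code; the function and parameter names are made up for the example (they mirror the names in the list).

```rust
use std::collections::HashMap;

// Illustration only, not rust_nb's internal code: combine the counts listed
// above into the smoothed log score of one class c_n for a given input, i.e.
//   log(N_cn / N) + sum_i t_i * log((count(x_i, c_n) + eps) / (count(c_n) + eps * |V|))
fn smoothed_log_score(
    n_cn: f64,                                                // N_cn: prior count of class c_n
    n_total: f64,                                             // N: sum of prior counts over all classes
    input_feature_counts: &HashMap<String, f64>,              // t_i for each word/feature in the input
    count_feature_appears_in_outcome: &HashMap<String, f64>,  // count(x_i, c_n)
    total_feature_count_in_outcome: f64,                      // count(c_n)
    num_of_unique_features_seen: f64,                         // |V|
    epsilon: f64,                                             // pseudocount (Laplace smoothing when 1.0)
) -> f64 {
    let log_prior = (n_cn / n_total).ln();
    let log_likelihood: f64 = input_feature_counts
        .iter()
        .map(|(word, t_i)| {
            let count = count_feature_appears_in_outcome
                .get(word)
                .copied()
                .unwrap_or(0.0);
            let p = (count + epsilon)
                / (total_feature_count_in_outcome + epsilon * num_of_unique_features_seen);
            t_i * p.ln()
        })
        .sum();
    log_prior + log_likelihood
}
```

Computing this score for every class, exponentiating, and normalizing then gives the per-class probabilities that `predict` returns, like the `{"spam": ..., "not spam": ...}` maps in the example above.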