Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/73rhodes/dclassify
Optimized Naive Bayesian classifier for NodeJS
- Host: GitHub
- URL: https://github.com/73rhodes/dclassify
- Owner: 73rhodes
- License: mit
- Created: 2012-07-27T14:49:38.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2023-07-17T02:55:34.000Z (over 1 year ago)
- Last Synced: 2024-09-26T07:15:02.156Z (about 2 months ago)
- Language: JavaScript
- Size: 74.2 KB
- Stars: 224
- Watchers: 17
- Forks: 56
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
dclassify
=========

[![73rhodes](https://circleci.com/gh/73rhodes/dclassify.svg?style=shield)](https://app.circleci.com/pipelines/github/73rhodes/dclassify)
[![npm version](https://badge.fury.io/js/dclassify.svg)](http://badge.fury.io/js/dclassify)
[![DeepScan grade](https://deepscan.io/api/projects/2827/branches/20464/badge/grade.svg)](https://deepscan.io/dashboard#view=project&pid=2827&bid=20464)

`dclassify` is an optimized Naive Bayesian classifier for NodeJS that goes one step further
than your ordinary binary classifier by introducing a unique "probability of absence" feature.
In some test cases this has led to a ~10% improvement over conventional binary classifiers.
It is designed for classifying items based on a limited set of characteristics (aka "tokens")
rather than for general language processing.

Intro to Machine Learning with Node.JS
--------------------------------------
View [slides](http://73rhodes.github.io/talks/MachineLearning/) from a talk presented
at [OttawaJS](http://ottawajs.org).

Optimization
------------
The "probability of absence" optimization can be enabled with the `applyInverse` option. When
this option is set to `true`, dclassify will calculate probabilities based on the present
tokens as well as the inverse - a probability of absence for tokens that are not present. This is
unconventional but can produce better results when classifying items based on a limited set
of characteristics, especially when some of those characteristics are nearly always present
in one of the categories.

Most binary classifiers work by looking for specific tokens to be present. For example, an
email spam filter might categorize emails by looking for words that are considered spam-related.
It cares about the words that are present, but not about words that are absent, because there
are just too many of them.

In other cases we care whether important key ingredients are missing. For example, an animal
without wings is most likely not a bird. And a mobile app without internet connectivity is
most likely not malware. Such "prevalent negatives" can be quite effective if the total set of
tokens is fairly small (say, a few hundred items) and includes such key ingredients. Using the
`applyInverse` option to look for prevalent negatives can significantly improve the results in
such cases.
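To make this concrete, here is a minimal sketch of the scoring idea, assuming per-category token probabilities like those shown in the Probabilities section below. It is illustrative only, not dclassify's internal code, and `scoreWithAbsence` is a hypothetical name.

``` javascript
// Minimal sketch of "probability of absence" scoring (illustrative only;
// scoreWithAbsence is a hypothetical name, not part of dclassify's API).
function scoreWithAbsence(docTokens, tokenProbs) {
    var score = 1;
    Object.keys(tokenProbs).forEach(function (token) {
        if (docTokens.indexOf(token) !== -1) {
            score *= tokenProbs[token];       // present: P(token | category)
        } else {
            score *= 1 - tokenProbs[token];   // absent: 1 - P(token | category)
        }
    });
    return score;
}

// P(token | bad) from the worked example later in this README
var badProbs = { a: 1, b: 2 / 3, c: 2 / 3, d: 1 / 3, e: 1 / 3 };

// A document lacking 'a' scores 0 for 'bad', because every 'bad'
// training document contained 'a' (a "prevalent negative").
console.log(scoreWithAbsence(['b', 'd', 'e'], badProbs)); // 0
```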
Installation
------------
`npm install dclassify`

Usage
-----
1. Require the classifier and reference its utilities.
1. Create Document instances with names and an array of tokens representing the document's characteristics.
1. Add document instances to a DataSet using appropriate categories.
1. Create and train a classifier using the DataSet.
1. Test the classifier using a test Document.

``` javascript
// module dependencies
var dclassify = require('dclassify');

// Utilities provided by dclassify
var Classifier = dclassify.Classifier;
var DataSet = dclassify.DataSet;
var Document = dclassify.Document;
// create some 'bad' test items (name, array of characteristics)
var item1 = new Document('item1', ['a','b','c']);
var item2 = new Document('item2', ['a','b','c']);
var item3 = new Document('item3', ['a','d','e']);

// create some 'good' items (name, characteristics)
var itemA = new Document('itemA', ['c', 'd']);
var itemB = new Document('itemB', ['e']);
var itemC = new Document('itemC', ['b','d','e']);

// create a DataSet and add test items to appropriate categories
// this is 'curated' data for training
var data = new DataSet();
data.add('bad', [item1, item2, item3]);
data.add('good', [itemA, itemB, itemC]);
// an optimization for working with small vocabularies
var options = {
    applyInverse: true
};
// create a classifier
var classifier = new Classifier(options);
// train the classifier
classifier.train(data);
console.log('Classifier trained.');
console.log(JSON.stringify(classifier.probabilities, null, 4));
// test the classifier on a new test item
var testDoc = new Document('testDoc', ['b','d', 'e']);
var result1 = classifier.classify(testDoc);
console.log(result1);
```

The output of the above will be:
```javascript
{
    "category": "good",
    "probability": 1,
    "timesMoreLikely": "Infinity",
    "secondCategory": "bad",
    "probabilities": [
        { "category": "good", "probability": 0.09876543209876543 },
        { "category": "bad", "probability": 0 }
    ]
}
```

Probabilities
-------------

The probabilities are calculated like this:
``` json
{
    "bad": {
        "a": 1,
        "b": 0.6666666666666666,
        "c": 0.6666666666666666,
        "d": 0.3333333333333333,
        "e": 0.3333333333333333
    },
    "good": {
        "a": 0,
        "b": 0.3333333333333333,
        "c": 0.3333333333333333,
        "d": 0.6666666666666666,
        "e": 0.6666666666666666
    }
}
```
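Each value is the fraction of training documents in the category that contain the token: "a" appears in all three 'bad' documents (3/3 = 1), "b" in two of them (2/3), and so on. The sketch below reproduces the table from the training data; it is illustrative only, not dclassify's internal code, and `tokenProbabilities` is a hypothetical helper.

``` javascript
// Reproduce the token-probability table from the training data
// (illustrative only; tokenProbabilities is a hypothetical helper).
function tokenProbabilities(docs, vocabulary) {
    var probs = {};
    vocabulary.forEach(function (token) {
        var count = docs.filter(function (tokens) {
            return tokens.indexOf(token) !== -1;
        }).length;
        probs[token] = count / docs.length;
    });
    return probs;
}

var vocabulary = ['a', 'b', 'c', 'd', 'e'];
var badDocs  = [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'd', 'e']];
var goodDocs = [['c', 'd'], ['e'], ['b', 'd', 'e']];

console.log(tokenProbabilities(badDocs, vocabulary));
// { a: 1, b: 0.666..., c: 0.666..., d: 0.333..., e: 0.333... }
console.log(tokenProbabilities(goodDocs, vocabulary));
// { a: 0, b: 0.333..., c: 0.333..., d: 0.666..., e: 0.666... }
```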
Output
------

Standard results (without the `applyInverse: true` option) would look like this:
``` json
{
    "category": "good",
    "probability": 0.6666666666666666,
    "timesMoreLikely": 2,
    "secondCategory": "bad",
    "probabilities": [
        { "category": "good", "probability": 0.14814814814814814 },
        { "category": "bad", "probability": 0.07407407407407407 }
    ]
}
```

If you use the `applyInverse: true` option, the results are much more emphatic, because training
indicates bad items never lack the "a" token.

``` json
{
    "category": "good",
    "probability": 1,
    "timesMoreLikely": "Infinity",
    "secondCategory": "bad",
    "probabilities": [
        { "category": "good", "probability": 0.09876543209876543 },
        { "category": "bad", "probability": 0 }
    ]
}
```
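
Both outputs can be checked by hand against the token probabilities above. Without `applyInverse`, only the present tokens (b, d, e) are multiplied; with it, each absent token (a, c) contributes its probability of absence. The arithmetic below is illustrative and matches the results shown:

``` javascript
// Standard scoring: multiply P(token | category) for present tokens (b, d, e).
var goodStd = (1 / 3) * (2 / 3) * (2 / 3); // 4/27 ≈ 0.1481
var badStd  = (2 / 3) * (1 / 3) * (1 / 3); // 2/27 ≈ 0.0741
console.log(goodStd / badStd);             // 2  ("timesMoreLikely")

// applyInverse scoring: absent tokens a and c contribute 1 - P(token | category).
var goodInv = goodStd * (1 - 0) * (1 - 1 / 3); // 8/81 ≈ 0.0988
var badInv  = badStd  * (1 - 1) * (1 - 2 / 3); // 0, since P('a' | bad) = 1
console.log(goodInv, badInv); // 0.09876543209876543, 0
```

The zero factor from the missing "a" token is what makes the `applyInverse` result so emphatic.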