https://github.com/takp/naive-bayes-sample

Infer the category of the document written in Japanese

naive-bayes naive-bayes-algorithm



Host: GitHub
URL: https://github.com/takp/naive-bayes-sample
Owner: takp
Created: 2014-11-21T10:19:17.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2014-11-21T14:43:19.000Z (about 11 years ago)
Last Synced: 2024-12-27T09:27:57.415Z (12 months ago)
Topics: naive-bayes, naive-bayes-algorithm
Language: Python
Size: 117 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# naive-bayes-sample

Naive Bayes sample program.

It infers the category of a document written in Japanese.

### Before Run

Install BeautifulSoup before running: http://www.crummy.com/software/BeautifulSoup/

This app uses the Yahoo morphological analysis API:

http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html

To use it with English text, modify `morphological.py`.
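
As a rough illustration of what an English replacement might look like, the sketch below splits text into lowercase word tokens with a regular expression. The function name `split_english` is hypothetical and is not taken from `morphological.py`; the real module's interface may differ, so treat this only as a starting point.

```python
import re

def split_english(text):
    """Return lowercase word tokens from English text (hypothetical helper)."""
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())

print(split_english("Naive Bayes isn't hard to implement."))
```

Unlike Japanese, English already separates words with spaces, so no external API call is needed for this step.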

### Run

	$ python naivebayes.py

### Reference

http://gihyo.jp/dev/serial/01/machine-learning/0003?page=1

### Naive Bayes

	P(cat, doc) = P(cat|doc)P(doc) = P(doc|cat)P(cat)

	=> P(cat|doc) = P(doc|cat)P(cat) / P(doc)

P(doc) is the same for every category, so to compare categories by P(cat|doc) we only need to calculate (1) P(doc|cat) and (2) P(cat).
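
To make the rule concrete, here is a small worked example in Python. The category names and all probabilities are made up for illustration and do not come from this repository. It shows that dividing by P(doc) only rescales the scores, so the best category can be picked from P(doc|cat)P(cat) alone.

```python
# Made-up numbers for illustration only.
prior = {"sports": 0.5, "politics": 0.5}          # P(cat)
likelihood = {"sports": 0.02, "politics": 0.005}  # P(doc|cat)

# Unnormalized scores P(doc|cat)P(cat) for each category.
score = {cat: likelihood[cat] * prior[cat] for cat in prior}

# P(doc) is the sum of the scores over all categories.
p_doc = sum(score.values())

# P(cat|doc) = P(doc|cat)P(cat) / P(doc)
posterior = {cat: score[cat] / p_doc for cat in score}

# The winner is the same whether we compare scores or posteriors.
best = max(score, key=score.get)
print(best, posterior)
```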

(1) P(doc|cat)

	P(doc|cat) = P(word1|cat)P(word2|cat)...P(wordn|cat)

A document is a collection of words, so if we assume the words are independent of each other given the category, P(doc|cat) can be approximated by this product.

	P(word|cat) = ( number of times the word appears in the category ) / ( number of all words in the category )

(2) P(cat)

	P(cat) = ( number of training documents in this category ) / ( number of all training documents )
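
The two estimates above can be sketched with simple counters. This is a minimal Python 3 sketch, not the code in `naivebayes.py`: the training data and the names `train_data`, `prior_prob`, and `word_prob` are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy training data: (category, words) pairs, invented for illustration.
train_data = [
    ("sports", ["ball", "game", "team"]),
    ("sports", ["game", "score"]),
    ("politics", ["vote", "election", "team"]),
]

cat_count = Counter()              # number of training documents per category
word_count = defaultdict(Counter)  # word occurrences per category

for cat, words in train_data:
    cat_count[cat] += 1
    word_count[cat].update(words)

def prior_prob(cat):
    """P(cat): share of training documents labeled with cat."""
    return cat_count[cat] / sum(cat_count.values())

def word_prob(word, cat):
    """P(word|cat): occurrences of word in cat / all word occurrences in cat."""
    return word_count[cat][word] / sum(word_count[cat].values())

print(prior_prob("sports"))
print(word_prob("game", "sports"))
```

With these plain counts, a word that never appears in a category gets probability 0, which zeroes the whole product. Add-one (Laplace) smoothing is a common remedy, but it is not part of the formulas above.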

* Using Logarithm

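Each P() value is so small that multiplying many of them together can cause floating-point underflow, so we use logarithms.

This converts the product into a sum of logs. Because log is monotonically increasing, the category with the largest sum is also the category with the largest product.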

	log( P(doc|cat)P(cat) )

	= log( P(word1|cat)P(word2|cat)...P(wordn|cat)P(cat) )

	= log P(word1|cat) + log P(word2|cat) + ... + log P(wordn|cat) + log P(cat)
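
The underflow problem is easy to reproduce. In the sketch below all probabilities are made up for illustration: a 100-word document where every word has probability 1e-5. The direct product collapses to zero in double precision, while the sum of logs stays finite and can still be compared across categories.

```python
import math

# Made-up probabilities for illustration only.
p_cat = 0.5
word_probs = [1e-5] * 100  # P(word_i|cat) for a 100-word document

# Direct product: the true value (5e-501) is too small for a double.
product = p_cat
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale.
log_score = math.log(p_cat) + sum(math.log(p) for p in word_probs)

print(product)    # underflows to 0.0
print(log_score)  # a finite negative number
```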
