https://github.com/takp/naive-bayes-sample
Infer the category of the document written in Japanese
naive-bayes naive-bayes-algorithm
- Host: GitHub
- URL: https://github.com/takp/naive-bayes-sample
- Owner: takp
- Created: 2014-11-21T10:19:17.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2014-11-21T14:43:19.000Z (about 11 years ago)
- Last Synced: 2024-12-27T09:27:57.415Z (12 months ago)
- Topics: naive-bayes, naive-bayes-algorithm
- Language: Python
- Size: 117 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# naive-bayes-sample
A Naive Bayes sample program that infers the category of a document.
### Before Run
You need to install "BeautifulSoup": http://www.crummy.com/software/BeautifulSoup/
This app uses the "Yahoo morphological analysis API":
http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
If you want to use it with English text, please modify 'morphological.py'.
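
As a rough idea of what an English replacement for the word-splitting step could look like, here is a minimal sketch. The function name `split_words` is hypothetical. The actual interface of 'morphological.py' is not shown in this README, so the name and return type would need to be adapted to whatever 'naivebayes.py' expects.

```python
import re


def split_words(text):
    """Hypothetical stand-in for the Japanese morphological analysis step.

    Lowercases the text and returns runs of ASCII letters as words,
    keeping simple contractions such as "doesn't" together.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


print(split_words("Naive Bayes doesn't need much training data."))
```

Unlike Japanese, English already separates words with spaces, so a regular expression can be enough for a simple experiment and no web API call is needed.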
### Run
$ python naivebayes.py
### Reference
http://gihyo.jp/dev/serial/01/machine-learning/0003?page=1
### Naive Bayes
P(cat, doc) = P(cat|doc)P(doc) = P(doc|cat)P(cat)
=> P(cat|doc) = P(doc|cat)P(cat) / P(doc)
Now we can get P(cat|doc) by calculating (1) P(doc|cat) and (2) P(cat).
P(doc) is the same for every category, so it can be ignored when comparing categories.
(1) P(doc|cat)
P(doc|cat) = P(word1|cat)P(word2|cat)...P(wordn|cat)
A document is a collection of words, so by assuming that the words are independent of each other we can approximate P(doc|cat) with this product.
P(word|cat) = ( number of times the word shows up in the category ) / ( number of all words in the category )
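
The ratio above can be sketched with plain counters. This is a minimal illustration, not the code in 'naivebayes.py'. The toy data and the names `train`, `word_counts` and `word_prob` are made up, and it assumes "all words" means all word occurrences in that category's training documents.

```python
from collections import Counter, defaultdict

# Toy training data (made up for illustration): (category, words) pairs.
train = [
    ("sports", ["ball", "team", "ball"]),
    ("sports", ["team", "win"]),
    ("tech", ["code", "bug", "code", "team"]),
]

# word_counts[cat][word] = how many times `word` appears in category `cat`.
word_counts = defaultdict(Counter)
for cat, words in train:
    word_counts[cat].update(words)


def word_prob(word, cat):
    """P(word|cat): occurrences of the word in cat / all word occurrences in cat."""
    total = sum(word_counts[cat].values())
    return word_counts[cat][word] / total


print(word_prob("ball", "sports"))  # 2 / 5 = 0.4
print(word_prob("code", "sports"))  # 0.0, "code" never appears in "sports"
```

Note that a word never seen in a category gets probability 0, which makes the whole product 0. Implementations commonly add a small count to every word (smoothing) to avoid this; the sketch leaves that out to match the formula as written.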
(2) P(cat)
P(cat) = ( number of training documents in this category ) / ( number of all training documents )
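
A minimal sketch of this ratio, assuming the training data is available as one category label per document. The labels and the names `train_labels`, `cat_counts` and `cat_prob` are hypothetical.

```python
from collections import Counter

# Hypothetical category labels, one per training document.
train_labels = ["sports", "sports", "tech", "sports"]

cat_counts = Counter(train_labels)


def cat_prob(cat):
    """P(cat): documents labelled cat / all training documents."""
    return cat_counts[cat] / len(train_labels)


print(cat_prob("sports"))  # 3 / 4 = 0.75
print(cat_prob("tech"))    # 1 / 4 = 0.25
```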
* Using Logarithm
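We need to use logarithms because every P() is so small that multiplying many of them may cause underflow.
So we convert the product into a sum of logs.
log( P(doc|cat)P(cat) )
= log( P(word1|cat)P(word2|cat)...P(wordn|cat)P(cat) )
= log P(word1|cat) + log P(word2|cat) + ... + log P(wordn|cat) + log P(cat)
Because log is monotonically increasing, the category with the largest sum of logs is also the category with the largest product.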
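
Putting the pieces together, the log score can be sketched as a small classifier. This is an illustration under stated assumptions, not the repository's implementation: the class `NaiveBayes` and its method names are hypothetical, the training data is made up, and add-one smoothing is added (an assumption not described in this README) so that an unseen word does not lead to log(0).

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Hypothetical minimal classifier using sums of log probabilities."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # word_counts[cat][word]
        self.cat_counts = Counter()              # documents per category
        self.vocab = set()                       # all words seen in training

    def train(self, words, cat):
        self.word_counts[cat].update(words)
        self.cat_counts[cat] += 1
        self.vocab.update(words)

    def log_score(self, words, cat):
        # log P(cat)
        score = math.log(self.cat_counts[cat] / sum(self.cat_counts.values()))
        total = sum(self.word_counts[cat].values())
        for word in words:
            # log P(word|cat) with add-one smoothing (assumption, see lead-in).
            count = self.word_counts[cat][word] + 1
            score += math.log(count / (total + len(self.vocab)))
        return score

    def classify(self, words):
        # Pick the category with the largest log score.
        return max(self.cat_counts, key=lambda cat: self.log_score(words, cat))


nb = NaiveBayes()
nb.train(["ball", "team", "win"], "sports")
nb.train(["team", "goal"], "sports")
nb.train(["code", "bug", "fix"], "tech")

print(nb.classify(["team", "win", "goal"]))  # sports
print(nb.classify(["bug", "code"]))          # tech
```

Log scores are negative numbers, and only their order matters. They are never converted back to probabilities, which is what avoids the underflow.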