https://github.com/naritapandhe/naive-bayes-spark
Naive Bayes on Spark
- Host: GitHub
- URL: https://github.com/naritapandhe/naive-bayes-spark
- Owner: naritapandhe
- License: mit
- Created: 2016-08-18T19:55:48.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-09-23T01:54:26.000Z (over 9 years ago)
- Last Synced: 2025-02-25T15:52:27.100Z (11 months ago)
- Language: Python
- Size: 31.3 KB
- Stars: 1
- Watchers: 3
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
##########################################################
# Scalable Document Classification using Naive Bayes on Spark
#########################################################
This was one of the projects for the CSCI 8360 Data Science Practicum course. We use the Reuters news articles corpus, a set of news articles split into several categories. Each document can carry multiple class labels, but we consider only the following:
* CCAT: Corporate/Industrial
* ECAT: Economics
* GCAT: Government/Social
* MCAT: Markets
The goal was to build a Naive Bayes classifier without using any existing libraries or packages such as MLlib, ML, or scikit-learn.
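As a rough illustration of restricting documents to the four labels above, here is a minimal sketch. The comma-separated label format and the names `TARGET_LABELS` and `filter_labels` are assumptions made for this example, not taken from the project code.

```python
# Hypothetical helper: keep only the four top-level labels considered here.
TARGET_LABELS = {"CCAT", "ECAT", "GCAT", "MCAT"}


def filter_labels(label_line):
    """Split a comma-separated label string (assumed format) and keep target labels."""
    labels = [label.strip() for label in label_line.split(",")]
    return [label for label in labels if label in TARGET_LABELS]


print(filter_labels("CCAT,C15,GCAT"))  # ['CCAT', 'GCAT']
```

How a document that keeps more than one target label is then used for training is a separate design choice that this sketch does not settle.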
# Details
## Preprocessing
The raw input was processed in the following steps:
* Remove all special characters and numbers; only alphabetic characters are retained.
* Lowercase all words.
* Remove stopwords. We started from a classical stopword list, then enriched it with corpus-specific stopwords identified by applying Zipf's law.
* Use unigrams: their counts, conditioned on class, form the word vectors.
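The preprocessing steps above could be sketched roughly as follows in plain Python. This is an illustration under assumptions, not the project's code: `STOPWORDS` is a tiny placeholder list, and `tokenize`, `frequent_words`, and `class_word_counts` are hypothetical names. In a Spark job, the same functions would typically be applied inside RDD transformations.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stopword list, not the project's actual list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text, stopwords=STOPWORDS):
    """Keep alphabetic characters only, lowercase, and drop stopwords."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    return [w for w in letters_only.lower().split() if w not in stopwords]


def frequent_words(documents, top_k):
    """Return the top_k most frequent tokens as candidate corpus-specific stopwords.

    By Zipf's law a handful of words account for most occurrences, so the
    head of the frequency ranking is a natural place to look for stopwords.
    """
    counts = Counter(w for doc in documents for w in tokenize(doc, stopwords=set()))
    return [w for w, _ in counts.most_common(top_k)]


def class_word_counts(documents, labels):
    """Unigram counts conditioned on class: {label: Counter({word: count})}."""
    counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        counts[label].update(tokenize(doc))
    return counts


print(tokenize("The 3 Banks raised rates in 1996!"))  # ['banks', 'raised', 'rates']
```

Candidates returned by `frequent_words` would still need a manual look before being added to the stopword list, since very frequent words can also be strong class indicators.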
## Model
We implemented the standard Multinomial Naive Bayes classifier with Laplace (add-1) smoothing and log probabilities. This model achieved an accuracy of 94.88%.
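For reference, a minimal single-machine sketch of multinomial Naive Bayes with add-1 smoothing and log probabilities is shown below. It is not the project's distributed implementation: the function names `train` and `predict` are hypothetical, and ignoring words that never appeared in training is an assumption of this sketch. Each word likelihood is estimated as (count of the word in the class + 1) / (total words in the class + vocabulary size), and scores are summed in log space to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """docs: list of token lists; labels: one class label per document."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    # Log prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Laplace (add-1) smoothing: add 1 to every count, |V| to the total.
        denom = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(tokens, model):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in tokens if w in vocab)  # skip unseen words
        for c in log_prior
    }
    return max(scores, key=scores.get)


docs = [["stocks", "fell"], ["stocks", "rose"], ["election", "vote"]]
labels = ["MCAT", "MCAT", "GCAT"]
model = train(docs, labels)
print(predict(["stocks"], model))  # MCAT
```

In a distributed setting, the per-class word counts would be aggregated across partitions first; the smoothing and scoring arithmetic stays the same.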
# How to run
To execute the program, run the following command:
```
# <master-url> is a placeholder for your Spark master URL (for example, local[*])
path_to_spark_bin/spark-submit --master <master-url> distributedNB.py
```
The output is written to the directory from which the program is run, in a file named naiveBayesOutput.txt.
The URLs of the training and testing data are currently hardcoded. To train and test the program on files of your choice, change the values of the following variables:
* docData: training documents
* labelData: labels of the training documents
* testDocData: testing documents
# Team Members:
[Narita Pandhe](https://github.com/naritapandhe/)
Shubhi Jain
Priyanka Luthra