https://github.com/shunk031/gwork
Classify Gunosy news articles with a Naive Bayes classifier and predict article categories on a Django server
- Host: GitHub
- URL: https://github.com/shunk031/gwork
- Owner: shunk031
- Created: 2017-02-20T14:53:55.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-03-02T13:52:10.000Z (over 8 years ago)
- Last Synced: 2025-02-28T03:32:44.117Z (7 months ago)
- Language: Python
- Homepage:
- Size: 44.9 KB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# G Work
## Install required modules
``` shell
$ pip install -r python_requirements.txt
```

## Collect news articles from Gunosy
``` shell
$ cd scripts
$ python crawl_page.py CATEGORY # specify article category
```
The collected articles are stored in the `data` directory under the current directory.
## Preprocess
### Make single csv file
Combine the collected articles into a single CSV file per category. The CSV files are stored in `GClassifier/dataset/row`.
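The merge step can be pictured with a minimal sketch. It assumes each category has its own directory with one `.txt` file per article, which is a hypothetical layout and may not match what `crawl_page.py` actually writes. `merge_articles` and the column names are also illustrative, not taken from this repository.

```python
import csv
import tempfile
from pathlib import Path


def merge_articles(category_dir, output_csv):
    """Write one CSV row per .txt article found in category_dir.

    Hypothetical helper: the directory layout and the column names
    are assumptions made for this example.
    """
    category_dir = Path(category_dir)
    rows = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["category", "file", "text"])
        # Sort so the row order is stable across runs.
        for path in sorted(category_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            writer.writerow([category_dir.name, path.name, text])
            rows += 1
    return rows


# Demo on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    category = Path(tmp) / "sports"
    category.mkdir()
    (category / "0001.txt").write_text("試合に勝利した", encoding="utf-8")
    (category / "0002.txt").write_text("優勝が決まった", encoding="utf-8")
    out = Path(tmp) / "sports.csv"
    count = merge_articles(category, out)
    with open(out, newline="", encoding="utf-8") as f:
        merged_rows = list(csv.reader(f))

print(count, merged_rows[0])
```

Keeping one file per category makes it straightforward to process a single category or all of them in later steps.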
``` shell
$ cd scripts
$ python make_single_file.py all # specify article category or all
```

### Wakati-gaki (word segmentation)
Segment the article text into words (wakati-gaki) and format the result. The output CSV file is stored in `GClassifier/dataset/preprocess`.
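The word-level n-gram option groups each run of n consecutive words into one feature. A minimal sketch of that idea follows, assuming the text has already been split into tokens. `word_ngrams` is a hypothetical helper, not a function from `g_preprocess.py`, and the token list is written by hand rather than produced by MeCab.

```python
def word_ngrams(tokens, n=2):
    """Return the word-level n-grams of a token list as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tokens; inputs shorter
    # than n produce an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hand-written tokens standing in for a segmenter's output.
tokens = ["今日", "は", "良い", "天気", "です"]
bigrams = word_ngrams(tokens, n=2)
print(bigrams)
```

With `n=2`, five tokens yield four bigrams, so short phrases become features alongside the individual words.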
``` shell
$ cd GClassifier
$ python g_preprocess.py all --wakati_type mecab-noun
# if you use word-level n-grams
$ python g_preprocess.py all --wakati_type word-ngram --ngram_n 2
```

### Train the Naive Bayes model and dump it
Train a Naive Bayes model on the segmented data and dump it to `GClassifier/naive_bayes_model.pkl`.
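For readers unfamiliar with the technique, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, followed by a pickle round trip. This is an illustration under assumptions, not the code in `dump_classifier.py`: the `train` and `predict` functions, the model layout, and the toy data are all hypothetical.

```python
import math
import os
import pickle
import tempfile
from collections import Counter


def train(documents, labels):
    """Count documents per class and words per class (hypothetical model layout)."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
    vocabulary = set().union(*documents)
    return {"class_counts": class_counts,
            "word_counts": word_counts,
            "vocab_size": len(vocabulary)}


def predict(model, words):
    """Return the label with the highest smoothed log-probability."""
    total_docs = sum(model["class_counts"].values())
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(doc_count / total_docs)
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + model["vocab_size"]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hand-written training data (already segmented into words).
docs = [["サッカー", "試合", "勝利"], ["野球", "試合", "優勝"],
        ["映画", "公開", "主演"], ["ドラマ", "主演", "女優"]]
labels = ["sports", "sports", "entertainment", "entertainment"]
model = train(docs, labels)

# Dump the trained model to a .pkl file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "naive_bayes_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(predict(restored, ["試合", "勝利"]))
```

Dumping the trained model lets a server load it once at startup instead of retraining for every request.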
``` shell
$ cd GClassifier
$ python dump_classifier.py mecab-noun_all # or n-gram_all
```

## Run the news category prediction server
Run the server, open [http://localhost:8000/predict_category/](http://localhost:8000/predict_category/), and enter a Gunosy article URL.
``` shell
$ python manage.py runserver
```

## Model validation
We evaluated the classifier using 5-fold cross-validation. The results are [here](https://github.com/shunk031/GWork/blob/master/GClassifier/README.md).
``` shell
$ cd GClassifier
$ python train_cross_validation.py mecab-noun_all --kfold 5
```
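
As a rough illustration of what k-fold cross-validation does, the sketch below splits the samples into k folds, trains on k-1 of them, and measures accuracy on the held-out fold. It is not the logic of `train_cross_validation.py`. The helper names are hypothetical, and a majority-label baseline stands in for the real classifier so the example stays self-contained.

```python
from collections import Counter


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, fit, predict, k=5):
    """Return the accuracy measured on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(samples), k):
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Stand-in classifier: always predicts the most common training label.
def fit_majority(samples, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Toy data; real data should be shuffled before splitting into folds.
samples = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
scores = cross_validate(samples, labels, fit_majority, predict_majority, k=5)
print(scores, sum(scores) / len(scores))
```

Averaging the per-fold accuracies gives a single estimate that uses every sample for testing exactly once.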