https://github.com/shunk031/gwork

Classify Gunosy news articles with a Naive Bayes classifier and predict the article category on a Django server.

# G Work

## Install required modules

``` shell
$ pip install -r python_requirements.txt
```

## Collect news articles from Gunosy

``` shell
$ cd scripts
$ python crawl_page.py CATEGORY # specify article category
```

The collected articles are stored in the `data` directory under the current directory.

## Preprocess

### Make a single CSV file

Merge the collected articles into a single CSV file per category. The CSV files are stored in `GClassifier/dataset/row`.

``` shell
$ cd scripts
$ python make_single_file.py all # specify article category or all
```
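As a rough illustration of this merging step, here is a minimal sketch in Python. It is not the code in `make_single_file.py`: the one-text-file-per-article layout, the `merge_articles` helper, and the `file_name`/`text` columns are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def merge_articles(category_dir, csv_path):
    """Write one CSV row (file name, text) per article file in category_dir."""
    # Assumption: each article is stored as a UTF-8 .txt file.
    article_paths = sorted(Path(category_dir).glob("*.txt"))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "text"])
        for path in article_paths:
            writer.writerow([path.name, path.read_text(encoding="utf-8").strip()])
    return len(article_paths)


# Demonstrate on a throwaway directory with two hypothetical articles.
with tempfile.TemporaryDirectory() as tmp:
    category_dir = Path(tmp) / "sports"
    category_dir.mkdir()
    (category_dir / "a1.txt").write_text("first article\n", encoding="utf-8")
    (category_dir / "a2.txt").write_text("second article\n", encoding="utf-8")
    csv_path = Path(tmp) / "sports.csv"
    merged = merge_articles(category_dir, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(merged, rows)
```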

### Wakachigaki (word segmentation)

Segment the article text into words (wakachigaki) and format it. The output CSV files are stored in `GClassifier/dataset/preprocess`.

``` shell
$ cd GClassifier
$ python g_preprocess.py all --wakati_type mecab-noun

# to use word-level n-grams instead
$ python g_preprocess.py all --wakati_type word-ngram --ngram_n 2
```
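To make the `word-ngram` option concrete, the sketch below shows what word-level n-grams look like for an already segmented token list. The `word_ngrams` function, the `_` joiner, and the sample tokens are hypothetical and chosen for illustration; the way `g_preprocess.py` actually builds and stores n-grams may differ.

```python
def word_ngrams(tokens, n=2):
    """Return the word-level n-grams of a token list as joined strings."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tokens; inputs shorter than n yield nothing.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical, already segmented headline (the segmentation itself is not shown).
tokens = ["新型", "スマートフォン", "が", "発売"]
bigrams = word_ngrams(tokens, n=2)
print(bigrams)
```

With `n=2`, each feature is a pair of adjacent words, so some word-order information survives that a plain bag of single words would lose.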

### Train the Naive Bayes model and dump it

Train the Naive Bayes model on the segmented (wakachigaki) data and dump it to `GClassifier/naive_bayes_model.pkl`.

``` shell
$ cd GClassifier
$ python dump_classifier.py mecab-noun_all # or n-gram_all
```
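For readers unfamiliar with the technique, the following is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, followed by a pickle round trip. It is an assumption-laden illustration, not the implementation in `dump_classifier.py`: the `train` and `predict` functions, the dictionary-based model, and the toy data are all hypothetical, and the model is pickled to bytes in memory rather than to `naive_bayes_model.pkl`.

```python
import math
import pickle
from collections import Counter


def train(documents, labels):
    """Collect the counts multinomial Naive Bayes needs."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = set().union(*word_counts.values())
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab_size": len(vocabulary)}


def predict(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    total_docs = sum(model["class_counts"].values())
    scores = {}
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        denominator = sum(counts.values()) + model["vocab_size"]
        # Log prior plus the add-one smoothed log likelihood of each token.
        scores[label] = math.log(doc_count / total_docs) + sum(
            math.log((counts[word] + 1) / denominator) for word in tokens)
    return max(scores, key=scores.get)


# Toy, already tokenized training data (hypothetical).
documents = [["goal", "match", "team"], ["team", "win", "match"],
             ["stock", "market", "price"], ["market", "bank", "price"]]
labels = ["sports", "sports", "economy", "economy"]

model = train(documents, labels)
# Round trip through pickle; pickle.dump(model, file) would write to disk instead.
restored = pickle.loads(pickle.dumps(model))
print(predict(restored, ["team", "match"]))
```

Working in log space avoids numeric underflow on long articles, and the smoothing keeps a word unseen in one category from driving that category's probability to zero.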

## Run the news category prediction server

Run the server, open [http://localhost:8000/predict_category/](http://localhost:8000/predict_category/), and enter the URL of a Gunosy article.

``` shell
$ python manage.py runserver
```

## Model validation

We evaluated the classifier using 5-fold cross-validation. The results are available [here](https://github.com/shunk031/GWork/blob/master/GClassifier/README.md).

``` shell
$ cd GClassifier
$ python train_cross_validation.py mecab-noun_all --kfold 5
```
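As a sketch of what k-fold cross-validation does with `--kfold 5`, the code below splits sample indices into five folds, each used once as the test set. The `k_fold_indices` helper is hypothetical and is not taken from `train_cross_validation.py`; it also does no shuffling or stratification, which a real evaluation would usually add.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample lands in exactly one test fold, so averaging the per-fold accuracy uses every article for evaluation once while never testing on data the model was trained on in that fold.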