Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alxrm/scent-of-literature
Russian literature sentiment analysis in terms of very small dataset
- Host: GitHub
- URL: https://github.com/alxrm/scent-of-literature
- Owner: alxrm
- License: mit
- Created: 2017-02-04T20:06:15.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-02-05T11:06:35.000Z (about 8 years ago)
- Last Synced: 2024-12-06T14:46:47.598Z (2 months ago)
- Topics: classification, data-analysis, sentiment-analysis, sklearn, tf-idf
- Language: Python
- Size: 68.4 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
README
# Scent of literature
Sentiment analysis of Russian literature on a very small dataset.

## It uses
* _Pandas_ to read the input data
* _Sklearn_ for the classification work

## Usage
Just run this in a terminal:
```
./eval.py
```

### Variables
* `is_test_run` is a boolean that controls whether the script only reports its performance by testing itself on the training dataset, or performs a real prediction on `test_file`
* `train_file` is the path to the training dataset, which should contain text and labels (currently columns `1` and `2`, matching the structure of the default `train.tsv` file)
* `test_file` is the path to the file you want to run predictions on; it should contain a single column with the text to analyze (by default the 0th column is used, matching the `data.txt` structure)

## Under the hood
To build the vector dictionary it uses [TfIdfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which applies inverse document frequency weighting to the words and bigrams we feed it; more on tf-idf [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
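A minimal sketch of this vectorization step (the corpus and parameter values here are illustrative, not the project's actual configuration):

```python
# Illustrative sketch: tf-idf weighting over words and bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the weather was wonderful",
    "the weather was terrible",
]

# ngram_range=(1, 2) produces both single words and bigrams,
# each weighted by its tf-idf score.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)  # (n_documents, n_features): one tf-idf vector per document
```

The resulting sparse matrix is what gets handed to the classifier.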
To perform the classification it uses an [SGD](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) classifier (see also the [wiki](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)) with [hinge](https://en.wikipedia.org/wiki/Hinge_loss) loss, which makes it a linear [SVM](https://en.wikipedia.org/wiki/Support_vector_machine). As far as I know this shows the best results in sentiment analysis, while offering more tuning options than [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).
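A hedged sketch of this classification setup, with a made-up toy dataset (the real project trains on `train.tsv`):

```python
# Illustrative sketch: tf-idf features feeding an SGD-trained linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "a wonderful, joyful story",
    "a wonderful and bright tale",
    "a grim and hopeless tragedy",
    "a dark, hopeless ending",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# loss="hinge" turns SGDClassifier into a linear SVM trained with SGD.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(texts, labels)

print(model.predict(["a joyful bright story"]))
```

The pipeline wraps vectorization and classification together, so `predict` accepts raw strings.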
_Note_: the model's hyperparameters are chosen with sklearn's [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) (more on hyperparameter optimization [here](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) and are tuned for the best [F1 score](https://en.wikipedia.org/wiki/F1_score).
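A search of this kind could look like the following; the grid values and toy data are assumptions for illustration, not the project's actual search space:

```python
# Illustrative sketch: grid search over pipeline hyperparameters,
# selecting the combination with the best F1 score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "joyful and wonderful",
    "bright wonderful story",
    "joyful bright tale",
    "grim hopeless tragedy",
    "dark hopeless story",
    "grim dark ending",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])

# scoring="f1" makes the search pick parameters by F1 rather than accuracy.
grid = GridSearchCV(
    pipeline,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__alpha": [1e-4, 1e-3],
    },
    scoring="f1",
    cv=2,
)
grid.fit(texts, labels)

print(grid.best_params_)
```

`grid.best_estimator_` is then a fully fitted pipeline using the winning parameters.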
## Contribution
If you can improve the prediction performance and demonstrate it with a better F1 score, you are always welcome to send a PR.