Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alxrm/scent-of-literature
Russian literature sentiment analysis in terms of very small dataset
- Host: GitHub
- URL: https://github.com/alxrm/scent-of-literature
- Owner: alxrm
- License: mit
- Created: 2017-02-04T20:06:15.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-02-05T11:06:35.000Z (about 8 years ago)
- Last Synced: 2024-12-06T14:46:47.598Z (2 months ago)
- Topics: classification, data-analysis, sentiment-analysis, sklearn, tf-idf
- Language: Python
- Size: 68.4 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
README
# Scent of literature
Sentiment analysis of Russian literature on a very small dataset.

## It uses
* _Pandas_ to read the input data
* _Sklearn_ for the classification work

## Usage
Just run this in a terminal:
```
./eval.py
```

### Variables
* `is_test_run` is a boolean that controls whether the script only reports its performance by testing itself on the training dataset, or performs a real prediction on `test_file`
* `train_file` is the path to the training dataset, which should contain text and labels (currently columns `1` and `2`, matching the structure of the default `train.tsv` file)
* `test_file` is the path to the file you want to run predictions on; it should contain a single column with the text to analyze (by default the 0th column is used, matching the `data.txt` structure)

## Under the hood
To build the vector dictionary it uses [TfIdfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which applies inverse document frequency weighting to the words and bigrams we feed it; more on tf-idf [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
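A minimal sketch of this vectorization step (the corpus and parameter values here are illustrative, not the project's actual configuration):

```python
# Illustrative sketch: tf-idf weighting over words and bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the weather was wonderful",
    "the weather was terrible",
]

# ngram_range=(1, 2) produces both single words and bigrams,
# each weighted by its tf-idf score.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)  # (n_documents, n_features): one tf-idf vector per document
```

The resulting sparse matrix is what gets handed to the classifier.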
To perform the classification it uses an [SGD](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) classifier (see also the [wiki](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)) with [hinge](https://en.wikipedia.org/wiki/Hinge_loss) loss, which makes it a linear [SVM](https://en.wikipedia.org/wiki/Support_vector_machine). As far as I know this shows the best results in sentiment analysis, while offering more tuning options than [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).
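A hedged sketch of this classification setup, with a made-up toy dataset (the real project trains on `train.tsv`):

```python
# Illustrative sketch: tf-idf features feeding an SGD-trained linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "a wonderful, joyful story",
    "a wonderful and bright tale",
    "a grim and hopeless tragedy",
    "a dark, hopeless ending",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# loss="hinge" turns SGDClassifier into a linear SVM trained with SGD.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(texts, labels)

print(model.predict(["a joyful bright story"]))
```

The pipeline wraps vectorization and classification together, so `predict` accepts raw strings.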
_Note_: the model's hyperparameters are chosen with sklearn's [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) (more on hyperparameter optimization [here](https://en.wikipedia.org/wiki/Hyperparameter_optimization)) and are tuned for the best [F1 score](https://en.wikipedia.org/wiki/F1_score).
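A search of this kind could look like the following; the grid values and toy data are assumptions for illustration, not the project's actual search space:

```python
# Illustrative sketch: grid search over pipeline hyperparameters,
# selecting the combination with the best F1 score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "joyful and wonderful",
    "bright wonderful story",
    "joyful bright tale",
    "grim hopeless tragedy",
    "dark hopeless story",
    "grim dark ending",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])

# scoring="f1" makes the search pick parameters by F1 rather than accuracy.
grid = GridSearchCV(
    pipeline,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__alpha": [1e-4, 1e-3],
    },
    scoring="f1",
    cv=2,
)
grid.fit(texts, labels)

print(grid.best_params_)
```

`grid.best_estimator_` is then a fully fitted pipeline using the winning parameters.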
## Contribution
If you can improve the prediction performance and demonstrate it with a better F1 score, you are always welcome to send a PR.