https://github.com/datahappy1/czech_language_sentiment_analyzer

Czech sentiment analyzer

Host: GitHub
URL: https://github.com/datahappy1/czech_language_sentiment_analyzer
Owner: datahappy1
License: mit
Created: 2019-07-20T20:02:24.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2023-05-22T22:36:57.000Z (over 2 years ago)
Last Synced: 2025-03-29T17:41:26.244Z (6 months ago)
Topics: bootstrap, chartsjs, czech, czech-language, czech-sentiment-analyzer, flask, heroku, heroku-app, logistic-regression, movie-ratings, movie-reviews, naive-bayes, postgres, python, python-3, scraper, sentiment, sentiment-analysis, sqlite3, support-vector-machine
Language: Python
Homepage: http://czester.herokuapp.com
Size: 204 MB
Stars: 3
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

##### 10000 ft. Overview
![10000 ft overview][10000ft_overview]

[10000ft_overview]: https://github.com/datahappy1/czech_language_sentiment_analyzer/blob/master/docs/img/10000ft_project_overview.png?raw=true "10000 ft. overview"

##### Data Collection
56k Czech movie reviews were collected with the multithreaded HTML scraping module `/data_preparation/data_collector_movie_review_scraper.py`.
The `langdetect` module was used to remove reviews written in Slovak, and Czech stopwords were filtered out of the remaining reviews. To balance the data, the final dataset was reduced to 11.5k positive and 11.5k negative reviews.
The collected data was also stemmed before the models were trained.
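
The cleaning and balancing steps can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the tiny stopword sample, the `"pos"`/`"neg"` labels, and the `detect_language` callable are all assumptions. In practice `langdetect.detect` could be passed as the detector, since it returns language codes such as `cs` or `sk`. Stemming is not covered here.

```python
import random
from typing import Callable, List, Set, Tuple

Review = Tuple[str, str]  # (text, label), label is "pos" or "neg" (assumed labels)

# Hypothetical, tiny stopword sample; the project uses a fuller Czech collection.
CZECH_STOPWORDS: Set[str] = {"a", "je", "to", "se", "na", "v"}


def clean_review(text: str, stopwords: Set[str] = CZECH_STOPWORDS) -> str:
    """Lowercase the text and drop stopwords."""
    return " ".join(w for w in text.lower().split() if w not in stopwords)


def keep_czech(reviews: List[Review],
               detect_language: Callable[[str], str]) -> List[Review]:
    """Keep only reviews whose detected language code is 'cs'."""
    return [r for r in reviews if detect_language(r[0]) == "cs"]


def balance(reviews: List[Review], seed: int = 0) -> List[Review]:
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    pos = [r for r in reviews if r[1] == "pos"]
    neg = [r for r in reviews if r[1] == "neg"]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)
```

Passing the detector in as a parameter keeps the sketch free of third-party imports and makes the Slovak filter easy to test with a stub.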

##### ML Models
Three ML models from the `Scikit-Learn` Python library, `Naive Bayes`, `Logistic regression` and `Support Vector Machine`, were trained and tested for text sentiment analysis.
The training and testing scripts are located here:

/ml_models/logistic_regression

/ml_models/naive_bayes

/ml_models/support_vector_machine

The overall sentiment score for a given text input is calculated as a weighted average of the three models' predictions, with each model weighted by its precision score.
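
A weighted average of this kind can be sketched as below. The function name, the dictionary keys, and all numeric values are hypothetical and chosen only for illustration; they are not the project's measured scores, and the repository may encode predictions differently.

```python
from typing import Dict


def weighted_sentiment(predictions: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Average the per-model scores, weighting each model by its quality score."""
    total_weight = sum(weights[name] for name in predictions)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(predictions[name] * weights[name] for name in predictions)
    return weighted_sum / total_weight


# Hypothetical values: 1.0 = positive, 0.0 = negative prediction.
predictions = {
    "naive_bayes": 1.0,
    "logistic_regression": 0.0,
    "support_vector_machine": 1.0,
}
# Hypothetical per-model precision scores used as weights.
weights = {
    "naive_bayes": 0.8,
    "logistic_regression": 0.9,
    "support_vector_machine": 0.7,
}
score = weighted_sentiment(predictions, weights)
```

With these example numbers the two models voting positive carry a combined weight of 1.5 out of 2.4, so the overall score leans positive.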

##### Flask web application
The Flask web application is currently hosted at https://czester.herokuapp.com; its source code is in `/flask_webapp/`.
The backend is written in Python using the `Flask` framework, with `Bootstrap` for template styling. The app also provides users with a simple API. The stats module integrates `Chart.js` with `Flask`, and its statistics data can be persisted in either `Sqlite3` or `Heroku Postgres`.
If you provide the app with an environment variable named `DATABASE_URL` containing the Heroku Postgres DB URL, such as `postgres://YourPostgresUrl`, the remote `Heroku Postgres` is used; otherwise a local `Sqlite3` db instance is used.
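
The backend selection described above can be sketched as follows. This is an assumption-laden illustration rather than the app's actual code: the function name `choose_backend` and the local file name `stats.db` are hypothetical.

```python
import os
from typing import Mapping, Tuple


def choose_backend(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (backend name, connection target) based on DATABASE_URL."""
    url = env.get("DATABASE_URL")
    if url:
        # Remote Heroku Postgres, e.g. postgres://YourPostgresUrl
        return ("postgres", url)
    # Fall back to a local SQLite file (hypothetical file name).
    return ("sqlite3", "stats.db")
```

For example, `choose_backend({})` selects the local SQLite fallback, while passing a mapping that contains `DATABASE_URL` selects Postgres.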

##### Input text dataflow diagram:
![Input text dataflow diagram][input_text_dataflow]

[input_text_dataflow]: https://github.com/datahappy1/czech_language_sentiment_analyzer/blob/master/docs/img/input_text_flow_diagram.png?raw=true "input text dataflow"

##### How to run this Flask app in a local environment
1) Create and activate a standard Python virtual environment or a pipenv environment

2) Install the requirements from `requirements.txt` using `pip3`

3) Set the working directory to the path where you cloned this repo (make sure it is the path where the Heroku `Procfile` is located)

##### TODOs / Future ideas