{"id":15673163,"url":"https://github.com/kysely/sentiment-analysis-czech","last_synced_at":"2025-07-05T02:35:43.350Z","repository":{"id":95960968,"uuid":"110200705","full_name":"kysely/sentiment-analysis-czech","owner":"kysely","description":"Conducting and publishing sentiment analysis experiments in the Czech language","archived":false,"fork":false,"pushed_at":"2019-12-14T00:40:40.000Z","size":25664,"stargazers_count":16,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-06T22:31:14.434Z","etag":null,"topics":["czech","machine-learning","sentiment-analysis"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kysely.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-10T04:00:15.000Z","updated_at":"2023-05-09T09:02:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"e1dd136f-1739-4c77-87cf-dfd645b7d29e","html_url":"https://github.com/kysely/sentiment-analysis-czech","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kysely/sentiment-analysis-czech","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kysely%2Fsentiment-analysis-czech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kysely%2Fsentiment-analysis-czech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kysely%2Fsentiment-analysis-czech/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/kysely%2Fsentiment-analysis-czech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kysely","download_url":"https://codeload.github.com/kysely/sentiment-analysis-czech/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kysely%2Fsentiment-analysis-czech/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263671838,"owners_count":23494047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["czech","machine-learning","sentiment-analysis"],"created_at":"2024-10-03T15:38:02.013Z","updated_at":"2025-07-05T02:35:43.334Z","avatar_url":"https://github.com/kysely.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sentence-Level Analysis Using Word Embeddings And CNNs\n*This experiment was my capstone project for Machine Learning Nanodegree at Udacity.*\n\nThe overall goal of this project was to create a simple sentence-level **sentiment classifier for the Czech language** that will differentiate between neutral, negative and positive sentiments.\n\nIt leverages the power of **word embeddings** for representing text data in 100-dimensional vector space and two-layer **convolutional neural network** for extracting features from the encoded text data.\n\nMain evaluation metric for the classifier is *categorical accuracy* computed on the testing set and *weighted F\u003csub\u003e1\u003c/sub\u003e score*. 
Baseline is a *weighted random guess*.\n\n[**→ Open the project notebook**][project notebook]\u003cbr /\u003e\n[→ Download the full project report][fullreport]\n\n## Project Summary\nFindings and notes from the research by [Habernal et al. (2013)][habernal], which explored sentiment recognition in Czech social media posts, were used to guide the project. Datasets for the experiment also come from the same research team.\n\n### Datasets\nThere are three sets [provided by the team](http://liks.fav.zcu.cz/sentiment/):\n\n- **Facebook:** 10 000 samples\u003cbr /\u003ewith 2 587 positive, 5 174 neutral, 1 991 negative and 248 bipolar posts\n- **ČSFD (Czech Film Database):** 91 381 samples\u003cbr /\u003ewith 30 897 positive, 30 768 neutral, and 29 716 negative movie reviews\n- **MALL.CZ (largest all-round e-commerce in Czechia):** 145 307 samples\u003cbr /\u003ewith 102 977 positive, 31 943 neutral, and 10 387 negative product reviews\n\nThe corpora, however, contain entries of arbitrary length. Since the goal was a *sentence-level* classifier, it was necessary to filter out any samples that were longer than one sentence.\n\nYou can read more on the extensive data pre-processing in [the full report][fullreport].\n\n\n### Models\nSince there were multiple datasets available, **4 individual models were trained**: one on each of the three original corpora and one on all three combined. 
All models share the same convolutional neural network architecture (displayed below), but were trained with different batch sizes and numbers of epochs.\n\nDuring training, the network tries to minimize the cross-entropy loss evaluated on a validation set, which is obtained by withholding 15% of the training data.\n\n\u0026nbsp;\n\n![ConvNet architecture used in this project](./images/cnn.png)\n\n\u0026nbsp;\n\n## Results\n\n| *model*    | Accuracy Score | F\u003csub\u003e1\u003c/sub\u003e Score | Naïve Accuracy | Naïve F\u003csub\u003e1\u003c/sub\u003e | Cross Entropy | # of Epochs | Batch Size |\n| :------- | -----: | ---: | -----: | ---: | -----: | ---: | ---: |\n| Facebook | 71.62% | 0.71 | 38.43% | 0.39 | 0.6102 | 5    | 20   |\n| ČSFD     | 71.34% | 0.71 | 33.88% | 0.33 | 0.6283 | 5    | 20   |\n| MALL.CZ  | 82.52% | 0.81 | 62.51% | 0.62 | 0.4495 | 20   | 1000 |\n| Combined | 67.82% | 0.67 | 32.23% | 0.67 | 0.7333 | 20   | 1000 |\n\nThe picture below displays four new fabricated questions and visualizes their predictions using the 4 trained models. Freely translated English versions are attached for reference.\n\nNote how the second sentence was chosen in contrast to the first one. It also illustrates the upper accuracy limit of sentiment analysis mentioned in the research by [Habernal et al. (2013)][habernal]. While some might see it as clearly neutral, it does have indications of negativity and might be labelled differently even by humans.\n\nYou can read more on the justification of the results in [the full report][fullreport].\n\n![Four random sentences and their classification using the four models](./images/tests.png)\n\n\u0026nbsp;\n\n---\n\n## Running the Project\n\nDevelopment code is written in **Python 3** and can be found in [the project Notebook][project notebook]. All important functions have a *docstring* that explains their inner workings. 
\n\n### Libraries used:\n- Pandas\n- NumPy\n- scikit-learn\n- NLTK\n- Seaborn\n- matplotlib\n- Keras (TensorFlow backend)\n- fastText\n- [`czech_stemmer.py`](http://research.variancia.com/czech_stemmer/)\n\n### Installing fastText\n\nSince the fastText library doesn't have a stable Python wrapper yet, it is used in the project via `subprocess`.\n\nYou will, however, need a working `fasttext` binary. One for Mac is already provided in the `lib` directory. If that one doesn't work for you, you can build your own binary by [following this easy guide](https://fasttext.cc/docs/en/support.html).\n\nIn any case, please make sure there is a working **`fasttext`** binary in the **`lib`** directory, because a script in the project calls the file at that exact path.\n\n### Provided data\n\nAll of the text datasets are already provided in the `data` directory. They can also be downloaded from their [original source](http://liks.fav.zcu.cz/sentiment/).\n\nThe weights for our trained models are also included in the `models` directory. You can load them by calling the `.load()` method on a `CNN` model instance. However, if you train the models again in the Notebook, these weights will be overwritten.\n\n### Data to be generated\n\nRunning the cells in the provided Notebook will also create some new data. 
\n\nInside the `data` directory, these files will be generated:\n- a `.txt` file for training the word vectors\n- 8 `.npy` files holding the final processed text data in NumPy array format\n\nAfter training the word vectors using fastText, a `combined_corpora_processed.bin` word model will be saved in the `word_models` directory.\n\n---\n\n## Licensing\n\n### Code and Implementation\nCopyright © 2017 Radek Kyselý.\n\nLicensed under [MIT](https://github.com/kysely/sentiment-analysis-czech/blob/sentence-level/LICENSE)\n\n### Datasets\nCopyright © 2013 Ivan Habernal, Tomáš Ptáček and Josef Steinberger.\n\nLicensed under [Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License](https://github.com/kysely/sentiment-analysis-czech/blob/sentence-level/licenses/DATASETS)\n\n### fastText Library\nCopyright © 2016-present, Facebook, Inc. All rights reserved.\n\nLicensed under [BSD](https://github.com/kysely/sentiment-analysis-czech/blob/sentence-level/licenses/FASTTEXT)\n\n### Czech Stemmer\nCopyright © 2010 Luís Gomes.\u003cbr /\u003e\nOriginal Java version by Ljiljana Dolamic, University of Neuchatel.\n\nLicensed under [Creative Commons Attribution 3.0 Unported License](https://github.com/kysely/sentiment-analysis-czech/blob/sentence-level/licenses/CZECH_STEMMER)\n\n---\n\n[← See more experiments of Czech sentiment analysis](https://github.com/kysely/sentiment-analysis-czech)\n\n[project notebook]: https://github.com/kysely/sentiment-analysis-czech/blob/sentence-level/Sentiment%20Analysis%20in%20Czech.ipynb\n[fullreport]: https://github.com/kysely/sentiment-analysis-czech/blob/sentence-level/capstone_report.pdf\n[habernal]: 
http://www.aclweb.org/anthology/W13-1609","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkysely%2Fsentiment-analysis-czech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkysely%2Fsentiment-analysis-czech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkysely%2Fsentiment-analysis-czech/lists"}