https://github.com/collab-uniba/emotion_and_polarity_so
An emotion classifier of text containing technical content from the SE domain
https://github.com/collab-uniba/emotion_and_polarity_so
classifier emotion emotion-classification emotion-detection emotion-recognition polarity sentiment sentiment-analysis
Last synced: 3 months ago
JSON representation
An emotion classifier of text containing technical content from the SE domain
- Host: GitHub
- URL: https://github.com/collab-uniba/emotion_and_polarity_so
- Owner: collab-uniba
- Created: 2017-02-15T14:04:51.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-12-14T03:13:47.000Z (5 months ago)
- Last Synced: 2025-01-10T05:40:52.968Z (4 months ago)
- Topics: classifier, emotion, emotion-classification, emotion-detection, emotion-recognition, polarity, sentiment, sentiment-analysis
- Language: OpenEdge ABL
- Homepage: http://collab.di.uniba.it/research
- Size: 59.2 MB
- Stars: 76
- Watchers: 9
- Forks: 19
- Open Issues: 15
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# EmoTxt
A toolkit for emotion detection from technical text. It is part of the Collab Emotion Mining Toolkit ([EMTk](https://github.com/collab-uniba/EMTk)).## Fair Use Policy
Please, cite the following paper if you intend to use our tool for your own research:
> F. Calefato, F. Lanubile, N. Novielli. “[EmoTxt: A Toolkit for Emotion Recognition from Text](https://arxiv.org/abs/1708.03892)” In *Proceedings of the Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, {ACII} Workshops 2017*, San Antonio, USA, Oct. 23-26, 2017, pp. 79-80, ISBN: 978-1-5386-0563-9.## Installation
You will need to install [Git LFS](https://git-lfs.github.com) extension to check out this project. Once installed and initialized, simply run:```bash
$ git clone https://github.com/collab-uniba/Emotion_and_Polarity_SO.git
```### Requirements
* Ram: 8GB
* Python 2.7.x
* Libraries
* `nltk`, `numpy`, `scikit_learn`, `scipy`, `pattern`
* Installation: open the command line and run
`$ pip install -r requirements.txt`
* Complete the nltk installation: Run the Python interpreter and type the commands
```
>>> import nltk
>>> nltk.download()
```* Java 8+
* Maven 3.x
* if you want to build the jar yourself type the following commands
```bash
cd java
mvn clean install
```
* The fat jar will be generated in the `java/target`folder with the name `EmotionAndPolarity-0.0.1-SNAPSHOT-jar-with-dependencies.jar`. Rename it as `Emotion_and_Polarity_SO.jar` and move it directly under the `java` folder.* R
* Libraries:
* `caret`, `LiblinearR` , `e1071`
* Installation: open the command line and run
`$ Rscript requirements.R`## Usage and dataset
In the following, we show first how to train a new model for emotion classification and, then, how to test the model on unseen data.For testing purposes, you can use the `sample.csv` input file available in the root of the repo. Other, more complex examples, look at the dataset files available under the subfolder [./java/DatasetSO/StackOverflowCSV](https://github.com/collab-uniba/Emotion_and_Polarity_SO/tree/master/java/DatasetSO/StackOverflowCSV).
If you are looking for the entire experimental dataset of ~5K Stack Overflow posts annotated with emotion, it is available from [this repository](https://github.com/collab-uniba/EmotionDatasetMSR18).
### Training a new model for emotion classification (70/30% split for train and test)
```bash
$ sh train.sh -i file.csv -d delimiter [-g] [-p] -e emotion
```where:
* `-i file.csv`: the input file coded in **UTF-8 without BOM**, containing the input corpus. Please, note that gold label are required for each item in the dataset. The format of the input file is the following:
```
id;label;text
...
22;NO;"""Excellent! This is exactly what I needed. Thanks!"""
23;YES;"""FEAR!!!!!!!!!!!"""
...
```
* `-d delimiter`: the delimiter used in the csv file (values in {`c`, `sc`}, where stands for comma and sc for semicolon). Please, note that all the example files provided here use semicolon as delimiter, so `-d sc` is a mandatory option during tests.
* `-g`: enables the extraction of n-grams (i.e,. bigrams and unigrams). N-grams extraction is mandatory for the first run when you want to train a new classification model for a given emotion, using your own dataset for the first time. Because n-gram extraction is computationally expensive, it should be skipped if you retrain the model for the same emotion using the same input file.
* `-p`: enables the extraction of features regarding politeness, mood and modality. Because this is computationally expensive, the switch is off by default.
* `-e emotion`: the specific emotion for which you want to train a classification model, with values in {`joy`, `anger`, `sadness`, `love`, `surprise`, `fear`}.As a result, the script will generate the following output files:
* An output folder named `training__/`, containing:
* `n-grams/`: a subfolder containing the extracted n-grams
* `idfs/`: a subfolder containing the IDFs computed for n-grams and WordNet Affect emotion words
* `feature-.csv`: a .csv file with the features extracted from the input corpus and used for training the model
* `liblinear/`:
* there are two subfolders: `DownSampling/` and `NoDownSampling/`. Each one contains:
* `trainingSet.csv`
* `testingSet.csv`
* eight models trained with liblinear `model__.Rda`, where `IDMODEL` is the ID of the liblinear model, with values in `{0,...,7}`):
* `performance__.txt`, containing the results of the parameter tuning for the model (best C) as performed by caret, the confusion matrix and the precision, recall and f-measure for the best cost for the specific emotion
* `predictions__.csv`, containing the test instances with the predicted labels for the specific emotion### Emotion detection
```bash
$ sh classify.sh -i file.csv -d delimiter -e emotion [-m model] [-f idf] [-o n-grams] [-l] [-p]
```where:
* `-i file.csv`: the input csv file with header and coded in **UTF-8 without BOM**, containing the corpus to be classified; the format of the input file is the following:
```
id;label;text
...
22;NO;"""Excellent! This is exactly what I needed. Thanks!"""
23;YES;"""FEAR!!!!!!!!!!!"""
...
```
* `-d delimiter`: the delimiter used in the csv file (values in {`c`, `sc`}, where stands for comma and sc for semicolon). Please, note that all the example files provided here use semicolon as delimiter, so `-d sc` is a mandatory option during tests.
* `-e emotion`: the specific emotion to be detected in the input file or text, defined in {`joy`, `anger`, `sadness`, `love`, `surprise`, `fear`}.
* `-m model`: the model file learnt during the training step (e.g., `model-anger.rda`). If you don't specify the model name, the default model will be used, that is the one learnt on our Stack Overflow gold standard.
* `-o n-grams`: if you specify a model name using `-m` (i.e., you don't want to use the default model for a given emotion) you are required to provide also the path to the folder containing the dictionaries extracted during the training step. This folder includes n-grams, i.e., `UnigramsList.txt` and `BigramsList.txt`.
* `-f idf`: if you specify a model name using `-m` (i.e., you don't want to use the default model for a given emotion) you are required to specify also the path to the folder containing the dictionaries with IDFs computed during the training step. The folder includes IDFs for n-grams (uni- and bi-grams) and for WordNet Affect lists of emotion words.
* `-l`: if presents , indicates `` contains a gold label in the column `label`.
* `-p`: enables the extraction of features regarding politeness, mood and modality. Because this is computationally expensive, the switch is off by default.As a result, the script will create an output folder named `classification__` containing:
* `predictions_.csv`: a csv file with header, containing a binary prediction (yes/no) for each line of the input corpus:
```
id;predicted
...
22;NO
23;YES
...
```
* `performance_.txt`: a file created only if the input corpus `` contains the column `label`; the file contains several performance metrics (Precision, Recall, F1, confusion matrix).For example, if you wanted to detect *anger* in the input file `sample.csv`, you would have to run:
```bash
$ sh classify.sh -i sample.csv -d sc -p -e anger
```