{"id":27818127,"url":"https://github.com/demfier/multimodal-speech-emotion-recognition","last_synced_at":"2025-05-01T15:41:06.772Z","repository":{"id":40960918,"uuid":"174432286","full_name":"Demfier/multimodal-speech-emotion-recognition","owner":"Demfier","description":"Lightweight and Interpretable ML Model for Speech Emotion Recognition and Ambiguity Resolution (trained on IEMOCAP dataset)","archived":false,"fork":false,"pushed_at":"2023-12-21T02:58:39.000Z","size":12522,"stargazers_count":337,"open_issues_count":8,"forks_count":84,"subscribers_count":12,"default_branch":"master","last_synced_at":"2023-12-21T05:30:35.593Z","etag":null,"topics":["iemocap","librosa","lstm","multimodal-emotion-recognition","pandas","python3","pytorch","scikit-learn","speech-emotion-recognition"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Demfier.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-03-07T22:50:22.000Z","updated_at":"2023-12-20T03:39:19.000Z","dependencies_parsed_at":"2022-08-27T02:49:37.842Z","dependency_job_id":"c3909534-53c6-4256-8f7b-3404f8b7a866","html_url":"https://github.com/Demfier/multimodal-speech-emotion-recognition","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Demfier%2Fmultimodal-speech-emotion-recognition","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Demfier%2Fmultimodal-speech-emotion-recognition/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/Demfier%2Fmultimodal-speech-emotion-recognition/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Demfier%2Fmultimodal-speech-emotion-recognition/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Demfier","download_url":"https://codeload.github.com/Demfier/multimodal-speech-emotion-recognition/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251901738,"owners_count":21662434,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["iemocap","librosa","lstm","multimodal-emotion-recognition","pandas","python3","pytorch","scikit-learn","speech-emotion-recognition"],"created_at":"2025-05-01T15:41:05.994Z","updated_at":"2025-05-01T15:41:06.766Z","avatar_url":"https://github.com/Demfier.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multimodal Speech Emotion Recognition and Ambiguity Resolution\n\n## Overview\nIdentifying emotion from speech is a non-trivial task, owing to the ambiguous definition of emotion itself. In this work, we build light-weight multimodal machine learning models and compare them against their heavier and less interpretable deep learning counterparts. For both types of models, we use hand-crafted features from a given audio signal. 
Our experiments show that the light-weight models are comparable to the deep learning baselines and even outperform them in some cases, achieving state-of-the-art performance on the IEMOCAP dataset.\n\nThe hand-crafted feature vectors obtained are used to train two types of models:\n\n1. ML-based: Logistic Regression, SVMs, Random Forest, eXtreme Gradient Boosting and Multinomial Naive-Bayes.\n2. DL-based: Multi-Layer Perceptron and LSTM Classifier.\n\nThis project was carried out as a course project for CS 698 - Computational Audio, taught by [Prof. Richard Mann](https://cs.uwaterloo.ca/~mannr/) at the University of Waterloo. For a more detailed explanation, please check the [report](https://arxiv.org/abs/1904.06022).\n\n## Datasets\nThe [IEMOCAP](https://link.springer.com/content/pdf/10.1007%2Fs10579-008-9076-6.pdf) dataset was used for all the experiments in this work. Please refer to the [report](https://arxiv.org/abs/1904.06022) for a detailed explanation of the pre-processing steps applied to the dataset.\n\n## Requirements\nAll the experiments have been tested using the following libraries:\n- xgboost==0.82\n- torch==1.0.1.post2\n- scikit-learn==0.20.3\n- numpy==1.16.2\n- jupyter==1.0.0\n- pandas==0.24.1\n- librosa==0.7.0\n\nTo avoid conflicts, it is recommended to set up a new Python virtual environment to install these libraries. Once the environment is set up, run `pip install -r requirements.txt` to install the dependencies.\n\n## Instructions to run the code\n1. Clone this repository by running `git clone git@github.com:Demfier/multimodal-speech-emotion-recognition`.\n2. Go to the root directory of this project by running `cd multimodal-speech-emotion-recognition/` in your terminal.\n3. Start a Jupyter notebook by running `jupyter notebook` from the root of this project.\n4. Run `1_extract_emotion_labels.ipynb` to extract labels from transcriptions and compile other required data into a CSV file.\n5. 
Run `2_build_audio_vectors.ipynb` to build vectors from the original wav files and save them into a pickle file.\n6. Run `3_extract_audio_features.ipynb` to extract 8-dimensional audio feature vectors from the audio vectors.\n7. Run `4_prepare_data.ipynb` to preprocess and prepare audio + text data for experiments.\n8. It is recommended to train `LSTMClassifier` before running any other experiments for easy comparison with other models later on:\n  - Edit `config.py` to change any of the experiment settings. For instance, if you want to train a speech2emotion classifier, make the necessary changes to `lstm_classifier/s2e/config.py`. The same procedure applies to training text2emotion (`t2e`) and text+speech2emotion (`combined`) classifiers.\n  - Run `python lstm_classifier.py` from `lstm_classifier/{exp_mode}` to train an LSTM classifier for the respective experiment mode (possible values of `exp_mode`: `s2e`, `t2e`, `combined`).\n9. Run `5_audio_classification.ipynb` to train ML classifiers for audio.\n10. Run `5.1_sentence_classification.ipynb` to train ML classifiers for text.\n11. 
Run `5.2_combined_classification.ipynb` to train ML classifiers for audio+text.\n\n**Note:** Make sure to set the correct model paths in the notebooks. Not all paths are relative right now, and this part of the code still needs some refactoring.\n\n**UPDATE**: You can access the preprocessed data files here to skip steps 4-7: [https://www.dropbox.com/scl/fo/jdzz2y9nngw9rxsbz9vyj/h?rlkey=bji7zcqclusagzfwa7alm59hx\u0026dl=0](https://www.dropbox.com/scl/fo/jdzz2y9nngw9rxsbz9vyj/h?rlkey=bji7zcqclusagzfwa7alm59hx\u0026dl=0)\n\n## Results\nAccuracy, F-score, Precision and Recall are reported for the different experiments.\n\n**Audio**\n\nModels | Accuracy | F1 | Precision | Recall\n---|---|---|---|---\nRF | 56.0 | **56.0** | 57.2 | **57.3**\nXGB | 55.6 | **56.0** | 56.9 | 56.8\nSVM | 33.7 | 15.2 | 17.4 | 21.5\nMNB | 31.3 | 9.1 | 19.6 | 17.2\nLR | 33.4 | 14.9 | 17.8 | 20.9\nMLP | 41.0 | 36.5 | 42.2 | 35.9\nLSTM | 43.6 | 43.4 | 53.2 | 40.6\nARE (4-class) | 56.3 | - | 54.6 | -\nE1 (4-class) | 56.2 | 45.9 | **67.6** | 48.9\n**E1** | **56.6** | 55.7 | 57.3 | **57.3**\n\nE1: Ensemble (RF + XGB + MLP)\n\n**Text**\n\nModels | Accuracy | F1 | Precision | Recall\n---|---|---|---|---\nRF | 62.2 | 60.8 | 65.0 | 62.0\nXGB | 56.9 | 55.0 | 70.3 | 51.8\nSVM | 62.1 | 61.7 | 62.5 | **63.5**\nMNB | 61.9 | 62.1 | **71.8** | 58.6\nLR | 64.2 | 64.3 | 69.5 | 62.3\nMLP | 60.6 | 61.5 | 62.4 | 63.0\nLSTM | 63.1 | 62.5 | 65.3 | 62.8\nTRE (4-class) | **65.5** | - | 63.5 | -\nE1 (4-class) | 63.1 | 61.4 | **67.7** | 59.0\n**E2** | 64.9 | **66.0** | 71.4 | 63.2\n\nE2: Ensemble (RF + XGB + MLP + MNB + LR)\n\nE1: Ensemble (RF + XGB + MLP)\n\n**Audio + Text**\n\nModels | Accuracy | F1 | Precision | Recall\n---|---|---|---|---\nRF | 65.3 | 65.8 | 69.3 | 65.5\nXGB | 62.2 | 63.1 | 67.9 | 61.7\nSVM | 63.4 | 63.8 | 63.1 | 65.6\nMNB | 60.5 | 60.3 | 70.3 | 57.1\nMLP | 66.1 | 68.1 | 68.0 | 69.6\nLR | 63.2 | 63.7 | 66.9 | 62.3\nLSTM | 64.2 | 64.7 | 66.1 | 65.0\nMDRE (4-class) | **75.3** | - | 71.8 | -\nE1 (4-class) | 70.3 | 67.5 | 
**73.2** | 65.5\n**E2** | 70.1 | **71.8** | 72.9 | **71.5**\n\nFor more details, please refer to the [report](https://arxiv.org/abs/1904.06022)\n\n## Citation\nIf you find this work useful, please cite:\n\n```\n@article{sahu2019multimodal,\n  title={Multimodal Speech Emotion Recognition and Ambiguity Resolution},\n  author={Sahu, Gaurav},\n  journal={arXiv preprint arXiv:1904.06022},\n  year={2019}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdemfier%2Fmultimodal-speech-emotion-recognition","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdemfier%2Fmultimodal-speech-emotion-recognition","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdemfier%2Fmultimodal-speech-emotion-recognition/lists"}