{"id":50414011,"url":"https://github.com/codersacademy006/speech-recognition-system","last_synced_at":"2026-05-31T05:02:35.934Z","repository":{"id":247891567,"uuid":"827147457","full_name":"CodersAcademy006/Speech-Recognition-System","owner":"CodersAcademy006","description":"The objective of this DLM (Deep Learning Model) is to recognize the emotions from speech.","archived":false,"fork":false,"pushed_at":"2024-07-11T11:59:57.000Z","size":62516,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-04T21:53:09.478Z","etag":null,"topics":["deep-learning","emotion-detection","emotion-recognition","emotion-recognizer","feature-extraction","gradient-boosting","keras","kneighborsclassifier","librosa","machine-learning","mfcc","mlp-classifier","neural-networks","random-forest-classifier","recurrent-neural-networks","sklearn","speech-emotion-recognition","support-vector-machine"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CodersAcademy006.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-11T05:20:07.000Z","updated_at":"2024-07-11T12:00:00.000Z","dependencies_parsed_at":"2025-04-12T00:29:25.963Z","dependency_job_id":null,"html_url":"https://github.com/CodersAcademy006/Speech-Recognition-System","commit_stats":null,"previous_names":["codersacademy006/speech-recognition-system"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CodersAcademy006/Spe
ech-Recognition-System","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FSpeech-Recognition-System","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FSpeech-Recognition-System/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FSpeech-Recognition-System/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FSpeech-Recognition-System/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CodersAcademy006","download_url":"https://codeload.github.com/CodersAcademy006/Speech-Recognition-System/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FSpeech-Recognition-System/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33719601,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","emotion-detection","emotion-recognition","emotion-recognizer","feature-extraction","gradient-boosting","keras","kneighborsclassifier","librosa","machine-learning","mfcc","mlp-classifier","neural-networks","random-forest-classifier","recurrent-neural-networks","sklearn","speech-emot
ion-recognition","support-vector-machine"],"created_at":"2026-05-31T05:02:35.774Z","updated_at":"2026-05-31T05:02:35.877Z","avatar_url":"https://github.com/CodersAcademy006.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Speech Emotion Recognition\n## Introduction\n- This repository handles building and training a Speech Emotion Recognition system.\n- The basic idea behind this tool is to build and train/test a suitable machine learning (as well as deep learning) algorithm that can recognize and detect human emotions from speech.\n- This is useful in many industry fields, such as product recommendation and affective computing.\n- Check this [tutorial](https://www.thepythoncode.com/article/building-a-speech-emotion-recognizer-using-sklearn) for more information.\n## Requirements\n- **Python 3.6+**\n### Python Packages\n- **tensorflow**\n- **librosa==0.6.3**\n- **numpy**\n- **pandas**\n- **soundfile==0.9.0**\n- **wave**\n- **scikit-learn==0.24.2**\n- **tqdm==4.28.1**\n- **matplotlib==2.2.3**\n- **pyaudio==0.2.11**\n- **[ffmpeg](https://ffmpeg.org/) (optional)**: used if you want to add more sample audio by converting it to a 16000Hz sample rate and mono channel, which is provided in ``convert_wavs.py``\n\nInstall these libraries with the following command:\n```\npip3 install -r requirements.txt\n```\n\n### Dataset\nThis repository uses 4 datasets (including this repo's custom dataset), which are already downloaded and formatted in the `data` folder:\n- [**RAVDESS**](https://zenodo.org/record/1188976) : The **R**yerson **A**udio-**V**isual **D**atabase of **E**motional **S**peech and **S**ong that contains 24 actors (12 male, 12 female) vocalizing two lexically-matched statements in a neutral North American accent.\n- [**TESS**](https://tspace.library.utoronto.ca/handle/1807/24487) : **T**oronto **E**motional **S**peech **S**et that was modeled on the Northwestern University Auditory Test No. 
6 (NU-6; Tillman \u0026 Carhart, 1966). A set of 200 target words was spoken in the carrier phrase \"Say the word _____\" by two actresses (aged 26 and 64 years).\n- [**EMO-DB**](http://emodb.bilderbar.info/docu/) : A database of emotional utterances spoken by actors, recorded in 1997 and 1999 as part of the DFG-funded research project SE462/3-1. The recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. Director of the project was Prof. Dr. W. Sendlmeier, Technical University of Berlin, Institute of Speech and Communication, department of communication science. Members of the project were mainly Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss.\n- **Custom** : An unbalanced, noisy dataset located in `data/train-custom` for training and `data/test-custom` for testing. You can easily add or remove recording samples by converting the raw audio to a 16000Hz sample rate and mono channel (this is provided by the ``convert_audio(audio_path)`` function in the `convert_wavs.py` script, which requires [ffmpeg](https://ffmpeg.org/) to be installed and in *PATH*) and appending the emotion to the end of the audio file name, separated by '_' (e.g. \"20190616_125714_happy.wav\" will be parsed automatically as happy).\n\n\n### Emotions available\nThere are 9 emotions available: \"neutral\", \"calm\", \"happy\", \"sad\", \"angry\", \"fear\", \"disgust\", \"ps\" (pleasant surprise) and \"boredom\".\n## Feature Extraction\nFeature extraction is the main part of the speech emotion recognition system. 
It is basically accomplished by converting the speech waveform to a parametric representation at a relatively lower data rate.\n\nIn this repository, we use the most commonly used features available in the [librosa](https://github.com/librosa/librosa) library, including:\n- [MFCC](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)\n- Chromagram\n- MEL Spectrogram Frequency (mel)\n- Contrast\n- Tonnetz (tonal centroid features)\n\n## Grid Search\nGrid search results are already provided in the `grid` folder, but if you want to tune various grid search parameters in `parameters.py`, you can run the script `grid_search.py`:\n```\npython grid_search.py\n```\nThis may take several hours to complete. Once it is finished, the best estimators are pickled and stored in the `grid` folder.\n\n## Example 1: Using 3 Emotions\nThe following shows how to build and train a model for classifying 3 emotions:\n```python\nfrom emotion_recognition import EmotionRecognizer\nfrom sklearn.svm import SVC\n# init a model, let's use SVC\nmy_model = SVC()\n# pass my model to the EmotionRecognizer instance\n# and balance the dataset\nrec = EmotionRecognizer(model=my_model, emotions=['sad', 'neutral', 'happy'], balance=True, verbose=0)\n# train the model\nrec.train()\n# check the test accuracy for that model\nprint(\"Test score:\", rec.test_score())\n# check the train accuracy for that model\nprint(\"Train score:\", rec.train_score())\n```\n**Output:**\n```\nTest score: 0.8148148148148148\nTrain score: 1.0\n```\n### Determining the best model\nTo determine the best model, you can run:\n\n```python\n# load the best estimators from the `grid` folder (found by GridSearchCV in `grid_search.py`),\n# set the model to the best one in terms of test score, and then train it\nrec.determine_best_model()\n# get the determined sklearn model name\nprint(rec.model.__class__.__name__, \"is the best\")\n# get the test accuracy score for the best estimator\nprint(\"Test score:\", 
rec.test_score())\n```\n**Output:**\n```\nMLPClassifier is the best\nTest score: 0.8958333333333334\n```\n### Predicting\nJust pass an audio path to the `rec.predict()` method as shown below:\n```python\n# this is a neutral speech sample from EMO-DB, from the testing set\nprint(\"Prediction:\", rec.predict(\"data/emodb/wav/15a04Nc.wav\"))\n# this is a sad speech sample from TESS, from the testing set\nprint(\"Prediction:\", rec.predict(\"data/validation/Actor_25/25_01_01_01_back_sad.wav\"))\n```\n**Output:**\n```\nPrediction: neutral\nPrediction: sad\n```\nYou can pass any audio file. If it's not in the appropriate format (16000Hz and mono channel), it'll be automatically converted; make sure you have `ffmpeg` installed on your system and added to *PATH*.\n## Example 2: Using RNNs for 5 Emotions\n```python\nfrom deep_emotion_recognition import DeepEmotionRecognizer\n# initialize instance\n# inherited from emotion_recognition.EmotionRecognizer\n# default parameters (LSTM: 128x2, Dense:128x2)\ndeeprec = DeepEmotionRecognizer(emotions=['angry', 'sad', 'neutral', 'ps', 'happy'], n_rnn_layers=2, n_dense_layers=2, rnn_units=128, dense_units=128)\n# train the model\ndeeprec.train()\n# get the accuracy\nprint(deeprec.test_score())\n# predict angry audio sample\nprediction = deeprec.predict('data/validation/Actor_10/03-02-05-02-02-02-10_angry.wav')\nprint(f\"Prediction: {prediction}\")\n```\n**Output:**\n```\n0.7717948717948718\nPrediction: angry\n```\nPredicting probabilities is also possible (for classification, of course):\n```python\nprint(deeprec.predict_proba(\"data/emodb/wav/16a01Wb.wav\"))\n```\n**Output:**\n```\n{'angry': 0.99878675, 'sad': 0.0009922335, 'neutral': 7.959707e-06, 'ps': 0.00021298956, 'happy': 8.3598025e-08}\n```\n### Confusion Matrix\n```python\nprint(deeprec.confusion_matrix(percentage=True, labeled=True))\n```\n**Output:**\n```\n              predicted_angry  predicted_sad  predicted_neutral  predicted_ps  predicted_happy\ntrue_angry          80.769226       
7.692308           3.846154      5.128205         2.564103\ntrue_sad            12.820514      73.076920           3.846154      6.410257         3.846154\ntrue_neutral         1.282051       1.282051          79.487183      1.282051        16.666668\ntrue_ps             10.256411       3.846154           1.282051     79.487183         5.128205\ntrue_happy           5.128205       8.974360           7.692308      8.974360        69.230774\n```\n## Example 3: Not Passing any Model and Removing the Custom Dataset\nThe code below initializes `EmotionRecognizer` with 3 chosen emotions while excluding the Custom dataset and setting `balance` to `False`:\n```python\nfrom emotion_recognition import EmotionRecognizer\n# initialize the instance; this will take a while the first time it is executed,\n# as it extracts the features and calls determine_best_model() automatically\n# to load the best-performing model on the picked dataset\nrec = EmotionRecognizer(emotions=[\"angry\", \"neutral\", \"sad\"], balance=False, verbose=1, custom_db=False)\n# it is already trained, so no need to train this time\n# get the confusion matrix on the test set\nprint(rec.confusion_matrix())\n# predict angry audio sample\nprediction = rec.predict('data/validation/Actor_10/03-02-05-02-02-02-10_angry.wav')\nprint(f\"Prediction: {prediction}\")\n```\n**Output:**\n```\n[+] Best model determined: RandomForestClassifier with 93.454% test accuracy\n\n              predicted_angry  predicted_neutral  predicted_sad\ntrue_angry          98.275864           1.149425       0.574713\ntrue_neutral         0.917431          88.073395      11.009174\ntrue_sad             6.250000           1.875000      91.875000\n\nPrediction: angry\n```\nYou can print the number of samples in each class:\n```python\nrec.get_samples_by_class()\n```\n**Output:**\n```\n         train  test  total\nangry      910   174   1084\nneutral    650   109    759\nsad        862   160   1022\ntotal     2422   443   2865\n```\nIn this case, the dataset is only from 
TESS and RAVDESS and is not balanced; you can pass `True` to `balance` on the `EmotionRecognizer` instance to balance the data.\n## Algorithms Used\nThis repository can be used to build machine learning classifiers as well as regressors for the case of 3 emotions {'sad': 0, 'neutral': 1, 'happy': 2} and the case of 5 emotions {'angry': 1, 'sad': 2, 'neutral': 3, 'ps': 4, 'happy': 5}.\n### Classifiers\n- SVC\n- RandomForestClassifier\n- GradientBoostingClassifier\n- KNeighborsClassifier\n- MLPClassifier\n- BaggingClassifier\n- Recurrent Neural Networks (Keras)\n### Regressors\n- SVR\n- RandomForestRegressor\n- GradientBoostingRegressor\n- KNeighborsRegressor\n- MLPRegressor\n- BaggingRegressor\n- Recurrent Neural Networks (Keras)\n\n### Testing\nYou can test your own voice by executing the following command:\n```\npython test.py\n```\nWait until the \"Please talk\" prompt appears, then start talking; the model will automatically detect your emotion when you stop talking.\n\nYou can change the emotions to predict, as well as the model; pass ``--help`` for more information.\n```\npython test.py --help\n```\n**Output:**\n```\nusage: test.py [-h] [-e EMOTIONS] [-m MODEL]\n\nTesting emotion recognition system using your voice, please consider changing\nthe model and/or parameters as you wish.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -e EMOTIONS, --emotions EMOTIONS\n                        Emotions to recognize separated by a comma ',',\n                        available emotions are \"neutral\", \"calm\", \"happy\"\n                        \"sad\", \"angry\", \"fear\", \"disgust\", \"ps\" (pleasant\n                        surprise) and \"boredom\", default is\n                        \"sad,neutral,happy\"\n  -m MODEL, --model MODEL\n                        The model to use, 8 models available are: \"SVC\",\"AdaBo\n                        ostClassifier\",\"RandomForestClassifier\",\"GradientBoost\n                        
ingClassifier\",\"DecisionTreeClassifier\",\"KNeighborsCla\n                        ssifier\",\"MLPClassifier\",\"BaggingClassifier\", default\n                        is \"BaggingClassifier\"\n\n```\n\n## Plotting Histograms\nThis will only work if grid search has been performed.\n```python\nfrom emotion_recognition import plot_histograms\n# plot histograms on different classifiers\nplot_histograms(classifiers=True)\n```\n**Output:**\n\n\u003cimg src=\"images/Figure.png\"\u003e\n\u003cp align=\"center\"\u003eA histogram showing the metric results of different algorithms on different data sizes, as well as the time consumed to train/predict.\u003c/p\u003e\n\n## Citation\n\n```bibtex\n@software{speech_emotion_recognition_2019,\n  author       = {Abdeladim Fadheli},\n  title        = {Speech Emotion Recognition},\n  version      = {1.0.0},\n  year         = {2019},\n  publisher    = {GitHub},\n  journal      = {GitHub repository},\n  url          = {https://github.com/x4nth055/emotion-recognition-using-speech}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodersacademy006%2Fspeech-recognition-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodersacademy006%2Fspeech-recognition-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodersacademy006%2Fspeech-recognition-system/lists"}