Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kr1shnasomani/tonesense
Speech emotion recognition from audio clips using CNN
- Host: GitHub
- URL: https://github.com/kr1shnasomani/tonesense
- Owner: kr1shnasomani
- Created: 2024-12-10T18:53:55.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-18T08:51:07.000Z (29 days ago)
- Last Synced: 2024-12-18T09:38:54.746Z (29 days ago)
- Topics: deep-learning, keras, librosa, matplotlib, neural-network, pandas, scikit-learn, seaborn, tensorflow
- Language: Jupyter Notebook
- Homepage:
- Size: 1.46 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ToneSense
The project leverages Librosa for audio feature extraction, including MFCCs and spectral features, and uses TensorFlow to develop a deep learning model for classifying emotions such as happiness, sadness, and anger. It incorporates spectrogram analysis for visual insights and supports real-time emotion recognition for practical applications.

## Execution Guide:
1. Run the following command in the terminal:
```
pip install numpy pandas librosa matplotlib seaborn tensorflow keras sounddevice
```
2. Download the dataset (link to the dataset: **https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess**)
3. Upon running the code, it also saves an additional file named `model.keras` (this file stores the trained model)
4. Enter the path of the audio file in the code and it will provide the prediction (see the sketch below)
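A minimal sketch of step 4, assuming the model was saved as `model.keras` and was trained on 40 time-averaged MFCCs; the `predict_emotion` helper and the emotion ordering below are illustrative and must match whatever label encoding the notebook actually used:

```python
import numpy as np
import librosa
from tensorflow import keras

# Assumed class order; it must match the encoder used during training.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "ps", "sad"]

def predict_emotion(audio_path, model_path="model.keras"):
    model = keras.models.load_model(model_path)
    y, sr = librosa.load(audio_path)                    # load clip at librosa's default rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape (40, frames)
    features = np.mean(mfcc, axis=1).reshape(1, 40, 1)  # average over time, add batch/channel dims
    probs = model.predict(features, verbose=0)[0]       # softmax over the 7 emotion classes
    idx = int(np.argmax(probs))
    return EMOTIONS[idx], float(probs[idx])

# Example usage:
# emotion, confidence = predict_emotion("path/to/clip.wav")
# print(f"{emotion} ({confidence:.2%})")
```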
## Accuracy & Loss Over Epochs:
![image](https://github.com/user-attachments/assets/e338a8ab-2a28-48ed-869a-ef683602cbc5)
![image](https://github.com/user-attachments/assets/95158724-54a8-4734-a009-72510c038877)
## Model Prediction:
![image](https://github.com/user-attachments/assets/1609e53a-ac10-46f6-8ddd-c413baddc4d1)
## Overview:
The code is a complete implementation of **Speech Emotion Recognition (SER)**. The goal of this project is to predict the emotional state (e.g., happy, sad, angry) of a speaker from their speech audio files. Below is an overview of the code's key components:

### **1. Dataset Handling**
- **Dataset**: The Toronto Emotional Speech Set (TESS), downloaded from Kaggle, contains labeled speech audio files corresponding to various emotions.
- **Labels**: The emotions include *fear, sad, angry, disgust, pleasant surprise (ps), neutral*, and *happy*. The labels are extracted from the file names (see the sketch after this list).
- **MFCC Features**:
- The **Mel-frequency cepstral coefficients (MFCC)**, which are widely used for speech processing, are computed for each audio file. MFCCs capture the frequency distribution of the audio signal in a way that reflects how humans perceive sound.
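A minimal sketch of how labels can be recovered from TESS file names, whose last underscore-separated token is the emotion; the `extract_label` helper is illustrative and not necessarily the notebook's own code:

```python
import os

def extract_label(path: str) -> str:
    """TESS filenames end with the emotion, e.g. 'YAF_dog_fear.wav' -> 'fear'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.split("_")[-1].lower()

# Example:
# extract_label("TESS/OAF_angry/OAF_back_angry.wav")  -> "angry"
```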
### **2. Data Preparation**
- **Feature Extraction**:
- A function is defined to compute 40 MFCC coefficients for each audio file. These coefficients are averaged over time for dimensionality reduction.
- The MFCCs are stored in a numpy array, reshaped into `(2800, 40, 1)` for compatibility with neural networks.
- **Label Encoding**:
- Labels are one-hot encoded (e.g., `[1, 0, 0, 0, 0, 0, 0]` for the emotion "fear").
- **Train-Test Split**: The dataset is divided into 80% training and 20% testing (a combined sketch of these preparation steps follows).
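A minimal sketch of the data-preparation steps above, assuming the dataset was unpacked into a `TESS/` folder; the helper name, the alphabetical one-hot ordering, and the `random_state` are assumptions (the README's example puts *fear* first, so the notebook's encoder may order classes differently):

```python
import glob
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

def extract_mfcc(path, n_mfcc=40):
    """40 MFCCs averaged over time -> one fixed-length feature vector per clip."""
    y, sr = librosa.load(path)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)

# "TESS/" is a placeholder for wherever the Kaggle dataset was unpacked.
paths = sorted(glob.glob("TESS/**/*.wav", recursive=True))
labels = [p.rsplit("_", 1)[-1].removesuffix(".wav").lower() for p in paths]

X = np.array([extract_mfcc(p) for p in paths]).reshape(-1, 40, 1)  # e.g. (2800, 40, 1)
y = to_categorical(LabelEncoder().fit_transform(labels))           # one-hot, shape (N, 7)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```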
### **3. Model Architecture**
The model is a **Sequential Neural Network** with the following layers (a minimal sketch follows the list):
- **LSTM Layer**: A Long Short-Term Memory (LSTM) layer processes sequential MFCC features to capture temporal dependencies in the speech data.
- **Batch Normalization**: Stabilizes and accelerates training by normalizing intermediate layer outputs.
- **Dense Layers**: Fully connected layers with decreasing units (256 → 128 → 64 → 32) for feature transformation.
- **Dropout Layers**: Applied after each dense layer to reduce overfitting.
- **Output Layer**: A dense layer with 7 units (one for each emotion) and a softmax activation to produce probabilities.
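A minimal Keras sketch of the architecture described above; the LSTM width and the dropout rate are assumptions, while the layer order and the 256 → 128 → 64 → 32 → 7 dense stack follow the description:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense, Dropout

model = Sequential([
    Input(shape=(40, 1)),            # 40 time-averaged MFCC coefficients per clip
    LSTM(256),                       # unit count is an assumption
    BatchNormalization(),
    Dense(256, activation="relu"),
    Dropout(0.3),                    # dropout rate is an assumption
    Dense(128, activation="relu"),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(32, activation="relu"),
    Dropout(0.3),
    Dense(7, activation="softmax"),  # one probability per emotion class
])
model.summary()
```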
### **4. Model Training**
- **Loss Function**: Categorical crossentropy, used for multi-class classification.
- **Optimizer**: Adam optimizer for adaptive learning rate adjustments.
- **Evaluation Metrics**: Accuracy is used to monitor training and validation performance.
- **Visualization**:
- Training and validation accuracy and loss are plotted over epochs (see the sketch below).
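A minimal sketch of the compile/fit/plot loop, assuming `model`, `X_train`, `y_train`, `X_test`, and `y_test` from the previous sketches; the epoch count and batch size are assumptions:

```python
import matplotlib.pyplot as plt

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=50, batch_size=64)   # epochs/batch size are assumptions

# Plot training vs. validation curves over epochs
for metric in ("accuracy", "loss"):
    plt.figure()
    plt.plot(history.history[metric], label=f"train {metric}")
    plt.plot(history.history[f"val_{metric}"], label=f"val {metric}")
    plt.xlabel("epoch")
    plt.ylabel(metric)
    plt.legend()
plt.show()
```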
### **5. Model Evaluation**
- The trained model achieves a high **validation accuracy of 97.86%**.
- The model is saved as a `.keras` file for future use (a minimal sketch follows).
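A minimal sketch of evaluation and saving, assuming the variables from the previous sketches; the 97.86% figure comes from the author's run and will not necessarily be reproduced:

```python
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Validation accuracy: {acc:.2%}")   # the README reports 97.86%

model.save("model.keras")                  # native Keras format; reload with keras.models.load_model
```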
### **6. Prediction and Visualization**
- **Emotion Prediction**:
- A function predicts the emotion of a given audio file using the trained model and provides a confidence score.
- **Audio Playback and Visualization**:
- The `play_audio()` function plays the audio file.
- The `display_audio_visuals()` function generates waveforms and spectrograms of the audio signal, providing insight into its structure and intensity (sketched below).
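A minimal sketch of the playback and visualization helpers; the function names match the README, but their internals (librosa + sounddevice + matplotlib) are assumptions:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import sounddevice as sd

def play_audio(path):
    """Play the clip through the default output device (blocks until finished)."""
    y, sr = librosa.load(path)
    sd.play(y, sr)
    sd.wait()

def display_audio_visuals(path):
    """Show the waveform and a dB-scaled spectrogram of the clip."""
    y, sr = librosa.load(path)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
    librosa.display.waveshow(y, sr=sr, ax=ax1)
    ax1.set_title("Waveform")
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
    ax2.set_title("Spectrogram (dB)")
    fig.colorbar(img, ax=ax2, format="%+2.0f dB")
    plt.tight_layout()
    plt.show()
```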
### **Key Features of the Code**
- **End-to-End Pipeline**: Covers data loading, preprocessing, feature extraction, model training, evaluation, and prediction.
- **Visualization**: Includes spectrograms and waveforms for intuitive understanding of audio signals.
- **Scalability**: The modular structure allows easy experimentation with different datasets, features, or models.

This project highlights how deep learning can be applied to **audio signal processing** for practical applications like human-computer interaction, mental health monitoring, and emotion-aware systems.