https://github.com/danishayman/speaker-change-detection
A Jupyter notebook implementation of speaker change detection using LSTM-based deep learning models on the IEMOCAP dataset.
- Host: GitHub
- URL: https://github.com/danishayman/speaker-change-detection
- Owner: danishayman
- Created: 2025-01-27T11:37:23.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-01-27T14:07:52.000Z (8 months ago)
- Last Synced: 2025-05-18T09:11:32.382Z (5 months ago)
- Topics: deep-learning, iemocap, ipython-notebook, jupyter-notebook, lstm, lstm-neural-networks, machine-learning, neural-network, python
- Language: Jupyter Notebook
- Size: 976 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🎙️ Speaker Change Detection using Deep Learning
A Jupyter notebook implementation of speaker change detection using LSTM-based deep learning models on the IEMOCAP dataset.
## 📋 Overview
This project implements a speaker change detection system using LSTM networks in a Jupyter notebook format. The system processes audio features (MFCC and F0) to identify points in a conversation where speaker transitions occur.
## 🔧 Prerequisites
- Python 3.8+
- Jupyter Notebook/Lab
- TensorFlow 2.x
- librosa
- parselmouth (published on PyPI as `praat-parselmouth`)
- numpy
- pandas
- matplotlib
- scikit-learn
- seaborn
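The repository ships its own `requirements.txt` (installed in the Setup section below); purely as a reference, a minimal file covering the prerequisites above might look like this:

```text
# sketch only; use the repo's own requirements.txt
tensorflow>=2.0
librosa
praat-parselmouth   # imports as `parselmouth`
numpy
pandas
matplotlib
scikit-learn
seaborn
notebook            # or jupyterlab
```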
## 📦 Setup

1. Clone the repository:
```bash
git clone https://github.com/danishayman/Speaker-Change-Detection.git
cd Speaker-Change-Detection
```

2. Install required packages:
```bash
pip install -r requirements.txt
```

3. Download the IEMOCAP dataset:
- The dataset can be obtained from [Kaggle](https://www.kaggle.com/datasets/dejolilandry/iemocapfullrelease/data)
- Place the downloaded dataset in your working directory
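If you have the Kaggle CLI installed and authenticated, the download in step 3 can be scripted; the dataset slug comes from the Kaggle URL above, and the zip name is assumed to follow the slug:

```bash
kaggle datasets download -d dejolilandry/iemocapfullrelease
unzip iemocapfullrelease.zip -d IEMOCAP
```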
## 📓 Notebook Structure

The project is contained in a single Jupyter notebook with the following sections:
1. **Import Libraries**: Setting up necessary Python packages
2. **Feature Extraction**:
- Loading audio files
- Extracting MFCC and F0 features (a sketch follows this list)
- Defining sliding window parameters
3. **Data Preprocessing**:
- RTTM parsing
- Label generation
- Dataset splitting
4. **Model Development**:
- Building LSTM model
- Training with different window sizes
- Performance evaluation
5. **Results and Analysis**:
- Visualization of results
- Confusion matrix analysis
- Comprehensive performance metrics
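To make the feature-extraction step concrete, here is a minimal sketch using librosa for MFCCs and parselmouth for F0. The frame length, hop, and number of coefficients are assumptions for illustration, not values taken from the notebook:

```python
import numpy as np
import librosa
import parselmouth

def extract_features(wav_path, sr=16000, hop_s=0.010, win_s=0.025, n_mfcc=13):
    """Return a (frames, n_mfcc + 1) matrix of MFCCs plus F0 per frame."""
    y, sr = librosa.load(wav_path, sr=sr)

    # MFCCs on a 25 ms window with a 10 ms hop -> shape (n_mfcc, T)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(win_s * sr), hop_length=int(hop_s * sr),
    )

    # F0 track from Praat via parselmouth; unvoiced frames come back as 0 Hz
    snd = parselmouth.Sound(y, sampling_frequency=sr)
    f0 = snd.to_pitch(time_step=hop_s).selected_array['frequency']

    # The two trackers yield slightly different frame counts; align by truncation
    t = min(mfcc.shape[1], len(f0))
    return np.hstack([mfcc[:, :t].T, f0[:t, None]])
```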
## 🚀 Features

- 🎵 Audio feature extraction (MFCC and F0)
- 🪟 Sliding window analysis with various sizes (3, 5, 7, 9 frames); a labeling sketch follows this list
- 🤖 LSTM-based architecture with batch normalization
- 📊 Comprehensive evaluation metrics and visualizations
- 📈 Experiment analysis with different window sizes
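The exact labeling rule lives in the notebook; a plausible sketch of the RTTM-parsing and window-labeling steps, assuming the standard RTTM field order (`SPEAKER <file> <chan> <onset> <duration> ... <speaker> ...`) and a 10 ms frame hop:

```python
import numpy as np

def rttm_change_times(rttm_path):
    """Start times of segments whose speaker differs from the previous segment."""
    segs = []
    with open(rttm_path) as f:
        for line in f:
            p = line.split()
            if p and p[0] == 'SPEAKER':
                segs.append((float(p[3]), p[7]))  # (onset in seconds, speaker id)
    segs.sort()
    return [t for (t, spk), (_, prev) in zip(segs[1:], segs[:-1]) if spk != prev]

def window_labels(n_frames, change_times, hop_s=0.010, win=7):
    """Label each `win`-frame window 1 if a speaker change falls inside it."""
    frame_t = np.arange(n_frames) * hop_s          # start time of every frame
    labels = np.zeros(n_frames - win + 1, dtype=int)
    for i in range(len(labels)):
        lo, hi = frame_t[i], frame_t[i + win - 1]
        labels[i] = int(any(lo <= c <= hi for c in change_times))
    return labels
```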
## 💻 Usage

1. Open the Jupyter notebook:
```bash
jupyter notebook speaker_change_detection.ipynb
```

2. Ensure your IEMOCAP dataset path is correctly set in the notebook:
```python
base_path = "path/to/your/IEMOCAP/dataset"
```

3. Run all cells sequentially to:
- Extract features
- Process data
- Train models
- Visualize results
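Before running all cells, a quick sanity check (hypothetical, not part of the notebook) that `base_path` actually points at audio:

```python
from pathlib import Path

base_path = "path/to/your/IEMOCAP/dataset"   # same variable as in step 2
wav_files = sorted(Path(base_path).rglob("*.wav"))
print(f"Found {len(wav_files)} wav files")   # should be well above zero
```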
## 📊 Results

The model's performance across different window sizes:
- Best Window Size: 7 frames
- Peak Accuracy: 66.94%
- Precision: 0.0047
- Recall: 0.6593
- F1-Score: 0.0093
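The numbers above come from the notebook; metrics of this kind can be reproduced from model predictions with scikit-learn along these lines (`y_true` and `y_prob` are placeholder values here):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# y_true: 0/1 window labels; y_prob: sigmoid outputs from the model
y_true = np.array([0, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='binary', zero_division=0)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print(confusion_matrix(y_true, y_pred))
```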
## 🔄 Model Architecture

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense, Dropout

# input_shape is set earlier in the notebook; assumed here to be
# (window_size, n_features), e.g. a 7-frame window of MFCC+F0 vectors.
input_shape = (7, 14)

model = Sequential([
    Input(shape=input_shape),
    LSTM(128, return_sequences=True),   # per-frame recurrent encoding
    BatchNormalization(),
    Dropout(0.3),
    LSTM(64),                           # collapse the window to a single vector
    BatchNormalization(),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),     # P(speaker change within the window)
])
```

## 🛠️ Future Improvements
- [ ] Implement data augmentation techniques
- [ ] Explore attention mechanisms
- [ ] Add residual connections
- [ ] Implement curriculum learning
- [ ] Experiment with additional acoustic features
- [ ] Optimize batch size and training epochs with better hardware

## 📚 Citation
```bibtex
@article{busso2008iemocap,
title = {IEMOCAP: Interactive emotional dyadic motion capture database},
author = {Busso, Carlos and Bulut, Murtaza and Lee, Chi-Chun and
Kazemzadeh, Abe and Mower, Emily and Kim, Samuel and
Chang, Jeannette and Lee, Sungbok and Narayanan, Shrikanth S},
journal = {Speech Communication},
volume = {50},
number = {11},
pages = {1150--1162},
year = {2008},
publisher = {Elsevier}
}
```

## ⚠️ Note
The current implementation struggles with severe class imbalance (speaker-change windows are rare, which is why precision is so low despite reasonable recall) and with computational constraints; the future improvements listed above target these limitations.
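One common mitigation for the imbalance, not shown in this README, is to weight the rare positive class during training, e.g.:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train: 0/1 window labels from preprocessing (illustrative 1% positive rate)
y_train = np.array([0] * 990 + [1] * 10)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_train)
class_weight = {0: weights[0], 1: weights[1]}

# Passed to Keras training so errors on the rare class cost more:
# model.fit(X_train, y_train, epochs=10, class_weight=class_weight)
```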