https://github.com/akshaysinhaaa/emova
A deep learning framework designed for emotion and sentiment recognition using text, audio, and video modalities. This project leverages the MELD (Multimodal EmotionLines Dataset) to train a robust and flexible model that reflects human communication more accurately than unimodal models.
bert cnn cuda deep-learning multimodal python pytorch resnet-18 tensorboard transformers
- Host: GitHub
- URL: https://github.com/akshaysinhaaa/emova
- Owner: akshaysinhaaa
- Created: 2025-04-06T17:39:12.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-24T19:32:32.000Z (about 1 year ago)
- Last Synced: 2025-05-24T20:31:42.915Z (about 1 year ago)
- Topics: bert, cnn, cuda, deep-learning, multimodal, python, pytorch, resnet-18, tensorboard, transformers
- Language: Python
- Homepage:
- Size: 31.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ConvEmoSentNet: A Parameter-Efficient Framework for Multimodal Emotion and Sentiment Analysis in Social Media Conversations
A deep learning framework designed for **emotion and sentiment recognition** using **text**, **audio**, and **video** modalities. This project leverages the **MELD (Multimodal EmotionLines Dataset)** to train a robust and flexible model that reflects human communication more accurately than unimodal models.
---
## Dataset: MELD
**Multimodal EmotionLines Dataset (MELD)** is a large-scale, multi-party conversation dataset derived from the TV series *Friends*. It provides aligned and synchronized **text**, **audio**, and **video** data, annotated with both **emotion** and **sentiment** labels.
- **Modalities**:
  - `Text`: Dialogues (utterances)
  - `Audio`: Speaker voice tone
  - `Video`: Speaker facial expressions and posture
- **Emotion Labels**:
  - Anger
  - Disgust
  - Fear
  - Joy
  - Neutral
  - Sadness
  - Surprise
- **Sentiment Labels**:
  - Positive
  - Negative
  - Neutral
[MELD Dataset GitHub](https://github.com/declare-lab/MELD)
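
The label sets above map naturally onto integer class ids for training. The following is a minimal sketch of such a mapping. The names `EMOTION_TO_ID`, `SENTIMENT_TO_ID`, and `encode_labels`, as well as the alphabetical id ordering, are assumptions for illustration and are not taken from this repository.

```python
# Hypothetical label vocabularies; the id ordering is an assumption.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
SENTIMENTS = ["negative", "neutral", "positive"]

EMOTION_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}
SENTIMENT_TO_ID = {name: i for i, name in enumerate(SENTIMENTS)}


def encode_labels(emotion, sentiment):
    """Map label strings to integer class ids, ignoring case and whitespace."""
    return (
        EMOTION_TO_ID[emotion.strip().lower()],
        SENTIMENT_TO_ID[sentiment.strip().lower()],
    )
```

For example, `encode_labels("Joy", "Positive")` yields one id per task, so a single utterance can supervise both the emotion and the sentiment objective.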
---
## Model Architecture
The model is **modular** and allows training on individual or fused modalities: `Text`, `Audio`, and `Video`. It is designed to perform well when one or more modalities are missing or unavailable.
### Individual Modality Encoders
| Modality | Model Used | Preprocessing |
|----------|--------------------|--------------------------------|
| Text | BERT | Tokenization, Padding |
| Audio | CNN | MFCC / Log-Mel Spectrogram |
| Video | ResNet18 / 3D-CNN | Face Extraction, Frame Sampling|
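
The table lists frame sampling as a video preprocessing step without saying how frames are chosen. One common option is to take evenly spaced frames across the clip. The sketch below assumes that strategy; the function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick `num_samples` indices spread evenly over a clip of `num_frames` frames."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Clip is shorter than the requested sample count: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal-width segments.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

Sampling segment midpoints rather than the first frames keeps coverage of the whole utterance, which matters when the expression changes toward the end of a clip.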
### Multimodal Fusion Strategy
- Concatenation of latent vectors from each modality
- Optional **attention mechanism** to weight more informative modalities
- Final **fully connected layers** feeding a softmax classification head
```
┌────────────┐   ┌────────────┐   ┌────────────┐
│    Text    │   │   Audio    │   │   Video    │
└─────┬──────┘   └─────┬──────┘   └─────┬──────┘
      │                │                │
     BERT             CNN        3D CNN / ResNet
      │                │                │
      └────────────┬───┴────┬───────────┘
                   │ Fusion │
                   └───┬────┘
                Fully Connected
                    Softmax
```
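
The fusion stage described above (concatenated latent vectors, optional attention over modalities, fully connected layers, classification head) can be sketched in PyTorch roughly as follows. This is an illustrative sketch and not the repository's code: the class name `FusionClassifier`, the embedding sizes `(768, 128, 512)`, the hidden size, and the particular attention form are all assumptions.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Hypothetical fusion head over per-modality embeddings."""

    def __init__(self, dims=(768, 128, 512), hidden=256, num_classes=7,
                 use_attention=True, dropout=0.3):
        super().__init__()
        self.use_attention = use_attention
        # Project each modality to a shared size so attention can compare them.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.score = nn.Linear(hidden, 1)  # one scalar score per modality
        self.head = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats):
        # feats: list of tensors, one per modality, each of shape (batch, dim_i)
        z = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, M, H)
        if self.use_attention:
            # Softmax over the modality axis gives per-sample modality weights.
            w = torch.softmax(self.score(torch.tanh(z)), dim=1)  # (B, M, 1)
            z = z * w
        # Concatenate the (optionally reweighted) modality vectors and classify.
        return self.head(z.flatten(1))  # logits of shape (B, num_classes)
```

The module returns raw logits. `nn.CrossEntropyLoss` applies log-softmax internally, so an explicit softmax is only needed at inference time when probabilities are required.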
---
## Training Details
- **Optimizer**: Adam
- **Scheduler**: ReduceLROnPlateau
- **Loss Function**:
  - CrossEntropyLoss for multiclass emotion classification
  - Label Smoothing (0.1) to prevent overconfidence
- **Regularization**:
  - Dropout in FC layers (0.3–0.5)
  - Early Stopping based on validation loss
- **Batch Size**: 16–32
- **Epochs**: 15–25
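
The settings listed above can be wired together as in the sketch below. It is an assumption-laden illustration, not the project's training script: the small `nn.Sequential` model stands in for the real network, the data is random, and the learning rate, scheduler `factor`, and both `patience` values are made-up choices within the stated ranges.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and synthetic data (assumptions for illustration only).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 7))
x_train, y_train = torch.randn(64, 32), torch.randint(0, 7, (64,))
x_val, y_val = torch.randn(32, 32), torch.randint(0, 7, (32,))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

batch_size, max_epochs, stop_patience = 16, 25, 5
best_val, bad_epochs, history = float("inf"), 0, []

for epoch in range(max_epochs):
    model.train()
    for i in range(0, len(x_train), batch_size):
        optimizer.zero_grad()
        loss = criterion(model(x_train[i:i + batch_size]), y_train[i:i + batch_size])
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    history.append(val_loss)
    scheduler.step(val_loss)  # lower the learning rate when validation loss stalls

    # Early stopping on validation loss.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= stop_patience:
            break
```

Keeping the scheduler patience shorter than the early-stopping patience gives the reduced learning rate a few epochs to help before training is halted.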
### Hyperparameter Tuning
- Performed manually as a grid search over:
  - Learning rate (1e-3 to 1e-5)
  - Hidden layer sizes
  - Dropout rates
  - Fusion strategies (early vs late fusion)
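
A grid search over the dimensions above amounts to enumerating every combination and keeping the best one. The sketch below shows that enumeration with the standard library. The concrete grid values for hidden size and dropout, and the `evaluate` callback (a function that trains a configuration and returns a validation score), are hypothetical.

```python
from itertools import product

# Hypothetical grid; only the learning-rate range comes from the README.
grid = {
    "lr": [1e-3, 1e-4, 1e-5],
    "hidden": [128, 256],
    "dropout": [0.3, 0.5],
    "fusion": ["early", "late"],
}


def iter_configs(grid):
    """Yield one dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, evaluate):
    """Return (best_config, best_score); higher scores are treated as better."""
    scored = ((cfg, evaluate(cfg)) for cfg in iter_configs(grid))
    return max(scored, key=lambda pair: pair[1])
```

The grid here yields 3 × 2 × 2 × 2 = 24 configurations, which is small enough to run by hand, consistent with the manual search described above.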
---
## Performance Snapshot
| Configuration | Emotion Precision | Emotion Accuracy | Sentiment Precision | Sentiment Accuracy |
|---------------|-------------------|------------------|---------------------|--------------------|
| Fused Model   | 53.50%            | 54.90%           | 64.40%              | 64.60%             |
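
For reference, accuracy and precision for a multiclass task can be computed as in the sketch below. The README does not state how precision is averaged across classes, so the support-weighted average used here is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged by class support (assumed averaging mode)."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n in support.items():
        # A class that is never predicted contributes a precision of 0.
        prec = correct[label] / predicted[label] if predicted[label] else 0.0
        total += prec * n
    return total / len(y_true)
```

Because MELD is dominated by the neutral class, the choice between weighted and macro averaging can change the reported precision noticeably, so it is worth stating alongside any results.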
---
## Authors
**Akshay Sinha, Gauri Saksena, Yash Chandel**
_Deep Learning | Multimodal AI | Emotion Recognition_