https://github.com/akshaysinhaaa/emova

# ConvEmoSentNet: A Parameter-Efficient Framework for Multimodal Emotion and Sentiment Analysis in Social Media Conversations

A deep learning framework designed for **emotion and sentiment recognition** using **text**, **audio**, and **video** modalities. This project leverages the **MELD (Multimodal EmotionLines Dataset)** to train a robust and flexible model that reflects human communication more accurately than unimodal models.

---

## 📦 Dataset: MELD

**Multimodal EmotionLines Dataset (MELD)** is a large-scale, multi-party conversation dataset derived from the TV series *Friends*. It provides aligned and synchronized **text**, **audio**, and **video** data, annotated with both **emotion** and **sentiment** labels.

- **Modalities**:
  - `Text`: Dialogues (utterances)
  - `Audio`: Speaker voice tone
  - `Video`: Speaker facial expressions and posture

- **Emotion Labels**:
  - Anger
  - Disgust
  - Fear
  - Joy
  - Neutral
  - Sadness
  - Surprise

- **Sentiment Labels**:
  - Positive
  - Negative
  - Neutral
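
For training, string labels like these are usually mapped to integer class indices. The following is a minimal sketch, assuming labels arrive as strings that may differ in case or spacing; the index order and the `encode_labels` helper are illustrative and not taken from the project code.

```python
from typing import Tuple

# Label sets as listed above; the index order is an arbitrary choice.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
SENTIMENTS = ["positive", "negative", "neutral"]

EMOTION_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}
SENTIMENT_TO_ID = {name: i for i, name in enumerate(SENTIMENTS)}


def encode_labels(emotion: str, sentiment: str) -> Tuple[int, int]:
    """Return (emotion_id, sentiment_id); raises KeyError on an unknown label."""
    return (
        EMOTION_TO_ID[emotion.strip().lower()],
        SENTIMENT_TO_ID[sentiment.strip().lower()],
    )
```

Keeping two separate mappings reflects that each utterance carries both an emotion and a sentiment label.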

🔗 [MELD Dataset GitHub](https://github.com/declare-lab/MELD)

---

## 🧠 Model Architecture

The model is **modular**: it can be trained on a single modality (`Text`, `Audio`, or `Video`) or on a fusion of them. It is designed to perform well even when one or more modalities are unavailable.

### 🔹 Individual Modality Encoders

| Modality | Model Used | Preprocessing |
|----------|--------------------|--------------------------------|
| Text | BERT | Tokenization, Padding |
| Audio | CNN | MFCC / Log-Mel Spectrogram |
| Video | ResNet18 / 3D-CNN | Face Extraction, Frame Sampling|
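
Two of the preprocessing steps in the table, padding and frame sampling, can be illustrated without any deep learning library. This is a rough sketch under assumptions: the helper names, the pad id of `0`, and the evenly spaced sampling rule are illustrative choices, not the project's actual preprocessing.

```python
from typing import List, Tuple


def pad_to_length(token_ids: List[int], max_len: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids to max_len and build an attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples evenly spaced frame indices; short clips repeat frames."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [(i * num_frames) // num_samples for i in range(num_samples)]


# Example: a 3-token utterance padded to length 5, and 5 frames drawn from 10.
ids, mask = pad_to_length([101, 7592, 102], max_len=5)
frames = sample_frame_indices(num_frames=10, num_samples=5)
```

The attention mask lets a text encoder ignore padded positions, and a fixed number of sampled frames gives every clip the same input shape.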

### 🔹 Multimodal Fusion Strategy

- Concatenation of latent vectors from each modality
- Optional **attention mechanism** to weight more informative modalities
- Final **Fully Connected Layers** leading to classification head (Softmax)

```
┌────────────┐   ┌────────────┐   ┌────────────┐
│    Text    │   │   Audio    │   │   Video    │
└────┬───────┘   └────┬───────┘   └────┬───────┘
     │                │                │
    BERT             CNN        3D CNN / ResNet
     │                │                │
     └────────────┬───┴────┬───────────┘
                  │ Fusion │
                  └────┬───┘
                Fully Connected
                    Softmax
```
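
The fusion step can be sketched in plain Python to show the data flow: concatenation in a fixed modality order, optional attention weights, and zero-filling for a missing modality. Everything here is an assumption for illustration: the `fuse` helper, the dot-product scoring, and the zero-fill rule are not the project's PyTorch implementation, and `scorers` stands in for learned attention parameters.

```python
import math
from typing import Dict, List, Optional

MODALITIES = ("text", "audio", "video")


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    latents: Dict[str, Optional[List[float]]],
    dims: Dict[str, int],
    scorers: Optional[Dict[str, List[float]]] = None,
) -> List[float]:
    """Concatenate modality vectors; missing modalities become zero vectors.

    If scorers is given, each present modality gets a dot-product score and
    its vector is scaled by the softmax of those scores (hypothetical attention).
    """
    present = [m for m in MODALITIES if latents.get(m) is not None]
    if not present:
        raise ValueError("at least one modality is required")
    weights = {m: 1.0 for m in present}
    if scorers is not None:
        scores = [sum(w * x for w, x in zip(scorers[m], latents[m])) for m in present]
        weights = dict(zip(present, softmax(scores)))
    fused: List[float] = []
    for m in MODALITIES:
        if m in weights:
            fused.extend(weights[m] * x for x in latents[m])
        else:
            fused.extend([0.0] * dims[m])  # placeholder for a missing modality
    return fused


# Example: audio is missing, so its slot is zero-filled.
example = fuse({"text": [1.0, 2.0], "audio": None, "video": [3.0]},
               dims={"text": 2, "audio": 2, "video": 1})
```

Because the fused vector always has the same length, the fully connected layers that follow see a fixed input size whichever modalities are present.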

---

## 🧪 Training Details

- **Optimizer**: Adam
- **Scheduler**: ReduceLROnPlateau
- **Loss Function**:
  - CrossEntropyLoss for multiclass emotion classification
  - Label Smoothing (0.1) to prevent overconfidence
- **Regularization**:
  - Dropout in FC layers (0.3–0.5)
  - Early Stopping based on validation loss
- **Batch Size**: 16–32
- **Epochs**: 15–25
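
To make two of these settings concrete, here is a plain-Python sketch of label-smoothed cross-entropy and of a validation-loss monitor that combines learning-rate reduction with early stopping. In PyTorch these roles are normally filled by `torch.nn.CrossEntropyLoss(label_smoothing=0.1)` and `torch.optim.lr_scheduler.ReduceLROnPlateau`. The `ValidationMonitor` class, its patience values, and its reduction factor are hypothetical, and the logic is simplified compared with the PyTorch scheduler.

```python
import math
from typing import List


def smoothed_cross_entropy(logits: List[float], target: int, eps: float = 0.1) -> float:
    """Cross-entropy against a target of (1 - eps) * one-hot + eps / K uniform."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                       # usual cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)     # penalty for overconfidence
    return (1.0 - eps) * nll + eps * uniform


class ValidationMonitor:
    """Tracks validation loss to drive LR reduction and early stopping."""

    def __init__(self, lr: float, factor: float = 0.5,
                 lr_patience: int = 2, stop_patience: int = 5) -> None:
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs % self.lr_patience == 0:
            self.lr *= self.factor  # reduce LR after every lr_patience bad epochs
        return self.bad_epochs >= self.stop_patience
```

With smoothing, a very confident correct prediction still incurs a small loss, which discourages extreme logits.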

### 🧵 Hyperparameter Tuning

- Tuned with a manual grid search over:
  - Learning rate (1e-3 to 1e-5)
  - Hidden layer sizes
  - Dropout rates
  - Fusion strategies (early vs late fusion)
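
A grid search over these dimensions can be sketched with `itertools.product`. This is only an illustration: the learning rates span the range given above, while the hidden sizes and the `evaluate` callback (which would train a model and return a validation score) are hypothetical placeholders.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple

GRID: Dict[str, List[Any]] = {
    "lr": [1e-3, 1e-4, 1e-5],      # range stated in the README
    "hidden_size": [128, 256],     # hypothetical values
    "dropout": [0.3, 0.5],         # endpoints of the stated dropout range
    "fusion": ["early", "late"],
}


def grid_search(
    grid: Dict[str, List[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination and return the best config and its score."""
    keys = list(grid)
    best_config: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)  # higher is better, e.g. validation accuracy
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

This grid has 3 × 2 × 2 × 2 = 24 combinations, so each added value multiplies the number of training runs.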

---

## 📈 Performance Snapshot

| Configuration | Emotion Precision | Emotion Accuracy | Sentiment Precision | Sentiment Accuracy |
|---------------|-------------------|------------------|---------------------|--------------------|
| Fused Model | 53.50% | 54.90% | 64.40% | 64.60% |
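
The README does not state how precision is averaged across classes, so the sketch below shows both macro and support-weighted averaging alongside accuracy. The function names and the toy labels are illustrative only and do not reproduce the figures in the table.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true: Sequence[str], y_pred: Sequence[str], average: str = "macro") -> float:
    """Multiclass precision; classes that are never predicted score 0."""
    classes = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    support = Counter(y_true)
    per_class = {
        c: (correct[c] / predicted[c] if predicted[c] else 0.0) for c in classes
    }
    if average == "macro":
        return sum(per_class.values()) / len(classes)
    if average == "weighted":
        return sum(per_class[c] * support[c] for c in classes) / len(y_true)
    raise ValueError("average must be 'macro' or 'weighted'")


# Toy example with four utterances.
y_true = ["joy", "joy", "neutral", "anger"]
y_pred = ["joy", "neutral", "neutral", "neutral"]
```

On imbalanced label sets the two averages can differ noticeably, so it helps to state which one a reported number uses.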

---

## 🧑‍💻 Authors

**Akshay Sinha, Gauri Saksena, Yash Chandel**
_Deep Learning | Multimodal AI | Emotion Recognition_