https://github.com/torinriley/bayesian-causal-inference

Bayesian causal inference model using BERT embeddings to estimate the causal effect of review length on sentiment polarity.

bayesian-statistics causal-inference

# Bayesian Causal Inference

## Overview
This project uses **Bayesian causal inference** to estimate the effect of **review length** (measured in words) on **sentiment polarity** (positive or negative) in Yelp reviews. The model adjusts for confounding in the review's content by conditioning on **BERT embeddings** of the text, so that the estimated effect of length is not simply a proxy for what the review says.

## Key Features

- **Causal Inference**:
  - Estimates the causal effect of review length on sentiment polarity.
  - Uses the **NUTS (No-U-Turn Sampler)** algorithm to perform posterior inference.
- **Natural Language Processing (NLP)**:
  - Extracts semantic features from reviews using pretrained **BERT embeddings**.
  - Adjusts for confounding factors in textual content.
- **Bayesian Modeling**:
  - Implements a probabilistic framework with Pyro to model relationships and account for uncertainty.
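
To illustrate why adjusting for content matters, here is a minimal sketch on synthetic data. It is not the project's Bayesian model: it uses ordinary least squares with NumPy, and the helper names (`simulate`, `ols_slope_on_x`), the three stand-in content features, and all coefficients are assumptions chosen for illustration.

```python
import numpy as np


def simulate(n=5000, beta_true=0.2, seed=0):
    """Synthetic data in which content Z drives both length X and outcome Y."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 3))  # stand-in for content features
    # Review length depends on content, which is what makes Z a confounder.
    x = z @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
    # The outcome depends on length (true effect beta_true) and on content.
    y = beta_true * x + z @ np.array([1.0, 1.0, 0.0]) + rng.normal(size=n)
    return x, z, y


def ols_slope_on_x(x, y, z=None):
    """Least-squares coefficient on x, optionally adjusting for z."""
    cols = [np.ones_like(x), x]
    if z is not None:
        cols.append(z)
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # column 0 is the intercept, column 1 is x


x, z, y = simulate()
naive = ols_slope_on_x(x, y)        # ignores content
adjusted = ols_slope_on_x(x, y, z)  # conditions on content
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, true: 0.20")
```

With these simulated coefficients the unadjusted slope absorbs the content effect and lands well above the true value, while the adjusted slope should sit close to it. The project applies the same idea with BERT embeddings in place of the three synthetic features.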

## Methodology
1. **Data Preparation**:
   - The **Yelp Polarity Dataset** is used, with a random 1% sample of the training data.
   - Review content is tokenized and embedded using **BERT** (`bert-base-uncased`).
   - Features include:
     - `X`: Review lengths (word count).
     - `Z`: BERT embeddings (high-dimensional semantic representations).
     - `Y`: Sentiment labels (binary).
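
The data preparation step could look roughly like the sketch below. The function names are hypothetical, and several details are assumptions not stated above: whitespace splitting for the word count, the `[CLS]` token vector as the review embedding, the `yelp_polarity` dataset identifier on the Hugging Face Hub, and the batch size. Heavy libraries are imported inside the functions so that the word-count helper works on its own.

```python
from typing import List


def word_counts(texts: List[str]) -> List[int]:
    """Review length X: number of whitespace-separated tokens (an assumption)."""
    return [len(t.split()) for t in texts]


def load_sample(fraction: float = 0.01, seed: int = 0):
    """Draw a random fraction of the Yelp Polarity training split."""
    from datasets import load_dataset

    train = load_dataset("yelp_polarity", split="train")
    n = int(len(train) * fraction)
    sample = train.shuffle(seed=seed).select(range(n))
    return list(sample["text"]), list(sample["label"])


def embed_reviews(texts: List[str], model_name: str = "bert-base-uncased",
                  batch_size: int = 32):
    """Confounder features Z: one BERT vector per review."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    chunks = []
    with torch.no_grad():  # inference only, no gradients needed
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=512, return_tensors="pt")
            out = model(**enc)
            # Assumption: use the [CLS] position as the review embedding.
            chunks.append(out.last_hidden_state[:, 0, :])
    return torch.cat(chunks)
```

A typical call sequence would be `texts, labels = load_sample()`, then `X = word_counts(texts)` and `Z = embed_reviews(texts)`. Mean pooling over tokens is a common alternative to the `[CLS]` vector.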

2. **Causal Model**:
   - The Bayesian model includes:
     - **β (beta)**: The causal effect of review length on sentiment.
     - **σ (sigma)**: Noise parameter accounting for variability in sentiment.
     - **Z Weights**: Contributions of BERT embeddings to sentiment prediction.
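
One way to write down a model consistent with these parameters is a linear-Gaussian outcome model. This is a sketch under assumptions: the Gaussian likelihood is inferred from the presence of the noise parameter σ, `w` denotes the Z weights, and `α` (which appears in the inference and results steps) is taken to be an intercept. The priors are not specified here, so none are shown.

$$
Y_i \sim \mathcal{N}\!\left(\alpha + \beta X_i + w^\top Z_i,\ \sigma^2\right), \qquad i = 1, \dots, N
$$

Under this reading, β is the expected change in the sentiment label per additional word, holding the embedding `Z_i` fixed. Because `Y` is binary, a Bernoulli likelihood with a logit link would be a natural alternative, but that formulation has no σ term.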

3. **Inference**:
   - The model uses the **NUTS algorithm** to sample from the posterior distribution of parameters.
   - Posterior samples for `α`, `σ`, and `β` are analyzed to estimate the causal effect and its uncertainty.

## Results
- **Posterior Distributions**:
  - The posterior distributions of the model parameters (`α`, `σ`, `β`) are visualized.
  - Insights include:
    - **β (causal effect)**: Indicates whether review length significantly influences sentiment polarity.
    - **σ (noise)**: Captures unexplained variability in sentiment.

- **Key Findings**:
  - After adjusting for semantic content (via BERT embeddings), textual content is a stronger predictor of sentiment than review length.
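
To judge whether β indicates a meaningful effect, posterior draws are commonly reduced to a mean, a credible interval, and the probability that the effect is positive. The sketch below shows one way to do that with NumPy. The `summarize` helper is hypothetical, and the draws are simulated placeholders, not results from this project.

```python
import numpy as np


def summarize(draws, level=0.95):
    """Posterior mean, equal-tailed credible interval, and P(parameter > 0)."""
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    return {
        "mean": float(draws.mean()),
        "lower": float(lower),
        "upper": float(upper),
        "prob_positive": float((draws > 0).mean()),
    }


# Placeholder draws standing in for posterior samples of beta.
rng = np.random.default_rng(0)
fake_beta = rng.normal(loc=0.5, scale=0.1, size=5000)
print(summarize(fake_beta))
```

If the interval excludes zero, the data favor a nonzero effect of length under the model's assumptions. An interval that straddles zero means the sign of the effect is uncertain.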

## Requirements
- Python 3.8+
- Libraries:
  - `numpy`
  - `torch`
  - `pyro` (published on PyPI as `pyro-ppl`)
  - `datasets`
  - `transformers`
  - `seaborn`
  - `matplotlib`

Install dependencies using:
```bash
pip install numpy torch pyro-ppl datasets transformers seaborn matplotlib
```

## Visualization
- The script plots the posterior distributions of the model parameters (`α`, `σ`, `β`), so the estimates and their uncertainty can be read directly from the figure.
