https://github.com/torinriley/bayesian-causal-inference

Bayesian causal inference model using BERT embeddings to estimate the causal effect of review length on sentiment polarity.

bayesian-statistics causal-inference

Host: GitHub
URL: https://github.com/torinriley/bayesian-causal-inference
Owner: torinriley
Created: 2024-12-27T21:30:02.000Z (6 months ago)
Default Branch: main
Last Pushed: 2024-12-27T21:42:15.000Z (6 months ago)
Last Synced: 2024-12-27T22:24:45.555Z (6 months ago)
Topics: bayesian-statistics, causal-inference
Language: Python
Homepage:
Size: 0 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Bayesian Causal Inference

## Overview
This project uses **Bayesian causal inference** to estimate the effect of **review length** (measured in words) on **sentiment polarity** (positive or negative) in Yelp reviews. To adjust for confounding by what a review actually says, the model includes **BERT embeddings** of the review text as covariates, so the length effect is estimated conditional on semantic content.

## Key Features

- **Causal Inference**:
  - Estimates the causal effect of review length on sentiment polarity.
  - Uses the **NUTS (No-U-Turn Sampler)** algorithm to perform posterior inference.
- **Natural Language Processing (NLP)**:
  - Extracts semantic features from reviews using pretrained **BERT embeddings**.
  - Adjusts for confounding factors in textual content.
- **Bayesian Modeling**:
  - Implements a probabilistic framework with Pyro to model relationships and account for uncertainty.
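
As a rough illustration of the two text-derived quantities listed above, the sketch below computes a word count and a BERT vector per review. The function names `word_count` and `embed_cls`, the whitespace-based counting, the `max_length` of 128, and the use of the final-layer `[CLS]` token as the review embedding are all assumptions; this README does not say how words are counted or how token embeddings are pooled.

```python
from typing import List


def word_count(review: str) -> int:
    """Review length as the number of whitespace-separated tokens (assumed definition)."""
    return len(review.split())


def embed_cls(texts: List[str], model_name: str = "bert-base-uncased", max_length: int = 128):
    """Return one vector per review: the final-layer [CLS] embedding (assumed pooling)."""
    # Heavy dependencies are imported lazily so word_count works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout so repeated calls give the same vectors

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Position 0 of every sequence is the [CLS] token.
    return outputs.last_hidden_state[:, 0, :]  # shape: (len(texts), hidden_size)
```

Under this definition `word_count("Great food, great service!")` is 4. Calling `embed_cls` downloads the pretrained weights on first use.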

## Methodology
1. **Data Preparation**:
   - The **Yelp Polarity Dataset** is used, with a random 1% sample of the training data.
   - Review content is tokenized and embedded using **BERT** (`bert-base-uncased`).
   - Features include:
     - `X`: Review lengths (word count).
     - `Z`: BERT embeddings (high-dimensional semantic representations).
     - `Y`: Sentiment labels (binary).

2. **Causal Model**:
   - The Bayesian model includes:
     - **β (beta)**: The causal effect of review length on sentiment.
     - **σ (sigma)**: Noise parameter accounting for variability in sentiment.
     - **Z Weights**: Contributions of BERT embeddings to sentiment prediction.

3. **Inference**:
   - The model uses the **NUTS algorithm** to sample from the posterior distribution of parameters.
   - Posterior samples for `α`, `σ`, and `β` are analyzed to estimate the causal effect and its uncertainty.

## Results
- **Posterior Distributions**:
  - The posterior distributions of the model parameters (`α`, `σ`, `β`) are visualized.
  - Insights include:
    - **β (causal effect)**: Its posterior shows the sign and size of the estimated effect of review length on sentiment polarity, and how much of its mass lies away from zero.
    - **σ (noise)**: Captures the variability in sentiment that the model leaves unexplained.

- **Key Findings**:
  - After adjusting for semantic content (via BERT embeddings), textual content is a stronger predictor of sentiment than review length.
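
One way to turn posterior draws into the kind of statement made above is a posterior mean with an equal-tailed credible interval. The sketch below assumes a 95% level and uses hypothetical helper names `summarize` and `excludes_zero`; this README does not state how the influence of `β` was judged, so treat this as one reasonable convention rather than the project's method.

```python
import numpy as np


def summarize(draws, level=0.95):
    """Posterior mean and equal-tailed credible interval for 1-D draws."""
    draws = np.asarray(draws, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0  # percent of mass cut from each tail
    lower, upper = np.percentile(draws, [tail, 100.0 - tail])
    return {"mean": float(draws.mean()), "lower": float(lower), "upper": float(upper)}


def excludes_zero(summary):
    """True when the whole interval lies on one side of zero."""
    return summary["lower"] > 0.0 or summary["upper"] < 0.0
```

Applied to the `β` draws, `excludes_zero(summarize(beta_draws))` being `False` would mean the interval still covers zero, so the data do not pin down the direction of the length effect.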

## Requirements
- Python 3.8+
- Libraries:
  - `numpy`
  - `torch`
  - `pyro` (installed from PyPI as `pyro-ppl`)
  - `datasets`
  - `transformers`
  - `seaborn`
  - `matplotlib`

Install dependencies using:
```bash
pip install numpy torch pyro-ppl datasets transformers seaborn matplotlib
```

## Visualization
- The script generates a plot of the posterior distributions of the model parameters (`α`, `σ`, `β`), which is used to interpret the model's output.
