Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/torinriley/bayesian-causal-inference
Bayesian causal inference model using BERT embeddings to estimate the causal effect of review length on sentiment polarity.
- Host: GitHub
- URL: https://github.com/torinriley/bayesian-causal-inference
- Owner: torinriley
- Created: 2024-12-27T21:30:02.000Z (29 days ago)
- Default Branch: main
- Last Pushed: 2024-12-27T21:42:15.000Z (28 days ago)
- Last Synced: 2024-12-27T22:24:45.555Z (28 days ago)
- Topics: bayesian-statistics, causal-inference
- Language: Python
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Bayesian Causal Inference
## Overview
This project demonstrates the use of **Bayesian causal inference** to investigate the relationship between **review length** (measured in words) and **sentiment polarity** (positive or negative) in Yelp reviews. The model adjusts for confounding variables in the review's content using **BERT embeddings**, enabling robust causal analysis of text data.

## Key Features
- **Causal Inference**:
  - Estimates the causal effect of review length on sentiment polarity.
  - Uses the **NUTS (No-U-Turn Sampler)** algorithm for posterior inference.
- **Natural Language Processing (NLP)**:
  - Extracts semantic features from reviews using pretrained **BERT embeddings**.
  - Adjusts for confounding factors in the textual content.
- **Bayesian Modeling**:
  - Implements a probabilistic framework with Pyro to model relationships and account for uncertainty.

## Methodology
1. **Data Preparation**:
   - The **Yelp Polarity Dataset** is used, with a random 1% sample of the training data.
   - Review text is tokenized and embedded using **BERT** (`bert-base-uncased`).
   - Features include:
     - `X`: review lengths (word counts).
     - `Z`: BERT embeddings (high-dimensional semantic representations).
     - `Y`: binary sentiment labels.

2. **Causal Model**:
   - The Bayesian model includes:
     - **β (beta)**: the causal effect of review length on sentiment.
     - **σ (sigma)**: a noise parameter accounting for variability in sentiment.
     - **Z weights**: the contributions of the BERT embeddings to sentiment prediction.

3. **Inference**:
   - The **NUTS** algorithm is used to sample from the posterior distribution of the parameters.
   - Posterior samples for `α`, `σ`, and `β` are analyzed to estimate the causal effect and its uncertainty.

## Results
- **Posterior Distributions**:
  - The posterior distributions of the parameters (`α`, `σ`, `β`) are visualized.
  - Insights include:
    - **β (causal effect)**: indicates whether review length significantly influences sentiment polarity.
    - **σ (noise)**: captures unexplained variability in sentiment.
- **Key Findings**:
  - After adjusting for semantic content (via BERT embeddings), textual content is a stronger predictor of sentiment than review length.

## Requirements
- Python 3.8+
- Libraries:
  - `numpy`
  - `torch`
  - `pyro-ppl`
  - `datasets`
  - `transformers`
  - `seaborn`
  - `matplotlib`

Install dependencies with:
```bash
pip install numpy torch pyro-ppl datasets transformers seaborn matplotlib
```

## Visualization
- The script generates a plot of the posterior distributions of the parameters (`α`, `σ`, `β`), enabling interpretation of the model's outputs.