Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/amaiya/causalnlp

CausalNLP is a practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.
https://github.com/amaiya/causalnlp

causal-inference nlp

Last synced: 3 days ago
JSON representation

CausalNLP is a practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.

Host: GitHub
URL: https://github.com/amaiya/causalnlp
Owner: amaiya
License: apache-2.0
Created: 2021-05-30T00:48:47.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-06-15T16:46:24.000Z (7 months ago)
Last Synced: 2025-01-15T06:00:15.272Z (10 days ago)
Topics: causal-inference, nlp
Language: Jupyter Notebook
Homepage: https://amaiya.github.io/causalnlp/
Size: 15.6 MB
Stars: 144
Watchers: 7
Forks: 11
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

        # Welcome to CausalNLP

## What is CausalNLP?

> CausalNLP is a practical toolkit for causal inference with text as

> treatment, outcome, or “controlled-for” variable.

## Features

- Low-code [causal

  inference](https://amaiya.github.io/causalnlp/examples.html) in as

  little as two commands

- Out-of-the-box support for using [**text** as a “controlled-for”

  variable](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-a-positive-review-on-product-views?)

  (e.g., confounder)

- Built-in

  [Autocoder](https://amaiya.github.io/causalnlp/autocoder.html) that

  transforms raw text into useful variables for causal analyses (e.g.,

  topics, sentiment, emotion, etc.)

- Sensitivity analysis to [assess robustness of causal

  estimates](https://amaiya.github.io/causalnlp/causalinference.html#CausalInferenceModel.evaluate_robustness)

- Quick and simple [key driver

  analysis](https://amaiya.github.io/causalnlp/key_driver_analysis.html)

  to yield clues on potential drivers of an outcome based on predictive

  power, correlations, etc.

- Can easily be applied to [“traditional” tabular datasets without

  text](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-having-a-PhD-on-making-over-$50K?)

  (i.e., datasets with only numerical and categorical variables)

- Includes an experimental [PyTorch

  implementation](https://amaiya.github.io/causalnlp/core.causalbert.html)

  of [CausalBert](https://arxiv.org/abs/1905.12741) by Veitch, Sridar,

  and Blei (based on [reference

  implementation](https://github.com/rpryzant/causal-bert-pytorch) by R.

  Pryzant)

## Install

1.  `pip install -U pip`

2.  `pip install causalnlp`

**NOTE**: On Python 3.6.x, if you get a

`RuntimeError: Python version >= 3.7 required`, try ensuring NumPy is

installed **before** CausalNLP (e.g., `pip install numpy==1.18.5`).

## Usage

To try out the

[examples](https://amaiya.github.io/causalnlp/examples.html) yourself:



### Example: What is the causal impact of a positive review on a product click?

``` python

import pandas as pd

```

``` python

df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')

```

The file `music_seed50.tsv` is a semi-simulated dataset from

[here](https://github.com/rpryzant/causal-text). Columns of relevance

include: - `Y_sim`: outcome, where 1 means product was clicked and 0

means not. - `text`: raw text of review - `rating`: rating associated

with review (1 through 5) - `T_true`: 0 means rating less than 3, 1

means rating of 5, where `T_true` affects the outcome `Y_sim`. - `T_ac`:

an approximation of true review sentiment (`T_true`) created with

[Autocoder](https://amaiya.github.io/causalnlp/autocoder.html) from raw

review text - `C_true`:confounding categorical variable (1=audio CD,

0=other)

We’ll pretend the true sentiment (i.e., review rating and `T_true`) is

hidden and only use `T_ac` as the treatment variable.

Using the `text_col` parameter, we include the raw review text as

another “controlled-for” variable.

``` python

from causalnlp import CausalInferenceModel

from lightgbm import LGBMClassifier

```

``` python

cm = CausalInferenceModel(df, 

                         metalearner_type='t-learner', learner=LGBMClassifier(num_leaves=500),

                         treatment_col='T_ac', outcome_col='Y_sim', text_col='text',

                         include_cols=['C_true'])

cm.fit()

```

    outcome column (categorical): Y_sim

    treatment column: T_ac

    numerical/categorical covariates: ['C_true']

    text covariate: text

    preprocess time:  1.1179866790771484  sec

    start fitting causal inference model

    time to fit causal inference model:  10.361494302749634  sec

#### Estimating Treatment Effects

CausalNLP supports estimation of heterogeneous treatment effects (i.e.,

how causal impacts vary across observations, which could be documents,

emails, posts, individuals, or organizations).

We will first calculate the overall average treatment effect (or ATE),

which shows that a positive review increases the probability of a click

by **13 percentage points** in this dataset.

**Average Treatment Effect** (or **ATE**):

``` python

print( cm.estimate_ate() )

```

    {'ate': 0.1309311542209525}

**Conditional Average Treatment Effect** (or **CATE**): reviews that

mention the word “toddler”:

``` python

print( cm.estimate_ate(df['text'].str.contains('toddler')) )

```

    {'ate': 0.15559234254638685}

**Individualized Treatment Effects** (or **ITE**):

``` python

test_df = pd.DataFrame({'T_ac' : [1], 'C_true' : [1], 

                        'text' : ['I never bought this album, but I love his music and will soon!']})

effect = cm.predict(test_df)

print(effect)

```

    [[0.80538201]]

**Model Interpretability**:

``` python

print( cm.interpret(plot=False)[1][:10] )

```

    v_music    0.079042

    v_cd       0.066838

    v_album    0.055168

    v_like     0.040784

    v_love     0.040635

    C_true     0.039949

    v_just     0.035671

    v_song     0.035362

    v_great    0.029918

    v_heard    0.028373

    dtype: float64

Features with the `v_` prefix are word features. `C_true` is the

categorical variable indicating whether or not the product is a CD.

### Text is Optional in CausalNLP

Despite the “NLP” in CausalNLP, the library can be used for causal

inference on data **without** text (e.g., only numerical and categorical

variables). See [the

examples](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-having-a-PhD-on-making-over-$50K?)

for more info.

## Documentation

API documentation and additional usage examples are available at:

https://amaiya.github.io/causalnlp/

## How to Cite

Please cite [the following paper](https://arxiv.org/abs/2106.08043) when

using CausalNLP in your work:

    @article{maiya2021causalnlp,

        title={CausalNLP: A Practical Toolkit for Causal Inference with Text},

        author={Arun S. Maiya},

        year={2021},

        eprint={2106.08043},

        archivePrefix={arXiv},

        primaryClass={cs.CL},

        journal={arXiv preprint arXiv:2106.08043},

    }