Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/naoki-egami/DIDdesign

R package DIDdesign: Analyzing Difference-in-Differences Design
https://github.com/naoki-egami/DIDdesign

Last synced: 1 day ago
JSON representation

R package DIDdesign: Analyzing Difference-in-Differences Design

Awesome Lists containing this project

README

        

---
title: "DIDdesign"
output:
md_document:
variant: gfm
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
cache = TRUE,
warning = FALSE,
message = FALSE
)
```

logo


DIDdesign:Analyzing Difference-in-Differences Design

[![R build status](https://github.com/naoki-egami/DIDdesign/workflows/R-CMD-check/badge.svg)](https://github.com/naoki-egami/DIDdesign/actions)

Authors:

- [Naoki Egami](https://naokiegami.com/)
- [Soichiro Yamauchi](https://soichiroy.github.io/)

Reference:

- Egami and Yamauchi (2023) “[Using Multiple Pre-treatment Periods to Improve Difference-in-Differences and Staggered Adoption Designs.](https://doi.org/10.1017/pan.2022.8)” _Political Analysis_ 31(2):195-212.

## Installation Instructions:

- Downloading the most recent version of `DIDdesign` from Github

```r
## need to install `devtools` if necessary
library(devtools)
install_github("naoki-egami/DIDdesign", dependencies = TRUE)
```

## Table of Contents

1. [Overview](#Overview-and-Workflow-in-`DIDdesign`)
2. [Basic DID design with panel data](#The-Basic-Difference-in-Differences-Design-with-Panel-Data)
3. [Basic DID design with repeated cross-section data](#The-Basic-DID-Design-with-Repeated-Cross-sectional-Data)
4. [Staggered adoption design](#Staggered-Adoption-Design)

## Overview and Workflow in `DIDdesign`

### Workflow

- **Step 1**: Assessing the parallel trends assumption via `did_check()`
- View and visualize the output via `plot()` and `summary()`
- **Step 2**: Estimating the treatment effects via `did()`
- View and visualize the output via `plot()` and `summary()`

## The Basic Difference-in-Differences Design with Panel Data

```{r, message=FALSE,warning=FALSE}
## load package
library(DIDdesign)
library(tidyverse)

## load data
data(anzia2012)
```

In the basic DID design, units receive the treatment at the same time.
In `anzia2012` dataset, the treatment assignment happens in 2007.

```{r basic_did_plot, echo=FALSE, dpi=250, warning=FALSE, out.width="85%", message=FALSE}
basic_did <- anzia2012 %>%
select(district, year, oncycle) %>%
pivot_wider(
id_cols = "district", names_prefix = "year",
names_from = year, values_from = oncycle
) %>%
column_to_rownames(var = "district") %>%
arrange(desc(year2007)) %>%
data.matrix()
basic_did <- basic_did[c(1:20, 300:350), ]

## plot
par(mar = c(2, 2, 0.5, 0.5))
image(t(apply(basic_did, 2, rev)),
col = c("lightgray", "#1E88A8"),
xaxt = "n", yaxt = "n"
)
axis(1,
at = seq(0, 1, length.out = ncol(basic_did)),
labels = str_remove(colnames(basic_did), "year"),
las = 1, cex.axis = 0.8, tick = FALSE, mgp = c(0, 0, 0.2)
)
# axis(2, at = seq(0, 1, length.out = nrow(basic_did)),
# labels = rownames(basic_did),
# las = 2, cex.axis = 0.65, tick = FALSE, mgp = c(0, 0, 0.2))
title(ylab = "District", line = 1)
diff_h <- min(diff(seq(0, 1, length.out = nrow(basic_did))))
diff_v <- min(diff(seq(0, 1, length.out = ncol(basic_did))))
abline(h = seq(0, 1, length.out = nrow(basic_did)) + diff_h / 2, lty = 1, col = "white", lwd = 0.3)
abline(v = seq(0, 1, length.out = ncol(basic_did)) + diff_v / 2, lty = 1, col = "white", lwd = 0.3)
legend("topright", legend = c("Treated", "Control"), fill = c("#1E88A8", "lightgray"), bg = "white", cex = 0.8)
box(lwd = 1.1)
```

### Step 1: Assess the parallel trends assumption

As the first step of the double DID method, users can check if the parallel trends assumption is plausible in the pre-treatment periods.
`did_check()` function estimates statistics for testing the parallel trends
and computes the equivalence confidence intervals.

```{r panel_check}
## check parallel trends
set.seed(1234)
check_panel <- did_check(
formula = lnavgsalary_cpi ~ oncycle | teachers_avg_yrs_exper +
ami_pc + asian_pc + black_pc + hisp_pc,
data = anzia2012,
id_unit = "district",
id_time = "year",
option = list(n_boot = 200, parallel = TRUE, lag = 1:3)
)
```

`did_check()` function takes the following arguments:

| Argument | Description |
|:-------- | :------------------------|
|`formula` | A formula specifying variables. It should follow the form of `outcome ~ treatment | covariates`.
-`treatment` should be time-varying, that is, `treatment` takes zero for everyone before the treatment assignment, and takes 1 for units who are treated. See the example for how the treatment variable should be coded.
-`covariates` can be omitted as `outcome ~ treatment`.|
| `data` | A data frame. This can be either `data.frame` or `tibble`. |
| `id_unit` | A variable name in the data that uniquely identifies units (e.g., individuals or states) |
| `id_time` | A variable name in the data that uniquely identifies time (e.g., year). |
| `design` | Design option. It should be `"did"` when the basic DID design is used. |
| `is_panel`| A boolean argument to indicate the type of the data. When the dataset is panel (i.e., same observations are measured repeatedly overtime), it should take `TRUE`. See the next section for how to analyze the repeated cross-section data. |
| `option` | A list of option parameters.
- `n_boot`: Number of bootstrap iterations to estimate weighting matrix.
- `parallel`: A boolean argument. If `TRUE`, bootstrap is conducted in parallel using `future` package.
- `lag`: A vector of positive lag values. For example, when `lag = c(1, 2)`, pre-treatment trends are tested on the period between `t-2` to `t-1` (corresponding to `lag = 1`), and between `t-3` and `t-2` where `t` is when the actual treatment is assigned. Default is `lag = 1`.
- `skip_standardize`: A boolean argument. If `TRUE`, use the original scale of the outcome to compute the equivalence regions. Useful when the outcome of the control group does not have any variation for some time periods. Default is `FALSE`.|

#### Assessing the output from `did_check()`

The output from `did_check()` function can be accessed by `summary()` function, which reports estimates for the pre-treatment parallel trends as well as the 95% standardized equivalence confidence interval.

```{r panel_check_summary}
## view estimates
summary(check_panel)
```

- `estimate` shows the DID estimates on the pre-treatment periods. Deviation from zero suggests the possible violation of the parallel trends assumption.
- `std.error` shows the standard errors for the estimates reported on the `estimate` column.
- `EqCI95_LB` and `EqCI95_UB` show the upper and lower bound on the 95% standardized equivalence confidence intervals. Values are standardized such that they can be interpreted as the standard deviation from the mean of the baseline control group. For example, `EqCI95_LB` for `lag = 1` can be interpreted as `EqCI95_LB` standard deviation away from the mean of the control group at time `t - 2` (`lag = 1` corresponds to the parallel trends check between time `t-2` and `t-1`). Wider intervals suggests the possible violation of the parallel trends assumption.

The output can be also visualized by `plot()` function. By default, it shows a plot for the 95% standardized equivalence confidence intervals on the left, and a plot of the observed trends on the right.
The generated plots are in `ggplot` object, thus users can add additional attributes using functions from `ggplot2`.

```{r panel_check_plot, dpi=250, fig.height = 4, fig.width = 7, out.width="100%", fig.align="center"}
## visualize the estimates
plot(check_panel)
```

- Data used to generate the above plot are available via

```r
## data for the trend-plot
check_panel$plot[[1]]$dat_plot

## data for the equivalence plot
check_panel$plot[[2]]$dat_plot
```
- Individual plots are also available via

```r
## trend plot
check_panel$plot[[1]]$plot

## equivalence plot
check_panel$plot[[2]]$plot
```

### Step 2: Estimate the treatment effect with the double DID estimator

After assessing the parallel trends assumption with `did_check()`,
we can estimate the average treatment effect on the treated (ATT) via `did()`.

```{r panel_fit}
## estimate treatment effect
set.seed(1234)
fit_panel <- did(
formula = lnavgsalary_cpi ~ oncycle | teachers_avg_yrs_exper +
ami_pc + asian_pc + black_pc + hisp_pc,
data = anzia2012,
id_unit = "district",
id_time = "year",
design = "did",
is_panel = TRUE,
option = list(n_boot = 200, parallel = TRUE, lead = 0:2, se_boot = TRUE)
)
```

`did()` function inherits most of the arguments in `did_check()`.
Different from `did_check()`, `did()` takes `lead` parameter in the option argument.

| Argument | Description |
|:-------- | :---------- |
| `lead` | A parameter in `option` argument. It should be a vector of non-negative lead values. For example, when `lead = c(0, 1)`, treatment effect when the treatment is assigned (`lead = 0`) as well as one-time ahead effect (`lead = 1`) will be estimated. Default is `lead = 0`. |

#### Assessing the output from `did()`

Users can obtain the estimates via `summary()` function.

```{r panel_fit_summary}
## view the estimates
summary(fit_panel)
```

- `estimator`
- `Double-DID` shows estimates from the double DID estimator that combines the extended parallel trends assumption. This estimate should be used only when the parallel trends assumption is plausible.
- `DID` shows estimates from the standard DID estimator.
- `sDID` shows estimates from the sequential DID estimator that requires a weaker parallel trends-in-trends assumption. When the parallel trends assumption is not plausible, this estimator should be used.

`plot()` function for the output from `did()` can be used in two ways.
First, it generates a treatment effect plot when the function is provided an output from `did()`.
Second, it generates a plot that adds the pre-treatment trend check in addition to the treatment effect estimates when the function is provided an output from `did()` as well as `did_check()`.

```{r panel_fit_plot, dpi=250, fig.height = 4, fig.width = 7, out.width="100%", fig.align="center"}
# plot only treatment effects
post_plot <- plot(fit_panel)

# plot treatment effects + pre-treatment assessment
pre_post_plot <- plot(fit_panel, check_fit = check_panel)

## show the plots side-by-side
require(patchwork)
(post_plot +
ggplot2::theme(aspect.ratio = 1) +
ggplot2::ylim(-0.015, 0.01) +
ggplot2::labs(title = "Post-Treatment")) +
(pre_post_plot +
ggplot2::theme(aspect.ratio = 1) +
ggplot2::ylim(-0.015, 0.01) +
ggplot2::labs(title = "Pre- and Post-Treatment"))
```

## The Basic DID Design with Repeated Cross-sectional Data

Sometimes, each period consists of different units, instead of repeated observations of the same units.
`did()` can handle such "repeated cross-sectional" data by setting `is_panel = FALSE`.
As an example, we analyze `malesky2014` dataset (see `?malesky2014` for more details on this dataset).

### Step 1: Assess the pre-treatment parallel trends

```{r, rcs_check}
## load data
data(malesky2014)

## check parallel trends
set.seed(1234)
check_rcs <- did_check(
formula = vpost ~ treatment + post_treat | factor(city),
data = malesky2014,
id_time = "year",
is_panel = FALSE,
option = list(n_boot = 200, parallel = TRUE, id_cluster = "id_district", lag = 1)
)
```

`did_check()` and `did()` for repeated cross-sectional data accept a slightly different argument
from the case of panel data.

| Argument | Description |
|:-------- | :---------- |
| `formula`| It should include the post-treatment indicator, in addition to the time-invariant `treatment` variable. Covariates are supported as in the panel case.|
| `is_panel` | It should be `FALSE` to indicate that the data is in the repeated cross-sectional format. |
| `id_cluster` | A parameter for `option` argument. It should be a variable name used to cluster the standard errors.|

```{r rcs_check_summary}
## summary
summary(check_rcs)
```

```{r rcs_check_plot, dpi=250, fig.height = 4, fig.width = 7, out.width="100%", fig.align="center"}
## view plot
plot(check_rcs)
```

### Step 2: Estimate causal effects

```{r rcs_estimate}
## estimate ATT
ff_rcs <- did(
formula = vpost ~ treatment + post_treat | factor(city),
data = malesky2014,
id_time = "year",
is_panel = FALSE,
option = list(
n_boot = 200, parallel = TRUE, se_boot = TRUE,
id_cluster = "id_district", lead = 0
)
)
```

```{r rcs_est_summary}
## view summary
summary(ff_rcs)
```

## The Staggered Adoption Design

`DIDdesign` supports the staggered adoption design where units receive the treatment at different periods of time.
As an example, we analyze `paglayan2019` dataset in the package (see `?paglayan2019` for more details about this dataset).

```{r sa_data, warning=FALSE, message=FALSE}
## data
require(dplyr)
require(tibble)

## format dataset
paglayan2019 <- paglayan2019 %>%
filter(!(state %in% c("WI", "DC"))) %>%
mutate(
id_time = year,
id_subject = as.numeric(as.factor(state)),
log_expenditure = log(pupil_expenditure + 1),
log_salary = log(teacher_salary + 1)
)
```

```{r sa_plot, echo=FALSE, dpi=250, warning=FALSE, message=FALSE, out.width="85%"}
require(tidyr)
require(stringr)

treatment_variation <- paglayan2019 %>%
select(state, year, treatment) %>%
group_by(state) %>%
mutate(treatment_sum = sum(treatment)) %>%
arrange(desc(treatment_sum), state, year) %>%
select(-treatment_sum) %>%
pivot_wider(
id_cols = "state", names_prefix = "year",
names_from = year, values_from = treatment
) %>%
ungroup() %>%
column_to_rownames(var = "state") %>%
data.matrix()

## plot
par(mar = c(2, 2, 0.5, 0.5))
image(t(apply(treatment_variation, 2, rev)),
col = c("lightgray", "#1E88A8"),
xaxt = "n", yaxt = "n"
)
axis(1,
at = seq(0, 1, length.out = ncol(treatment_variation)),
labels = str_remove(colnames(treatment_variation), "year"),
las = 2, cex.axis = 0.8, tick = FALSE, mgp = c(0, 0, 0.2)
)
axis(2,
at = seq(0, 1, length.out = nrow(treatment_variation)),
labels = rownames(treatment_variation),
las = 2, cex.axis = 0.65, tick = FALSE, mgp = c(0, 0, 0.2)
)
diff_h <- min(diff(seq(0, 1, length.out = nrow(treatment_variation))))
diff_v <- min(diff(seq(0, 1, length.out = ncol(treatment_variation))))
abline(h = seq(0, 1, length.out = nrow(treatment_variation)) + diff_h / 2, lty = 1, col = "white", lwd = 0.3)
abline(v = seq(0, 1, length.out = ncol(treatment_variation)) + diff_v / 2, lty = 1, col = "white", lwd = 0.3)
legend("topright", legend = c("Treated", "Control"), fill = c("#1E88A8", "lightgray"), bg = "white", cex = 0.8)
box(lwd = 1.1)
```

As we can see in the above plot, states receive the treatment at different years ranging from 1965 at earliest to 1987 at latest (and some of the states never receive the treatment).

### Step 1: Assess the pre-treatment parallel trends

```{r sa_check}
set.seed(1234)
check_sa <- did_check(
formula = log_expenditure ~ treatment,
data = paglayan2019,
id_unit = "id_subject",
id_time = "id_time",
design = "sa",
option = list(n_boot = 200, parallel = TRUE, thres = 1, lag = 1:5)
)
```

Most of the arguments are common to the case of the basic DID design.
There are a few additional arguments specific to the staggered adoption design.

| Argument | Description |
|:-------- | :---------- |
| `design` | A design argument. It should take `design = "sa"` for the staggered adoption design |
| `thres` | A parameter in the `option` argument. It controls the minimum number of treated units for a particular time to be included in the treatment effect estimation. For example if `thres = 2`, the effect for Tennessee will be removed from the time-average effect because it's the only unit who received the treatment in 1972 (i.e., the number of treated units in 1972 is less than the threshold). |

```{r sa_check_summary}
## view estimates
summary(check_sa)
```

`plot()` function behaves slight differently from the basic DID design.
By default, it plots the treatment variation plot on the right.

```{r sa_check_plot, dpi=250, fig.height = 4, fig.width = 7, out.width="100%", fig.align="center"}
plot(check_sa)
```

### Step 2: Estimate staggered-adoption average treatment effect

`did()` function can handle the staggered adoption design by setting the `design` argument to `design = "sa"`.

```{r sa_estimation}
## estimate time-weighted SA-ATE
set.seed(1234)
fit_sa <- did(
formula = log_expenditure ~ treatment,
data = paglayan2019,
id_unit = "id_subject",
id_time = "id_time",
design = "sa",
option = list(n_boot = 200, parallel = TRUE, thres = 1, lead = 0:9)
)
```

```{r sa_est_summary}
head(summary(fit_sa))
```

```{r sa_fit_plot, dpi=250, fig.height = 4, fig.width = 7, out.width="90%", fig.align="center"}
## plot treatment effects + assessment statistic
sa_plot <- plot(fit_sa, check_sa, band = TRUE)

## show plot
sa_plot +
ggplot2::ylim(-0.1, 0.1) +
ggplot2::geom_vline(xintercept = 0, color = "red", linetype = "dotted")
```