Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/theeconomist/us-potus-model

Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.
https://github.com/theeconomist/us-potus-model

Last synced: 18 days ago
JSON representation

Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.

Awesome Lists containing this project

README

        

---
title: "State and national presidential election forecasting model"
subtitle: "By The Economist with Andrew Gelman and Merlin Heidemanns, Columbia University"
date: "Last update on `r format(Sys.time(), '%A %B %d, %Y at %I:%M %p %Z',tz='US/Eastern')`"
output: github_document
---

Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.

Improving on Pierre Kremp's [implementation](http://www.slate.com/features/pkremp_forecast/report.html) of Drew Linzer's dynamic linear model for election forecasting [(Linzer 2013)](https://votamatic.org/wp-content/uploads/2013/07/Linzer-JASA13.pdf), we (1) add corrections for partisan non-response, survey mode and survey population; (2) use informative state-level priors that update throughout the election year; and (3) specify empirical state-level correlations from political and demographic variables.

You can see the model's predictions for 2020 [here](https://projects.economist.com/us-2020-forecast/president) and read how it works [here](https://projects.economist.com/us-2020-forecast/president/how-this-works).

## File dictionary

In terms of useful files, you should pay attention to the 3 scripts for the 2008, 2012 and 2016 US presidential elections are located in the `scripts/model` directory. There are three R scripts that import data, run models and parse results:

* `final_model_2008.R`
* `final_model_2012.R`
* `final_model_2016.R`

And there are 3 different Stan scripts that will run different versions of our polling aggregate and election forecasting model:

* `poll_model_2020.stan` - the final model we use for the 2020 presidential election
* `poll_model_2020_no_mode_adjustment.stan` - a model that removes the correction for partisan non-response bias in the polls and the adjustments for the mode in which a survey is conducted (live phone, online, other) and its population (adult, likely voter, registered voter)

## Model performance

Here is a graphical summary of the model's performance in 2008, 2012 and 2016.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(boot)
library(pbapply)
library(urbnmapr)
library(ggrepel)
library(DT)

# source('scripts/model/final_2016.R');source('scripts/model/final_2012.R');source('scripts/model/final_2008.R');beepr::beep(2);

```

### 2008

```{r include=F}
## Master variables
RUN_DATE <- min(ymd('2008-11-03'),Sys.Date())

election_day <- ymd("2008-11-03")
start_date <- as.Date("2008-03-01") # Keeping all polls after March 1, 2016

evs_df <- read_csv('data/2012.csv')

# will read in the most recent back-test for 2008
out <- read_rds(sprintf('models/backtest_2008/stan_model_%s_normal_final.rds',RUN_DATE))

# get all the polling data
#setwd(here("data/"))
all_polls <- read_csv('data/all_polls_2008.csv')

# select relevant columns from HufFPost polls
all_polls <- all_polls %>%
dplyr::select(state, pollster, number.of.observations, mode,population,
start.date,
end.date,
obama, mccain, undecided, other)%>%
mutate(end.date = mdy(end.date),
start.date = mdy(start.date))

# make sure we've got nothing from the futuree
all_polls <- all_polls %>%
filter(ymd(end.date) <= RUN_DATE)

# basic mutations
df <- all_polls %>%
tbl_df %>%
rename(n = number.of.observations) %>%
mutate(begin = ymd(start.date),
end = ymd(end.date),
t = end - (1 + as.numeric(end-begin)) %/% 2) %>%
filter(t >= start_date & !is.na(t)
& n > 1)

# pollster mutations
df <- df %>%
mutate(pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),
pollster = replace(pollster, pollster == "Fox News", "FOX"), # Fixing inconsistencies in pollster names
pollster = replace(pollster, pollster == "WashPost", "Washington Post"),
pollster = replace(pollster, pollster == "ABC News", "ABC"),
undecided = ifelse(is.na(undecided), 0, undecided),
other = ifelse(is.na(other), 0, other))

# vote shares etc
df <- df %>%
mutate(two_party_sum = obama + mccain,
polltype = as.integer(as.character(recode(population,
"Likely Voters" = "0",
"Registered Voters" = "1",
"Adults" = "2"))),
n_respondents = round(n),
# obama
n_obama = round(n * obama/100),
pct_obama = obama/two_party_sum,
# mccain
n_mccain = round(n * mccain/100),
pct_mccain = mccain/two_party_sum,
# third-party
n_other = round(n * other/100),
p_other = other/100)

# state info
state_correlation <- read.csv("data/potus_results_76_16.csv") %>%
select(year, state, dem) %>%
group_by(state) %>%
mutate(dem = dem ) %>%
spread(state,dem) %>%
dplyr::select(-year)

#here("data")
state_data <- read.csv("data/potus_results_76_16.csv")

all_state_names <- unique(state_data$state)

# Numerical indices passed to Stan for states, days, weeks, pollsters
df <- df %>%
mutate(poll_day = t - min(t) + 1,
# Factors are alphabetically sorted: 1 = --, 2 = AL, 3 = AK, 4 = AZ...
index_s = as.numeric(factor(as.character(state),
levels = c('--',all_state_names))), # ensure levels are same as all 50 names in sate_correlation
index_t = 1 + as.numeric(t) - min(as.numeric(t)),
index_p = as.numeric(as.factor(as.character(pollster))))

T <- as.integer(round(difftime(election_day, min(df$start.date))))

# selections
df <- df %>%
arrange(state, t, polltype, two_party_sum) %>%
distinct(state, t, pollster, .keep_all = TRUE) %>%
select(
# poll information
state, t, begin, end, pollster, polltype, method = mode, n_respondents,
# vote shares
pct_obama, n_obama,
pct_mccain, n_mccain,
p_other, n_other, poll_day, index_s, index_p, index_t) %>%
mutate(index_s = ifelse(index_s == 1, 52, index_s - 1)) # national index = 51

# state weights
#setwd(here("data/"))
state_elec_results <- read.csv("data/2008.csv",
header = TRUE, stringsAsFactors = FALSE) %>%
mutate(score = obama_count / (obama_count + mccain_count),
national_score = sum(obama_count)/sum(obama_count + mccain_count),
delta = score - national_score,
share_national_vote = (total_count)
/sum(total_count)) %>%
arrange(state)

rownames(state_elec_results) <- state_elec_results$state

# get state incdices
all_states <- state_elec_results$state
state_name <- state_elec_results$state_name
names(state_name) <- state_elec_results$state

# set prior differences
prior_diff_score <- state_elec_results$delta
names(prior_diff_score) <- all_states

# set state weights
state_weights <- c(state_elec_results$share_national_vote / sum(state_elec_results$share_national_vote))
names(state_weights) <- c(state_elec_results$state)

# electoral votes, by state:
ev_state <- state_elec_results$ev
names(ev_state) <- state_elec_results$state

# read in prior
prior_in <- read_csv("data/state_priors_08_12_16.csv") %>%
filter(date <= RUN_DATE) %>%
group_by(state) %>%
arrange(date) %>%
filter(date == max(date)) %>%
select(state,pred) %>%
ungroup() %>%
arrange(state)

mu_b_prior <- logit(prior_in$pred)
names(mu_b_prior) <- prior_in$state
names(mu_b_prior) == names(prior_diff_score) # correct order?
national_mu_prior <- weighted.mean(inv.logit(mu_b_prior), state_weights)

# extract predictions
# states
predicted_score <- rstan::extract(out, pars = "predicted_score")[[1]]

pct_obama <- pblapply(1:dim(predicted_score)[3],
function(x){
# pred is mu_a + mu_b for the past, just mu_b for the future
temp <- predicted_score[,,x]

# put in tibble
tibble(low = apply(temp,2,function(x){(quantile(x,0.025))}),
high = apply(temp,2,function(x){(quantile(x,0.975))}),
mean = apply(temp,2,function(x){(mean(x))}),
prob = apply(temp,2,function(x){(mean(x>0.5))}),
state = x)

}) %>% do.call('bind_rows',.)

pct_obama$state = colnames(state_correlation)[pct_obama$state]

pct_obama <- pct_obama %>%
group_by(state) %>%
mutate(t = row_number() + min(df$begin)) %>%
ungroup()

# national
pct_obama_natl <- pblapply(1:dim(predicted_score)[2],
function(x){
# each row is a day for a particular draw
temp <- predicted_score[x,,] %>% as.data.frame()
names(temp) <- colnames(state_correlation)

# for each row, get weigted natl vote
tibble(natl_vote = apply(temp,MARGIN = 1,function(y){weighted.mean(y,state_weights)})) %>%
mutate(t = row_number() + min(df$begin)) %>%
mutate(draw = x)
}) %>% do.call('bind_rows',.)

pct_obama_natl <- pct_obama_natl %>%
group_by(t) %>%
summarise(low = quantile(natl_vote,0.025),
high = quantile(natl_vote,0.975),
mean = mean(natl_vote),
prob = mean(natl_vote > 0.5)) %>%
mutate(state = '--')

# bind state and national vote
pct_obama <- pct_obama %>%
bind_rows(pct_obama_natl) %>%
arrange(desc(mean))

# look
ex_states <- c('IA','FL','OH','WI','MI','PA','AZ','NC','NH','TX','GA','MN')
pct_obama %>% filter(t == RUN_DATE,state %in% c(ex_states,'--')) %>% mutate(se = (high - mean)/1.96) %>% dplyr::select(-t)

map.2008.gg <- urbnmapr::states %>%
left_join(pct_obama %>% filter(t == max(t)) %>%
select(state_abbv=state,prob)) %>%
ggplot(aes(x=long,y=lat,group=group,fill=prob)) +
geom_polygon(col='gray40') +
coord_map("albers",lat0=39, lat1=45) +
scale_fill_gradient2(name='Democratic win probability',high='blue',low='red',mid='gray90',midpoint=0.5) +
theme_void() +
theme(legend.position = 'top')

# electoral college by simulation
draws <- pblapply(1:dim(predicted_score)[3],
function(x){
# pred is mu_a + mu_b for the past, just mu_b for the future
pct_obama <- predicted_score[,,x]

pct_obama <- pct_obama %>%
as.data.frame() %>%
mutate(draw = row_number()) %>%
gather(t,pct_obama,1:(ncol(.)-1)) %>%
mutate(t = as.numeric(gsub('V','',t)) + min(df$begin),
state = all_state_names[x])


}) %>% do.call('bind_rows',.)

sim_evs <- draws %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
group_by(t,draw) %>%
summarise(dem_ev = sum(ev * (pct_obama > 0.5))) %>%
group_by(t) %>%
summarise(mean_dem_ev = mean(dem_ev),
median_dem_ev = median(dem_ev),
high_dem_ev = quantile(dem_ev,0.975),
low_dem_ev = quantile(dem_ev,0.025),
prob = mean(dem_ev >= 270)) %>%
left_join(pct_obama[pct_obama$state != '--',] %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
group_by(t) %>%
summarise(sum_dem_ev = sum(ev * (prob > 0.5))) )

# add identifier
identifier <- paste0(Sys.Date()," || " , out@model_name)

natl_polls.2008.gg <- pct_obama %>%
filter(state == '--') %>%
left_join(df %>% select(state,t,pct_obama)) %>% # plot over time
# plot
ggplot(.,aes(x=t)) +
geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +
geom_hline(yintercept = 0.5) +
geom_hline(yintercept = national_mu_prior,linetype=2) +
geom_point(aes(y=pct_obama),alpha=0.3) +
geom_line(aes(y=mean)) +
facet_wrap(~state) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +
scale_y_continuous(breaks=seq(0,1,0.02)) +
labs(subtitletitle=sprintf('obama natl pct | mean = %s | p(win) = %s',
round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$mean*100,1),
round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$prob,2)))

natl_evs.2008.gg <- ggplot(sim_evs, aes(x=t)) +
geom_hline(yintercept = 270) +
geom_line(aes(y=median_dem_ev)) +
geom_ribbon(aes(ymin=low_dem_ev,ymax=high_dem_ev),alpha=0.2) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +
labs(subtitletitle=sprintf('obama evs | median = %s | p(win) = %s',
round(sim_evs[sim_evs$t==election_day,]$median_dem_ev),
round(sim_evs[sim_evs$t==election_day,]$prob,2)))

state_polls.2008.gg <- pct_obama %>%
filter(state %in% ex_states) %>%
left_join(df %>% select(state,t,pct_obama)) %>%
left_join(tibble(state = names(mu_b_prior),
prior = inv.logit(mu_b_prior)) ) %>%
ggplot(.,aes(x=t,col=state)) +
geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +
geom_hline(yintercept = 0.5) +
geom_hline(aes(yintercept = prior),linetype=2) +
geom_point(aes(y=pct_obama),alpha=0.3) +
geom_line(aes(y=mean)) +
facet_wrap(~state) +
theme_minimal() +
theme(legend.position = 'top') +
guides(color='none') +
scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +
labs(subtitle='pct_obama state')

# diff from national over time?
state_lean.2008.gg <- pct_obama[pct_obama$state != '--',] %>%
left_join(pct_obama[pct_obama$state=='--',] %>%
select(t,pct_obama_national=mean), by='t') %>%
mutate(diff=mean-pct_obama_national) %>%
group_by(state) %>%
mutate(last_prob = last(prob)) %>%
filter(state %in% ex_states) %>%
ggplot(.,aes(x=t,y=diff,col=state)) +
geom_hline(yintercept=0.0) +
geom_line() +
geom_label_repel(data = . %>%
filter(t==max(t),
state %in% ex_states),
aes(label=state)) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +
scale_y_continuous(breaks=seq(-1,1,0.01)) +
labs(subtitle = identifier)

# brier scores
# https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades
compare <- pct_obama %>%
filter(t==max(t),state!='--') %>%
select(state,obama_win = prob) %>%
mutate(obama_win_actual = ifelse(state %in% c('CA','NV','OR','WA','CO','NM','MN','IL','VA','DC','MD','DE','NJ','CT','RI','MA','NH','VT','NY','HI','ME','MI','IA','OH','PA','WI','FL','NC','IN'),1,0),
diff = (obama_win_actual - obama_win )^2) %>%
left_join(enframe(ev_state) %>% set_names(.,c('state','ev'))) %>%
mutate(ev_weight = ev/(sum(ev)))

briers.2008 <- tibble(outlet='economist (backtest)',
ev_wtd_brier = weighted.mean(compare$diff, compare$ev_weight),
unwtd_brier = mean(compare$diff),
states_correct=sum(round(compare$obama_win) == round(compare$obama_win_actual)))

# compare model estiamte v actual results
model_v_actual.2008 <- pct_obama %>%
filter(t==max(t),state!='--') %>%
left_join(politicaldata::pres_results %>% filter(year == 2008) %>%
mutate(actual = dem/(dem+rep)) %>%
dplyr::select(state,actual)) %>%
filter(state != 'DC')

model_v_actual.2008.rmse <- sqrt(mean(c((model_v_actual.2008$mean - model_v_actual.2008$actual)^2)))

model_v_actual.2008.gg <- ggplot(model_v_actual.2008,aes(x=mean,y=actual,label=state)) +
geom_abline() +
geom_text() +
stat_smooth(method='lm') +
theme_minimal()

# model vs final polls vs prior
predictcompare.2008.gg <- pct_obama %>%
filter(t == max(t),state %in% ex_states) %>%
mutate(se = (high - mean)/1.68) %>%
select(state,model_mean=mean,model_se=se) %>%
left_join(df %>%
filter(t > (max(t)-14),
state %in% ex_states) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%
left_join(enframe(mu_b_prior,'state','prior') %>%
mutate(prior = inv.logit(prior))) %>%
left_join(model_v_actual.2008 %>% dplyr::select(state,actual)) %>%
ggplot(.,aes(y=state)) +
geom_point(aes(x=poll,col='poll')) +
geom_point(aes(x=model_mean,col='model')) +
geom_point(aes(x=prior,col='prior')) +
geom_point(aes(x=actual,col='result')) +
theme_minimal()

# model better than polls?
pct_obama %>%
filter(t == max(t),state %in% ex_states) %>%
mutate(se = (high - mean)/1.68) %>%
select(state,model_mean=mean,model_se=se) %>%
left_join(df %>%
filter(t > (max(t)-14),
state %in% ex_states) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%
left_join(enframe(mu_b_prior,'state','prior') %>%
mutate(prior = inv.logit(prior))) %>%
left_join(model_v_actual.2008 %>% dplyr::select(state,actual)) %>%
summarise(poll_error = mean(abs(poll-actual)),
model_error = mean(abs(model_mean-actual)))

# expected EVs roughly match the stochastic simulations?
final_states <- enframe(mu_b_prior,'state','prior') %>%
left_join(df %>%
filter(t > (max(t)-14),) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%
mutate(avg_poll_less_prior = mean(poll - inv.logit(prior), na.rm=T)) %>%
mutate(poll = ifelse(is.na(poll),
inv.logit(prior) + avg_poll_less_prior,
poll)) %>%
select(state,poll)

# final EV distribution
final_evs <- draws %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
filter(t==max(t)) %>%
group_by(draw) %>%
summarise(dem_ev = sum(ev* (pct_obama > 0.5)))

final_evs.2008.gg <- ggplot(final_evs,aes(x=dem_ev,
fill=ifelse(dem_ev>=270,'Democratic','Republican'))) +
geom_vline(xintercept = 270) +
geom_histogram(binwidth=1) +
theme_minimal() +
theme(legend.position = 'top',
panel.grid.minor = element_blank()) +
scale_fill_manual(name='Electoral College winner',values=c('Democratic'='#3A4EB1','Republican'='#E40A04')) +
labs(x='Democratic electoral college votes',title='2008, Potential Electoral College outcomes',
subtitle=sprintf("p(dem win) = %s | full stan model",round(mean(final_evs$dem_ev>=270),3)))

# get the p value of the actual results
p_values_2008 <- draws %>%
filter(state != '--',
t == election_day) %>%
left_join(politicaldata::pres_results %>% filter(year == 2008) %>%
mutate(actual = dem/(dem+rep)) %>%
dplyr::select(state,actual)) %>%
group_by(state) %>%
summarise(high = quantile(pct_obama, 0.975),
low = quantile(pct_obama, 0.025),
mean = mean(pct_obama),
prob = mean(pct_obama > 0.5),
actual = unique(actual),
p_value =
( 2*sum(pct_obama < actual) + 1 ) /
( 2 * n() + 2) #,
# p_value_2 = mean(ifelse(actual >= mean(pct_obama),
# pct_obama >= actual,
# pct_obama <= actual))
) %>%
arrange(p_value) %>%
mutate(year = 2008,
outside_ci = ifelse(actual > high | actual < low, TRUE,FALSE))

```

#### Map

```{r echo=F,message=F,warning=F}
map.2008.gg
```

#### Final electoral college histogram

```{r echo=F,message=F,warning=F}
final_evs.2008.gg
```

#### National and state polls and the electoral college over time

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .62}
grid.arrange(natl_polls.2008.gg, natl_evs.2008.gg, state_polls.2008.gg,
layout_matrix = rbind(c(1,1,3,3,3),
c(2,2,3,3,3)),
top = identifier
)
```

#### State vs national deltas over time

```{r echo=F,message=F,warning=F}
state_lean.2008.gg
```

#### Model results vs polls vs the prior

```{r echo=F,message=F,warning=F}
predictcompare.2008.gg
```

#### Performance

````{r echo=F,message=F,warning=F}
knitr::kable(briers.2008)
model_v_actual.2008.gg
model_v_actual.2008.rmse

```

#### Predictions for each state

```{r echo=F,message=F,warning=F}
pct_obama %>%
filter(t == max(t)) %>%
mutate(se = (high - mean)/1.68) %>%
mutate(high = round(high,3),
low = round(low,3),
mean = round(mean,3),
prob = round(prob,3),
se = round(se,3)) %>%
select(state,mean,low,high,prob,se) %>%
arrange(abs(mean-0.5)) %>%
knitr::kable()

```

### 2012

```{r include=F}
## Master variables
RUN_DATE <- min(ymd('2012-11-06'),Sys.Date())

election_day <- ymd("2012-11-06")
start_date <- as.Date("2012-03-01") # Keeping all polls after March 1, 2016

evs_df <- read_csv('data/2012.csv')

# will read in the most recent back-test for 2012
out <- read_rds(sprintf('models/backtest_2012/stan_model_%s_normal_final.rds',RUN_DATE))

# get all the polling data
#setwd(here("data/"))
all_polls <- read_csv('data/all_polls_2012.csv')

# select relevant columns from HufFPost polls
all_polls <- all_polls %>%
dplyr::select(state, pollster, number.of.observations, mode,population,
start.date,
end.date,
obama, romney, undecided, other)%>%
mutate(end.date = mdy(end.date),
start.date = mdy(start.date))

# make sure we've got nothing from the futuree
all_polls <- all_polls %>%
filter(ymd(end.date) <= RUN_DATE)

# basic mutations
df <- all_polls %>%
tbl_df %>%
rename(n = number.of.observations) %>%
mutate(begin = ymd(start.date),
end = ymd(end.date),
t = end - (1 + as.numeric(end-begin)) %/% 2) %>%
filter(t >= start_date & !is.na(t)
& n > 1)

# pollster mutations
df <- df %>%
mutate(pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),
pollster = replace(pollster, pollster == "Fox News", "FOX"), # Fixing inconsistencies in pollster names
pollster = replace(pollster, pollster == "WashPost", "Washington Post"),
pollster = replace(pollster, pollster == "ABC News", "ABC"),
undecided = ifelse(is.na(undecided), 0, undecided),
other = ifelse(is.na(other), 0, other))

# vote shares etc
df <- df %>%
mutate(two_party_sum = obama + romney,
polltype = as.integer(as.character(recode(population,
"Likely Voters" = "0",
"Registered Voters" = "1",
"Adults" = "2"))),
n_respondents = round(n),
# obama
n_obama = round(n * obama/100),
pct_obama = obama/two_party_sum,
# romney
n_romney = round(n * romney/100),
pct_romney = romney/two_party_sum,
# third-party
n_other = round(n * other/100),
p_other = other/100)

# state info

#here("data")
state_data <- read.csv("data/potus_results_76_16.csv")

all_state_names <- unique(state_data$state)

# Numerical indices passed to Stan for states, days, weeks, pollsters
df <- df %>%
mutate(poll_day = t - min(t) + 1,
# Factors are alphabetically sorted: 1 = --, 2 = AL, 3 = AK, 4 = AZ...
index_s = as.numeric(factor(as.character(state),
levels = c('--',all_state_names))), # ensure levels are same as all 50 names in sate_correlation
index_t = 1 + as.numeric(t) - min(as.numeric(t)),
index_p = as.numeric(as.factor(as.character(pollster))))

T <- as.integer(round(difftime(election_day, min(df$start.date))))

# selections
df <- df %>%
arrange(state, t, polltype, two_party_sum) %>%
distinct(state, t, pollster, .keep_all = TRUE) %>%
select(
# poll information
state, t, begin, end, pollster, polltype, method = mode, n_respondents,
# vote shares
pct_obama, n_obama,
pct_romney, n_romney,
p_other, n_other, poll_day, index_s, index_p, index_t) %>%
mutate(index_s = ifelse(index_s == 1, 52, index_s - 1)) # national index = 51

# state weights
#setwd(here("data/"))
state_elec_results <- read.csv("data/2012.csv",
header = TRUE, stringsAsFactors = FALSE) %>%
mutate(score = obama_count / (obama_count + romney_count),
national_score = sum(obama_count)/sum(obama_count + romney_count),
delta = score - national_score,
share_national_vote = (total_count)
/sum(total_count)) %>%
arrange(state)

rownames(state_elec_results) <- state_elec_results$state

# get state incdices
all_states <- state_elec_results$state
state_name <- state_elec_results$state_name
names(state_name) <- state_elec_results$state

# set prior differences
prior_diff_score <- state_elec_results$delta
names(prior_diff_score) <- all_states

# set state weights
state_weights <- c(state_elec_results$share_national_vote / sum(state_elec_results$share_national_vote))
names(state_weights) <- c(state_elec_results$state)

# electoral votes, by state:
ev_state <- state_elec_results$ev
names(ev_state) <- state_elec_results$state

# read in prior
prior_in <- read_csv("data/state_priors_08_12_16.csv") %>%
filter(date <= RUN_DATE) %>%
group_by(state) %>%
arrange(date) %>%
filter(date == max(date)) %>%
select(state,pred) %>%
ungroup() %>%
arrange(state)

mu_b_prior <- logit(prior_in$pred)
names(mu_b_prior) <- prior_in$state
names(mu_b_prior) == names(prior_diff_score) # correct order?
national_mu_prior <- weighted.mean(inv.logit(mu_b_prior), state_weights)

# extract predictions
predicted_score <- rstan::extract(out, pars = "predicted_score")[[1]]

# states
pct_obama <- pblapply(1:dim(predicted_score)[3],
function(x){
# pred is mu_a + mu_b for the past, just mu_b for the future
temp <- predicted_score[,,x]

# put in tibble
tibble(low = apply(temp,2,function(x){(quantile(x,0.025))}),
high = apply(temp,2,function(x){(quantile(x,0.975))}),
mean = apply(temp,2,function(x){(mean(x))}),
prob = apply(temp,2,function(x){(mean(x>0.5))}),
state = x)

}) %>% do.call('bind_rows',.)

pct_obama$state = colnames(state_correlation)[pct_obama$state]

pct_obama <- pct_obama %>%
group_by(state) %>%
mutate(t = row_number() + min(df$begin)) %>%
ungroup()

# national
pct_obama_natl <- pblapply(1:dim(predicted_score)[2],
function(x){
# each row is a day for a particular draw
temp <- predicted_score[x,,] %>% as.data.frame()
names(temp) <- colnames(state_correlation)

# for each row, get weigted natl vote
tibble(natl_vote = apply(temp,MARGIN = 1,function(y){weighted.mean(y,state_weights)})) %>%
mutate(t = row_number() + min(df$begin)) %>%
mutate(draw = x)
}) %>% do.call('bind_rows',.)

pct_obama_natl <- pct_obama_natl %>%
group_by(t) %>%
summarise(low = quantile(natl_vote,0.025),
high = quantile(natl_vote,0.975),
mean = mean(natl_vote),
prob = mean(natl_vote > 0.5)) %>%
mutate(state = '--')

# bind state and national vote
pct_obama <- pct_obama %>%
bind_rows(pct_obama_natl) %>%
arrange(desc(mean))

# look
ex_states <- c('IA','FL','OH','WI','MI','PA','AZ','NC','NH','TX','GA','MN')
pct_obama %>% filter(t == RUN_DATE,state %in% c(ex_states,'--')) %>% mutate(se = (high - mean)/1.96) %>% dplyr::select(-t)

map.2012.gg <- urbnmapr::states %>%
left_join(pct_obama %>% filter(t == max(t)) %>%
select(state_abbv=state,prob)) %>%
ggplot(aes(x=long,y=lat,group=group,fill=prob)) +
geom_polygon(col='gray40') +
coord_map("albers",lat0=39, lat1=45) +
scale_fill_gradient2(name='Democratic win probability',high='blue',low='red',mid='gray90',midpoint=0.5) +
theme_void() +
theme(legend.position = 'top')

# electoral college by simulation
draws <- pblapply(1:dim(predicted_score)[3],
function(x){
# pred is mu_a + mu_b for the past, just mu_b for the future
pct_obama <- predicted_score[,,x]

pct_obama <- pct_obama %>%
as.data.frame() %>%
mutate(draw = row_number()) %>%
gather(t,pct_obama,1:(ncol(.)-1)) %>%
mutate(t = as.numeric(gsub('V','',t)) + min(df$begin),
state = all_state_names[x])


}) %>% do.call('bind_rows',.)

sim_evs <- draws %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
group_by(t,draw) %>%
summarise(dem_ev = sum(ev * (pct_obama > 0.5))) %>%
group_by(t) %>%
summarise(mean_dem_ev = mean(dem_ev),
median_dem_ev = median(dem_ev),
high_dem_ev = quantile(dem_ev,0.975),
low_dem_ev = quantile(dem_ev,0.025),
prob = mean(dem_ev >= 270)) %>%
left_join(pct_obama[pct_obama$state != '--',] %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
group_by(t) %>%
summarise(sum_dem_ev = sum(ev * (prob > 0.5))) )

# add identifier
identifier <- paste0(Sys.Date()," || " , out@model_name)

natl_polls.2012.gg <- pct_obama %>%
filter(state == '--') %>%
left_join(df %>% select(state,t,pct_obama)) %>% # plot over time
# plot
ggplot(.,aes(x=t)) +
geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +
geom_hline(yintercept = 0.5) +
geom_hline(yintercept = national_mu_prior,linetype=2) +
geom_point(aes(y=pct_obama),alpha=0.3) +
geom_line(aes(y=mean)) +
facet_wrap(~state) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +
scale_y_continuous(breaks=seq(0,1,0.02)) +
labs(subtitletitle=sprintf('obama natl pct | mean = %s | p(win) = %s',
round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$mean*100,1),
round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$prob,2)))

natl_evs.2012.gg <- ggplot(sim_evs, aes(x=t)) +
geom_hline(yintercept = 270) +
geom_line(aes(y=median_dem_ev)) +
geom_ribbon(aes(ymin=low_dem_ev,ymax=high_dem_ev),alpha=0.2) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +
labs(subtitletitle=sprintf('obama evs | median = %s | p(win) = %s',
round(sim_evs[sim_evs$t==election_day,]$median_dem_ev),
round(sim_evs[sim_evs$t==election_day,]$prob,2)))

state_polls.2012.gg <- pct_obama %>%
filter(state %in% ex_states) %>%
left_join(df %>% select(state,t,pct_obama)) %>%
left_join(tibble(state = names(mu_b_prior),
prior = inv.logit(mu_b_prior)) ) %>%
ggplot(.,aes(x=t,col=state)) +
geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +
geom_hline(yintercept = 0.5) +
geom_hline(aes(yintercept = prior),linetype=2) +
geom_point(aes(y=pct_obama),alpha=0.3) +
geom_line(aes(y=mean)) +
facet_wrap(~state) +
theme_minimal() +
theme(legend.position = 'top') +
guides(color='none') +
scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +
labs(subtitle='pct_obama state')

# diff from national over time?
state_lean.2012.gg <- pct_obama[pct_obama$state != '--',] %>%
left_join(pct_obama[pct_obama$state=='--',] %>%
select(t,pct_obama_national=mean), by='t') %>%
mutate(diff=mean-pct_obama_national) %>%
group_by(state) %>%
mutate(last_prob = last(prob)) %>%
filter(state %in% ex_states) %>%
ggplot(.,aes(x=t,y=diff,col=state)) +
geom_hline(yintercept=0.0) +
geom_line() +
geom_label_repel(data = . %>%
filter(t==max(t),
state %in% ex_states),
aes(label=state)) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +
scale_y_continuous(breaks=seq(-1,1,0.01)) +
labs(subtitle = identifier)

# brier scores
# https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades
compare <- pct_obama %>%
filter(t==max(t),state!='--') %>%
select(state,obama_win = prob) %>%
mutate(obama_win_actual = ifelse(state %in% c('CA','NV','OR','WA','CO','NM','MN','IL','VA','DC','MD','DE','NJ','CT','RI','MA','NH','VT','NY','HI','ME','MI','IA','OH','PA','WI','FL'),1,0),
diff = (obama_win_actual - obama_win )^2) %>%
left_join(enframe(ev_state) %>% set_names(.,c('state','ev'))) %>%
mutate(ev_weight = ev/(sum(ev)))

briers.2012 <- tibble(outlet = c('Linzer','Wang/Ferguson','Silver/538','Jackman/Pollster','Desart/Holbrook',
'Intrade','Enten/Margin of Error'),
ev_wtd_brier = rep(NA,7),
unwtd_brier = c(0.0038,0.00761,0.00911,0.00971,0.01605,0.02812,0.05075),
) %>%
bind_rows(tibble(outlet='economist (backtest)',
ev_wtd_brier = weighted.mean(compare$diff, compare$ev_weight),
unwtd_brier = mean(compare$diff),
states_correct=sum(round(compare$obama_win) == round(compare$obama_win_actual)))) %>%
arrange(unwtd_brier)

# final v result
model_v_actual.2012 <- pct_obama %>%
filter(t==max(t),state!='--') %>%
left_join(politicaldata::pres_results %>% filter(year == 2012) %>%
mutate(actual = dem/(dem+rep)) %>%
dplyr::select(state,actual)) %>%
filter(state != 'DC')

model_v_actual.2012.rmse <- sqrt(mean(c((model_v_actual.2012$mean - model_v_actual.2012$actual)^2)))

model_v_actual.2012.gg <- ggplot(model_v_actual.2012,aes(x=mean,y=actual,label=state)) +
geom_abline() +
geom_text() +
stat_smooth(method='lm') +
theme_minimal()

# model vs final polls vs prior
predictcompare.2012.gg <- pct_obama %>%
filter(t == max(t),state %in% ex_states) %>%
mutate(se = (high - mean)/1.68) %>%
select(state,model_mean=mean,model_se=se) %>%
left_join(df %>%
filter(t > (max(t)-14),
state %in% ex_states) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%
left_join(enframe(mu_b_prior,'state','prior') %>%
mutate(prior = inv.logit(prior))) %>%
left_join(model_v_actual.2012 %>% dplyr::select(state,actual)) %>%
ggplot(.,aes(y=state)) +
geom_point(aes(x=poll,col='poll')) +
geom_point(aes(x=model_mean,col='model')) +
geom_point(aes(x=prior,col='prior')) +
geom_point(aes(x=actual,col='result')) +
theme_minimal()

# model better than polls?
pct_obama %>%
filter(t == max(t),state %in% ex_states) %>%
mutate(se = (high - mean)/1.68) %>%
select(state,model_mean=mean,model_se=se) %>%
left_join(df %>%
filter(t > (max(t)-14),
state %in% ex_states) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%
left_join(enframe(mu_b_prior,'state','prior') %>%
mutate(prior = inv.logit(prior))) %>%
left_join(model_v_actual.2012 %>% dplyr::select(state,actual)) %>%
summarise(poll_error = mean(abs(poll-actual)),
model_error = mean(abs(model_mean-actual)))

# expected EVs roughly match the stochastic simulations?
final_states <- enframe(mu_b_prior,'state','prior') %>%
left_join(df %>%
filter(t > (max(t)-14),) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%
mutate(avg_poll_less_prior = mean(poll - inv.logit(prior), na.rm=T)) %>%
mutate(poll = ifelse(is.na(poll),
inv.logit(prior) + avg_poll_less_prior,
poll)) %>%
select(state,poll)

# final EV distribution
final_evs <- draws %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
filter(t==max(t)) %>%
group_by(draw) %>%
summarise(dem_ev = sum(ev* (pct_obama > 0.5)))

final_evs.2012.gg <- ggplot(final_evs,aes(x=dem_ev,
fill=ifelse(dem_ev>=270,'Democratic','Republican'))) +
geom_vline(xintercept = 270) +
geom_histogram(binwidth=1) +
theme_minimal() +
theme(legend.position = 'top',
panel.grid.minor = element_blank()) +
scale_fill_manual(name='Electoral College winner',values=c('Democratic'='#3A4EB1','Republican'='#E40A04')) +
labs(x='Democratic electoral college votes',title='2012, Potential Electoral College outcomes',
subtitle=sprintf("p(dem win) = %s | full stan model",round(mean(final_evs$dem_ev>=270),3)))

# get the p value of the actual results
p_values_2012 <- draws %>%
filter(state != '--',
t == election_day) %>%
left_join(politicaldata::pres_results %>% filter(year == 2012) %>%
mutate(actual = dem/(dem+rep)) %>%
dplyr::select(state,actual)) %>%
group_by(state) %>%
summarise(high = quantile(pct_obama, 0.975),
low = quantile(pct_obama, 0.025),
mean = mean(pct_obama),
prob = mean(pct_obama > 0.5),
actual = unique(actual),
p_value =
( 2*sum(pct_obama < actual) + 1 ) /
( 2 * n() + 2) #,
# p_value_2 = mean(ifelse(actual >= mean(pct_obama),
# pct_obama >= actual,
# pct_obama <= actual))
) %>%
arrange(p_value) %>%
mutate(year = 2012,
outside_ci = ifelse(actual > high | actual < low, TRUE,FALSE))

```

#### Map

```{r echo=F,message=F,warning=F}
map.2012.gg
```

#### Final electoral college histogram

```{r echo=F,message=F,warning=F}
final_evs.2012.gg
```

#### National and state polls and the electoral college over time

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .62}
grid.arrange(natl_polls.2012.gg, natl_evs.2012.gg, state_polls.2012.gg,
layout_matrix = rbind(c(1,1,3,3,3),
c(2,2,3,3,3)),
top = identifier
)
```

#### State vs national deltas over time

```{r echo=F,message=F,warning=F}
state_lean.2012.gg
```

#### Model results vs polls vs the prior

```{r echo=F,message=F,warning=F}
predictcompare.2012.gg
```

#### Performance

````{r echo=F,message=F,warning=F}

knitr::kable(briers.2012)
model_v_actual.2012.gg
model_v_actual.2012.rmse
```

#### Predictions for each state

```{r echo=F,message=F,warning=F}
pct_obama %>%
filter(t == max(t)) %>%
mutate(se = (high - mean)/1.68) %>%
mutate(high = round(high,3),
low = round(low,3),
mean = round(mean,3),
prob = round(prob,3),
se = round(se,3)) %>%
select(state,mean,low,high,prob,se) %>%
arrange(abs(mean-0.5)) %>%
knitr::kable()

```

### 2016

```{r include=F}
## Master variables
RUN_DATE <- min(ymd('2016-11-08'),Sys.Date())

election_day <- ymd("2016-11-08")
start_date <- as.Date("2016-03-01") # Keeping all polls after March 1, 2016

evs_df <- read_csv('data/2012.csv')

# will read in the most recent back-test for 2012
out <- read_rds(sprintf('models/stan_model_%s_normal_final.rds',RUN_DATE))

# get all the polling data
#setwd(here("data/"))
all_polls <- read_csv('data/all_polls.csv')

# select relevant columns from HufFPost polls
all_polls <- all_polls %>%
dplyr::select(state, pollster, number.of.observations, mode,population,
start.date,
end.date,
clinton, trump, undecided, other)%>%
mutate(end.date = ymd(end.date),
start.date = ymd(start.date))

# make sure we've got nothing from the futuree
all_polls <- all_polls %>%
filter(ymd(end.date) <= RUN_DATE)

# basic mutations
df <- all_polls %>%
tbl_df %>%
rename(n = number.of.observations) %>%
mutate(begin = ymd(start.date),
end = ymd(end.date),
t = end - (1 + as.numeric(end-begin)) %/% 2) %>%
filter(t >= start_date & !is.na(t)
& n > 1)

# pollster mutations
df <- df %>%
mutate(pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),
pollster = replace(pollster, pollster == "Fox News", "FOX"), # Fixing inconsistencies in pollster names
pollster = replace(pollster, pollster == "WashPost", "Washington Post"),
pollster = replace(pollster, pollster == "ABC News", "ABC"),
undecided = ifelse(is.na(undecided), 0, undecided),
other = ifelse(is.na(other), 0, other))

# vote shares etc
df <- df %>%
mutate(two_party_sum = clinton + trump,
polltype = as.integer(as.character(recode(population,
"Likely Voters" = "0",
"Registered Voters" = "1",
"Adults" = "2"))),
n_respondents = round(n),
# clinton
n_clinton = round(n * clinton/100),
pct_clinton = clinton/two_party_sum,
# trump
n_trump = round(n * trump/100),
pct_trump = trump/two_party_sum,
# third-party
n_other = round(n * other/100),
p_other = other/100)

# state info
state_correlation <- read.csv("data/potus_results_76_16.csv") %>%
select(year, state, dem) %>%
group_by(state) %>%
mutate(dem = dem ) %>%
spread(state,dem) %>%
dplyr::select(-year)

#here("data")
state_data <- read.csv("data/potus_results_76_16.csv")

all_state_names <- unique(state_data$state)

# Numerical indices passed to Stan for states, days, weeks, pollsters
df <- df %>%
mutate(poll_day = t - min(t) + 1,
# Factors are alphabetically sorted: 1 = --, 2 = AL, 3 = AK, 4 = AZ...
index_s = as.numeric(factor(as.character(state),
levels = c('--',all_state_names))), # ensure levels are same as all 50 names in sate_correlation
index_t = 1 + as.numeric(t) - min(as.numeric(t)),
index_p = as.numeric(as.factor(as.character(pollster))))

T <- as.integer(round(difftime(election_day, min(df$start.date))))

# selections
df <- df %>%
arrange(state, t, polltype, two_party_sum) %>%
distinct(state, t, pollster, .keep_all = TRUE) %>%
select(
# poll information
state, t, begin, end, pollster, polltype, method = mode, n_respondents,
# vote shares
pct_clinton, n_clinton,
pct_trump, n_trump,
p_other, n_other, poll_day, index_s, index_p, index_t) %>%
mutate(index_s = ifelse(index_s == 1, 52, index_s - 1)) # national index = 51

# state weights
#setwd(here("data/"))
state_elec_results <- read.csv("data/2012.csv",
header = TRUE, stringsAsFactors = FALSE) %>%
mutate(score = obama_count / (obama_count + romney_count),
national_score = sum(obama_count)/sum(obama_count + romney_count),
delta = score - national_score,
share_national_vote = (total_count)
/sum(total_count)) %>%
arrange(state)

rownames(state_elec_results) <- state_elec_results$state

# get state incdices
all_states <- state_elec_results$state
state_name <- state_elec_results$state_name
names(state_name) <- state_elec_results$state

# set prior differences
prior_diff_score <- state_elec_results$delta
names(prior_diff_score) <- all_states

# set state weights
state_weights <- c(state_elec_results$share_national_vote / sum(state_elec_results$share_national_vote))
names(state_weights) <- c(state_elec_results$state)

# electoral votes, by state:
ev_state <- state_elec_results$ev
names(ev_state) <- state_elec_results$state

# read in prior
prior_in <- read_csv("data/state_priors_08_12_16.csv") %>%
filter(date <= RUN_DATE) %>%
group_by(state) %>%
arrange(date) %>%
filter(date == max(date)) %>%
select(state,pred) %>%
ungroup() %>%
arrange(state)

mu_b_prior <- logit(prior_in$pred)
names(mu_b_prior) <- prior_in$state
names(mu_b_prior) == names(prior_diff_score) # correct order?
national_mu_prior <- weighted.mean(inv.logit(mu_b_prior), state_weights)

# extract predictions
predicted_score <- rstan::extract(out, pars = "predicted_score")[[1]]

# states
pct_clinton <- pblapply(1:dim(predicted_score)[3],
function(x){
# pred is mu_a + mu_b for the past, just mu_b for the future
temp <- predicted_score[,,x]

# put in tibble
tibble(low = apply(temp,2,function(x){(quantile(x,0.025))}),
high = apply(temp,2,function(x){(quantile(x,0.975))}),
mean = apply(temp,2,function(x){(mean(x))}),
prob = apply(temp,2,function(x){(mean(x>0.5))}),
state = x)

}) %>% do.call('bind_rows',.)

pct_clinton$state = colnames(state_correlation)[pct_clinton$state]

pct_clinton <- pct_clinton %>%
group_by(state) %>%
mutate(t = row_number() + min(df$begin)) %>%
ungroup()

# national
pct_clinton_natl <- pblapply(1:dim(predicted_score)[2],
function(x){
# each row is a day for a particular draw
temp <- predicted_score[x,,] %>% as.data.frame()
names(temp) <- colnames(state_correlation)

# for each row, get weigted natl vote
tibble(natl_vote = apply(temp,MARGIN = 1,function(y){weighted.mean(y,state_weights)})) %>%
mutate(t = row_number() + min(df$begin)) %>%
mutate(draw = x)
}) %>% do.call('bind_rows',.)

pct_clinton_natl <- pct_clinton_natl %>%
group_by(t) %>%
summarise(low = quantile(natl_vote,0.025),
high = quantile(natl_vote,0.975),
mean = mean(natl_vote),
prob = mean(natl_vote > 0.5)) %>%
mutate(state = '--')

# bind state and national vote
pct_clinton <- pct_clinton %>%
bind_rows(pct_clinton_natl) %>%
arrange(desc(mean))

# look
ex_states <- c('IA','FL','OH','WI','MI','PA','AZ','NC','NH','TX','GA','MN')
pct_clinton %>% filter(t == RUN_DATE,state %in% c(ex_states,'--')) %>% mutate(se = (high - mean)/1.96) %>% dplyr::select(-t)

map.2016.gg <- urbnmapr::states %>%
left_join(pct_clinton %>% filter(t == max(t)) %>%
select(state_abbv=state,prob)) %>%
ggplot(aes(x=long,y=lat,group=group,fill=prob)) +
geom_polygon(col='gray40') +
coord_map("albers",lat0=39, lat1=45) +
scale_fill_gradient2(name='Democratic win probability',high='blue',low='red',mid='gray90',midpoint=0.5) +
theme_void() +
theme(legend.position = 'top')

# electoral college by simulation
draws <- pblapply(1:dim(predicted_score)[3],
function(x){
# pred is mu_a + mu_b for the past, just mu_b for the future
pct_clinton <- predicted_score[,,x]

pct_clinton <- pct_clinton %>%
as.data.frame() %>%
mutate(draw = row_number()) %>%
gather(t,pct_clinton,1:(ncol(.)-1)) %>%
mutate(t = as.numeric(gsub('V','',t)) + min(df$begin),
state = all_state_names[x])


}) %>% do.call('bind_rows',.)

sim_evs <- draws %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
group_by(t,draw) %>%
summarise(dem_ev = sum(ev * (pct_clinton > 0.5))) %>%
group_by(t) %>%
summarise(mean_dem_ev = mean(dem_ev),
median_dem_ev = median(dem_ev),
high_dem_ev = quantile(dem_ev,0.975),
low_dem_ev = quantile(dem_ev,0.025),
prob = mean(dem_ev >= 270)) %>%
left_join(pct_clinton[pct_clinton$state != '--',] %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
group_by(t) %>%
summarise(sum_dem_ev = sum(ev * (prob > 0.5))) )

# add identifier
identifier <- paste0(Sys.Date()," || " , out@model_name)

natl_polls.2016.gg <- pct_clinton %>%
filter(state == '--') %>%
left_join(df %>% select(state,t,pct_clinton,method)) %>% # plot over time
# plot
ggplot(.,aes(x=t)) +
geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +
geom_hline(yintercept = 0.5) +
geom_hline(yintercept = national_mu_prior,linetype=2) +
geom_point(aes(y=pct_clinton,shape=method),alpha=0.3) +
geom_line(aes(y=mean)) +
facet_wrap(~state) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +
scale_y_continuous(breaks=seq(0,1,0.02)) +
labs(subtitletitle=sprintf('clinton natl pct | mean = %s | p(win) = %s',
round(pct_clinton[pct_clinton$state=='--' & pct_clinton$t==election_day,]$mean*100,1),
round(pct_clinton[pct_clinton$state=='--' & pct_clinton$t==election_day,]$prob,2)))

natl_evs.2016.gg <- ggplot(sim_evs, aes(x=t)) +
geom_hline(yintercept = 270) +
geom_line(aes(y=median_dem_ev)) +
geom_ribbon(aes(ymin=low_dem_ev,ymax=high_dem_ev),alpha=0.2) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +
labs(subtitletitle=sprintf('clinton evs | median = %s | p(win) = %s',
round(sim_evs[sim_evs$t==election_day,]$median_dem_ev),
round(sim_evs[sim_evs$t==election_day,]$prob,2)))

state_polls.2016.gg <- pct_clinton %>%
filter(state %in% ex_states) %>%
left_join(df %>% select(state,t,pct_clinton,method)) %>%
left_join(tibble(state = names(mu_b_prior),
prior = inv.logit(mu_b_prior)) ) %>%
ggplot(.,aes(x=t,col=state)) +
geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +
geom_hline(yintercept = 0.5) +
geom_hline(aes(yintercept = prior),linetype=2) +
geom_point(aes(y=pct_clinton,shape=method),alpha=0.3) +
geom_line(aes(y=mean)) +
facet_wrap(~state) +
theme_minimal() +
theme(legend.position = 'top') +
guides(color='none') +
scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +
labs(subtitle='pct_clinton state')

# diff from national over time?
state_lean.2016.gg <- pct_clinton[pct_clinton$state != '--',] %>%
left_join(pct_clinton[pct_clinton$state=='--',] %>%
select(t,pct_clinton_national=mean), by='t') %>%
mutate(diff=mean-pct_clinton_national) %>%
group_by(state) %>%
mutate(last_prob = last(prob)) %>%
filter(state %in% ex_states) %>%
ggplot(.,aes(x=t,y=diff,col=state)) +
geom_hline(yintercept=0.0) +
geom_line() +
geom_label_repel(data = . %>%
filter(t==max(t),
state %in% ex_states),
aes(label=state)) +
theme_minimal() +
theme(legend.position = 'none') +
scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +
scale_y_continuous(breaks=seq(-1,1,0.01)) +
labs(subtitle = identifier)

# brier scores
# https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades
compare <- pct_clinton %>%
filter(t==max(t),state!='--') %>%
select(state,clinton_win=prob) %>%
mutate(clinton_win_actual = ifelse(state %in% c('CA','NV','OR','WA','CO','NM','MN','IL','VA','DC','MD','DE','NJ','CT','RI','MA','NH','VT','NY','HI','ME'),1,0),
diff = (clinton_win_actual - clinton_win )^2) %>% left_join(enframe(ev_state) %>% set_names(.,c('state','ev'))) %>%
mutate(ev_weight = ev/(sum(ev)))

briers.2016 <- tibble(outlet = c('538 polls-plus','538 polls-only','princeton','nyt upshot','kremp/slate','pollsavvy','predictwise markets','predictwise overall','desart and holbrook','daily kos','huffpost'),
ev_wtd_brier = c(0.0928,0.0936,0.1169,0.1208,0.121,0.1219,0.1272,0.1276,0.1279,0.1439,0.1505),
unwtd_brier = c(0.0664,0.0672,0.0744,0.0801,0.0766,0.0794,0.0767,0.0783,0.0825,0.0864,0.0892),
states_correct = c(46,46,47,46,46,46,46,46,44,46,46)) %>%
bind_rows(tibble(outlet='economist (backtest)',
ev_wtd_brier = weighted.mean(compare$diff, compare$ev_weight),
unwtd_brier = mean(compare$diff),
states_correct=sum(round(compare$clinton_win) == round(compare$clinton_win_actual)))) %>%
arrange(ev_wtd_brier)

# compare model estiamte v actual results
model_v_actual.2016 <- pct_clinton %>%
filter(t==max(t),state!='--') %>%
left_join(politicaldata::pres_results %>% filter(year == 2016) %>%
mutate(actual = dem/(dem+rep)) %>%
dplyr::select(state,actual)) %>%
filter(state != 'DC')

model_v_actual.2016.rmse <- sqrt(mean(c((model_v_actual.2016$mean - model_v_actual.2016$actual)^2)))

model_v_actual.2016.gg <- ggplot(model_v_actual.2016,aes(x=mean,y=actual,label=state)) +
geom_abline() +
geom_text() +
stat_smooth(method='lm') +
theme_minimal()

# model vs final polls vs prior
predictcompare.2016.gg <- pct_clinton %>%
filter(t == max(t),state %in% ex_states) %>%
mutate(se = (high - mean)/1.68) %>%
select(state,model_mean=mean,model_se=se) %>%
left_join(df %>%
filter(t > (max(t)-14),
state %in% ex_states) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_clinton,n_respondents))) %>%
left_join(enframe(mu_b_prior,'state','prior') %>%
mutate(prior = inv.logit(prior))) %>%
left_join(model_v_actual.2016 %>% dplyr::select(state,actual)) %>%
ggplot(.,aes(y=state)) +
geom_point(aes(x=poll,col='poll')) +
geom_point(aes(x=model_mean,col='model')) +
geom_point(aes(x=prior,col='prior')) +
geom_point(aes(x=actual,col='result')) +
theme_minimal()

# model better than polls?
pct_clinton %>%
filter(t == max(t),state %in% ex_states) %>%
mutate(se = (high - mean)/1.68) %>%
select(state,model_mean=mean,model_se=se) %>%
left_join(df %>%
filter(t > (max(t)-14),
state %in% ex_states) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_clinton,n_respondents))) %>%
left_join(enframe(mu_b_prior,'state','prior') %>%
mutate(prior = inv.logit(prior))) %>%
left_join(model_v_actual.2016 %>% dplyr::select(state,actual)) %>%
summarise(poll_error = mean(abs(poll-actual)),
model_error = mean(abs(model_mean-actual)))

# expected EVs roughly match the stochastic simulations?
final_states <- enframe(mu_b_prior,'state','prior') %>%
left_join(df %>%
filter(t > (max(t)-14),) %>%
group_by(state) %>%
summarise(poll = weighted.mean(pct_clinton,n_respondents))) %>%
mutate(avg_poll_less_prior = mean(poll - inv.logit(prior), na.rm=T)) %>%
mutate(poll = ifelse(is.na(poll),
inv.logit(prior) + avg_poll_less_prior,
poll)) %>%
select(state,poll)

# final EV distribution
final_evs <- draws %>%
left_join(state_elec_results %>% select(state,ev),by='state') %>%
filter(t==max(t)) %>%
group_by(draw) %>%
summarise(dem_ev = sum(ev* (pct_clinton > 0.5)))

final_evs.2016.gg <- ggplot(final_evs,aes(x=dem_ev,
fill=ifelse(dem_ev>=270,'Democratic','Republican'))) +
geom_vline(xintercept = 270) +
geom_histogram(binwidth=1) +
theme_minimal() +
theme(legend.position = 'top',
panel.grid.minor = element_blank()) +
scale_fill_manual(name='Electoral College winner',values=c('Democratic'='#3A4EB1','Republican'='#E40A04')) +
labs(x='Democratic electoral college votes',title='2016, Potential Electoral College outcomes',
subtitle=sprintf("p(dem win) = %s | full stan model",round(mean(final_evs$dem_ev>=270),3)))

# get the p value of the actual results
p_values_2016 <- draws %>%
filter(state != '--',
t == election_day) %>%
left_join(politicaldata::pres_results %>% filter(year == 2016) %>%
mutate(actual = dem/(dem+rep)) %>%
dplyr::select(state,actual)) %>%
group_by(state) %>%
summarise(high = quantile(pct_clinton, 0.975),
low = quantile(pct_clinton, 0.025),
mean = mean(pct_clinton),
prob = mean(pct_clinton > 0.5),
actual = unique(actual),
p_value =
( 2*sum(pct_clinton < actual) + 1 ) /
( 2 * n() + 2) #,
# p_value_2 = mean(ifelse(actual >= mean(pct_clinton),
# pct_clinton >= actual,
# pct_clinton <= actual))
) %>%
arrange(p_value) %>%
mutate(year = 2016,
outside_ci = ifelse(actual > high | actual < low, TRUE,FALSE))

```

#### Map

```{r echo=F,message=F,warning=F}
map.2016.gg
```

#### Final electoral college histogram

```{r echo=F,message=F,warning=F}
final_evs.2016.gg
```

#### National and state polls and the electoral college over time

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .62}
grid.arrange(natl_polls.2016.gg, natl_evs.2016.gg, state_polls.2016.gg,
layout_matrix = rbind(c(1,1,3,3,3),
c(2,2,3,3,3)),
top = identifier
)
```

#### State vs national deltas over time

```{r echo=F,message=F,warning=F}
state_lean.2016.gg
```

#### Model results vs polls vs the prior

```{r echo=F,message=F,warning=F}
predictcompare.2016.gg
```

#### Performance

````{r echo=F,message=F,warning=F}
knitr::kable(briers.2016)
model_v_actual.2016.gg
model_v_actual.2016.rmse
```

#### Predictions for each state

```{r echo=F,message=F,warning=F}
pct_clinton %>%
filter(t == max(t)) %>%
mutate(se = (high - mean)/1.68) %>%
mutate(high = round(high,3),
low = round(low,3),
mean = round(mean,3),
prob = round(prob,3),
se = round(se,3)) %>%
select(state,mean,low,high,prob,se) %>%
arrange(abs(mean-0.5)) %>%
knitr::kable()

```

## Cumulative charts

### Probability calibration plot

```{r echo=F,message=F,warning=F}
calibration_data <- model_v_actual.2016 %>% mutate(year = 2016) %>%
bind_rows(model_v_actual.2012 %>% mutate(year = 2012)) %>%
bind_rows(model_v_actual.2008 %>% mutate(year = 2008))

calibration.gg <- ggplot(calibration_data, aes(x=prob,y=ifelse(actual>0.5,1,0),
label = paste0(state,'-',year))) +
geom_point() +
geom_text_repel(data = calibration_data %>%
filter(prob < 0.6 & actual>0.5 |
prob > 0.4 & actual<0.5 )) +
geom_smooth() +
geom_abline() +
scale_x_continuous(breaks=seq(0,1,0.1)) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(x='Predicted Democratic win probability on election-day',
y='Result (1 = Dem win, 0 = Rep)')

print(calibration.gg)

```

### Confidence interval coverage

```{r echo=F, message=FALSE, warning=FALSE}

calibration_data <- p_values_2016 %>% mutate(year = 2016) %>%
bind_rows(p_values_2012 %>% mutate(year = 2012) ) %>%
bind_rows(p_values_2008 %>% mutate(year = 2008)) %>%
filter(state != 'DC')

# sum(calibration_data$outside_ci) / nrow(calibration_data)

p_value_hist.gg <- ggplot(calibration_data, aes(x=p_value)) +
geom_histogram(binwidth = 0.01,aes(group=paste0(state,year))) +
geom_vline(xintercept = 0.025,col='red',linetype=2) +
geom_vline(xintercept = 0.975,col='red',linetype=2) +
scale_x_continuous(breaks=seq(0,1,0.05)) +
scale_y_continuous(breaks=seq(0,100,2)) +
theme_minimal() +
theme(panel.grid.minor = element_blank())

preds_p_value_scatter.gg <- ggplot(calibration_data, aes(x=mean*100,y=actual*100)) +
geom_abline() +
# states inside the ci
geom_point(data =. %>% filter(!outside_ci),col='black') +
# states outside
geom_point(data = . %>% filter(outside_ci),col='red') +
geom_text_repel(data = . %>% filter(outside_ci),
aes(label=state),col='red',min.segment.length = 0) +
geom_segment(data = . %>% filter(outside_ci),
aes(x=mean*100,xend=mean*100,y=low*100,yend=high*100),col='red') +
facet_wrap(~year,ncol=3) +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
labs(
# title='Model with normal(0, sigma) distributions for polling error and state priors',
subtitle='Results in highlighted states fall outside our uncertainty intervals',
x='Predicted Democratic share of the two-party vote',
y='Actual Democratic share'
) +
coord_cartesian(xlim=c(25,75),ylim=c(25,75))

```

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .42}
print(preds_p_value_scatter.gg)
```

```{r echo=F,message=F,warning=F}
print(p_value_hist.gg)
```

# Licence

This software is published by _[The Economist](https://www.economist.com)_ under the [MIT licence](https://opensource.org/licenses/MIT). The data generated by _The Economist_ are available under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

The licences include only the data and the software authored by _The Economist_, and do not cover any _Economist_ content or third-party data or content made available using the software. More information about licensing, syndication and the copyright of _Economist_ content can be found [here](https://www.economist.com/rights/).