Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/theeconomist/us-potus-model

Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.
https://github.com/theeconomist/us-potus-model
Last synced: 18 days ago
JSON representation
Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.
Host: GitHub
URL: https://github.com/theeconomist/us-potus-model
Owner: TheEconomist
License: mit
Created: 2020-02-14T17:29:59.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2020-10-15T16:41:36.000Z (over 4 years ago)
Last Synced: 2024-11-14T21:14:04.175Z (3 months ago)
Language: HTML
Homepage: https://projects.economist.com/us-2020-forecast/president
Size: 166 MB
Stars: 1,245
Watchers: 99
Forks: 191
Open Issues: 5
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project

README

        ---

title: "State and national presidential election forecasting model"

subtitle: "By The Economist with Andrew Gelman and Merlin Heidemanns, Columbia University"

date: "Last update on `r format(Sys.time(), '%A %B %d, %Y at %I:%M %p %Z',tz='US/Eastern')`"

output: github_document

---

Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.

Improving on Pierre Kremp's [implementation](http://www.slate.com/features/pkremp_forecast/report.html) of Drew Linzer's dynamic linear model for election forecasting [(Linzer 2013)](https://votamatic.org/wp-content/uploads/2013/07/Linzer-JASA13.pdf), we (1) add corrections for partisan non-response, survey mode and survey population; (2) use informative state-level priors that update throughout the election year; and (3) specify empirical state-level correlations from political and demographic variables.

You can see the model's predictions for 2020 [here](https://projects.economist.com/us-2020-forecast/president) and read how it works [here](https://projects.economist.com/us-2020-forecast/president/how-this-works).

## File dictionary

In terms of useful files, you should pay attention to the 3 scripts for the 2008, 2012 and 2016 US presidential elections are located in the `scripts/model` directory. There are three R scripts that import data, run models and parse results:

* `final_model_2008.R`

* `final_model_2012.R`

* `final_model_2016.R`

And there are 3 different Stan scripts that will run different versions of our polling aggregate and election forecasting model:

* `poll_model_2020.stan` - the final model we use for the 2020 presidential election

* `poll_model_2020_no_mode_adjustment.stan` - a model that removes the correction for partisan non-response bias in the polls and the adjustments for the mode in which a survey is conducted (live phone, online, other) and its population (adult, likely voter, registered voter)

## Model performance 

Here is a graphical summary of the model's performance in 2008, 2012 and 2016.

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)

library(lubridate)

library(gridExtra)

library(boot)

library(pbapply)

library(urbnmapr)

library(ggrepel)

library(DT)

#  source('scripts/model/final_2016.R');source('scripts/model/final_2012.R');source('scripts/model/final_2008.R');beepr::beep(2);

```

### 2008 

```{r include=F}

## Master variables

RUN_DATE <- min(ymd('2008-11-03'),Sys.Date())

election_day <- ymd("2008-11-03")

start_date <- as.Date("2008-03-01") # Keeping all polls after March 1, 2016

  

evs_df <- read_csv('data/2012.csv')

# will read in the most recent back-test for 2008

out  <- read_rds(sprintf('models/backtest_2008/stan_model_%s_normal_final.rds',RUN_DATE))

# get all the polling data

#setwd(here("data/"))

all_polls <- read_csv('data/all_polls_2008.csv')

# select relevant columns from HufFPost polls

all_polls <- all_polls %>%

  dplyr::select(state, pollster, number.of.observations, mode,population,

                start.date, 

                end.date,

                obama, mccain, undecided, other)%>%

  mutate(end.date = mdy(end.date),

         start.date = mdy(start.date))

# make sure we've got nothing from the futuree

all_polls <- all_polls %>%

  filter(ymd(end.date) <= RUN_DATE)

# basic mutations

df <- all_polls %>% 

  tbl_df %>%

  rename(n = number.of.observations) %>%

  mutate(begin = ymd(start.date),

         end   = ymd(end.date),

         t = end - (1 + as.numeric(end-begin)) %/% 2) %>%

  filter(t >= start_date & !is.na(t)

         & n > 1) 

# pollster mutations

df <- df %>%

  mutate(pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),

         pollster = replace(pollster, pollster == "Fox News", "FOX"), # Fixing inconsistencies in pollster names

         pollster = replace(pollster, pollster == "WashPost", "Washington Post"),

         pollster = replace(pollster, pollster == "ABC News", "ABC"),

         undecided = ifelse(is.na(undecided), 0, undecided),

         other = ifelse(is.na(other), 0, other))

# vote shares etc

df <- df %>%

  mutate(two_party_sum = obama + mccain,

         polltype = as.integer(as.character(recode(population, 

                                                   "Likely Voters" = "0", 

                                                   "Registered Voters" = "1",

                                                   "Adults" = "2"))), 

         n_respondents = round(n),

         # obama

         n_obama = round(n * obama/100),

         pct_obama = obama/two_party_sum,

         # mccain

         n_mccain = round(n * mccain/100),

         pct_mccain = mccain/two_party_sum,

         # third-party

         n_other = round(n * other/100),

         p_other = other/100)

# state info

state_correlation <- read.csv("data/potus_results_76_16.csv") %>% 

    select(year, state, dem) %>%

    group_by(state) %>%

    mutate(dem = dem )  %>% 

    spread(state,dem) %>% 

    dplyr::select(-year)

#here("data")

state_data <- read.csv("data/potus_results_76_16.csv")

all_state_names <- unique(state_data$state)

# Numerical indices passed to Stan for states, days, weeks, pollsters

df <- df %>% 

  mutate(poll_day = t - min(t) + 1,

         # Factors are alphabetically sorted: 1 = --, 2 = AL, 3 = AK, 4 = AZ...

         index_s = as.numeric(factor(as.character(state),

                                     levels = c('--',all_state_names))),  # ensure levels are same as all 50 names in sate_correlation

         index_t = 1 + as.numeric(t) - min(as.numeric(t)),

         index_p = as.numeric(as.factor(as.character(pollster))))  

T <- as.integer(round(difftime(election_day, min(df$start.date))))

# selections

df <- df %>%

  arrange(state, t, polltype, two_party_sum) %>% 

  distinct(state, t, pollster, .keep_all = TRUE) %>%

  select(

    # poll information

    state, t, begin, end, pollster, polltype, method = mode, n_respondents, 

    # vote shares

    pct_obama, n_obama, 

    pct_mccain, n_mccain, 

    p_other, n_other, poll_day, index_s, index_p, index_t) %>%

  mutate(index_s = ifelse(index_s == 1, 52, index_s - 1)) # national index = 51

# state weights

#setwd(here("data/"))

state_elec_results <- read.csv("data/2008.csv", 

                       header = TRUE, stringsAsFactors = FALSE) %>% 

  mutate(score = obama_count / (obama_count + mccain_count),

         national_score = sum(obama_count)/sum(obama_count + mccain_count),

         delta = score - national_score,

         share_national_vote = (total_count)

         /sum(total_count)) %>%

  arrange(state) 

rownames(state_elec_results) <- state_elec_results$state

# get state incdices

all_states <- state_elec_results$state

state_name <- state_elec_results$state_name

names(state_name) <- state_elec_results$state

# set prior differences

prior_diff_score <- state_elec_results$delta

names(prior_diff_score) <- all_states

# set state weights

state_weights <- c(state_elec_results$share_national_vote / sum(state_elec_results$share_national_vote))

names(state_weights) <- c(state_elec_results$state)

# electoral votes, by state:

ev_state <- state_elec_results$ev

names(ev_state) <- state_elec_results$state

# read in prior

prior_in <- read_csv("data/state_priors_08_12_16.csv") %>%

  filter(date <= RUN_DATE) %>%

  group_by(state) %>%

  arrange(date) %>%

  filter(date == max(date)) %>%

  select(state,pred) %>%

  ungroup() %>%

  arrange(state)

mu_b_prior <- logit(prior_in$pred)

names(mu_b_prior) <- prior_in$state

names(mu_b_prior) == names(prior_diff_score) # correct order?

national_mu_prior <- weighted.mean(inv.logit(mu_b_prior), state_weights)

# extract predictions

# states

predicted_score <- rstan::extract(out, pars = "predicted_score")[[1]]

pct_obama <- pblapply(1:dim(predicted_score)[3],

                    function(x){

                      # pred is mu_a + mu_b for the past, just mu_b for the future

                      temp <- predicted_score[,,x]

                      

                      # put in tibble

                      tibble(low = apply(temp,2,function(x){(quantile(x,0.025))}),

                             high = apply(temp,2,function(x){(quantile(x,0.975))}),

                             mean = apply(temp,2,function(x){(mean(x))}),

                             prob = apply(temp,2,function(x){(mean(x>0.5))}),

                             state = x) 

                      

                    }) %>% do.call('bind_rows',.)

pct_obama$state = colnames(state_correlation)[pct_obama$state]

pct_obama <- pct_obama %>%

  group_by(state) %>%

  mutate(t = row_number() + min(df$begin)) %>%

  ungroup()

# national

pct_obama_natl <- pblapply(1:dim(predicted_score)[2],

                         function(x){

                           # each row is a day for a particular draw

                           temp <- predicted_score[x,,] %>% as.data.frame()

                           names(temp) <- colnames(state_correlation)

                           

                           # for each row, get weigted natl vote

                           tibble(natl_vote = apply(temp,MARGIN = 1,function(y){weighted.mean(y,state_weights)})) %>%

                             mutate(t = row_number() + min(df$begin)) %>%

                             mutate(draw = x)

                         }) %>% do.call('bind_rows',.)

pct_obama_natl <- pct_obama_natl %>%

  group_by(t) %>%

  summarise(low = quantile(natl_vote,0.025),

            high = quantile(natl_vote,0.975),

            mean = mean(natl_vote),

            prob = mean(natl_vote > 0.5)) %>%

  mutate(state = '--')

# bind state and national vote

pct_obama <- pct_obama %>%

  bind_rows(pct_obama_natl) %>%

  arrange(desc(mean))

# look

ex_states <- c('IA','FL','OH','WI','MI','PA','AZ','NC','NH','TX','GA','MN')

pct_obama %>% filter(t == RUN_DATE,state %in% c(ex_states,'--')) %>% mutate(se = (high - mean)/1.96) %>% dplyr::select(-t)

map.2008.gg <- urbnmapr::states %>%

  left_join(pct_obama %>% filter(t == max(t)) %>%

              select(state_abbv=state,prob)) %>%

  ggplot(aes(x=long,y=lat,group=group,fill=prob)) +

  geom_polygon(col='gray40')  + 

  coord_map("albers",lat0=39, lat1=45) +

  scale_fill_gradient2(name='Democratic win probability',high='blue',low='red',mid='gray90',midpoint=0.5) +

  theme_void() + 

  theme(legend.position = 'top')

# electoral college by simulation

draws <- pblapply(1:dim(predicted_score)[3],

                  function(x){

                    # pred is mu_a + mu_b for the past, just mu_b for the future

                    pct_obama <- predicted_score[,,x]

                    

                    pct_obama <- pct_obama %>%

                      as.data.frame() %>%

                      mutate(draw = row_number()) %>%

                      gather(t,pct_obama,1:(ncol(.)-1)) %>%

                      mutate(t = as.numeric(gsub('V','',t)) + min(df$begin),

                             state = all_state_names[x]) 

                    

                    

                  }) %>% do.call('bind_rows',.)

sim_evs <- draws %>%

  left_join(state_elec_results %>% select(state,ev),by='state') %>%

  group_by(t,draw) %>%

  summarise(dem_ev = sum(ev * (pct_obama > 0.5))) %>%

  group_by(t) %>%

  summarise(mean_dem_ev = mean(dem_ev),

            median_dem_ev = median(dem_ev),

            high_dem_ev = quantile(dem_ev,0.975),

            low_dem_ev = quantile(dem_ev,0.025),

            prob = mean(dem_ev >= 270)) %>%

  left_join(pct_obama[pct_obama$state != '--',] %>%

              left_join(state_elec_results %>% select(state,ev),by='state') %>%

              group_by(t) %>%

              summarise(sum_dem_ev = sum(ev * (prob > 0.5))) )

# add identifier

identifier <- paste0(Sys.Date()," || " , out@model_name)

natl_polls.2008.gg <- pct_obama %>%

  filter(state == '--') %>%

  left_join(df %>% select(state,t,pct_obama)) %>% # plot over time

  # plot

  ggplot(.,aes(x=t)) +

  geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +

  geom_hline(yintercept = 0.5) +

  geom_hline(yintercept = national_mu_prior,linetype=2) +

  geom_point(aes(y=pct_obama),alpha=0.3) +

  geom_line(aes(y=mean)) +

  facet_wrap(~state) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +

  scale_y_continuous(breaks=seq(0,1,0.02)) + 

  labs(subtitletitle=sprintf('obama natl pct | mean = %s | p(win) = %s',

                             round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$mean*100,1),

                             round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$prob,2)))

natl_evs.2008.gg <-  ggplot(sim_evs, aes(x=t)) +

  geom_hline(yintercept = 270) +

  geom_line(aes(y=median_dem_ev)) +

  geom_ribbon(aes(ymin=low_dem_ev,ymax=high_dem_ev),alpha=0.2) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +

  labs(subtitletitle=sprintf('obama evs | median = %s | p(win) = %s',

                             round(sim_evs[sim_evs$t==election_day,]$median_dem_ev),

                             round(sim_evs[sim_evs$t==election_day,]$prob,2)))

state_polls.2008.gg <- pct_obama %>%

  filter(state %in% ex_states) %>%

  left_join(df %>% select(state,t,pct_obama)) %>% 

  left_join(tibble(state = names(mu_b_prior),

                   prior = inv.logit(mu_b_prior)) ) %>%

  ggplot(.,aes(x=t,col=state)) +

  geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +

  geom_hline(yintercept = 0.5) +

  geom_hline(aes(yintercept = prior),linetype=2) +

  geom_point(aes(y=pct_obama),alpha=0.3) +

  geom_line(aes(y=mean)) +

  facet_wrap(~state) +

  theme_minimal()  +

  theme(legend.position = 'top') +

  guides(color='none') +

  scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +

  labs(subtitle='pct_obama state')

# diff from national over time?

state_lean.2008.gg <- pct_obama[pct_obama$state != '--',] %>%

  left_join(pct_obama[pct_obama$state=='--',] %>%

              select(t,pct_obama_national=mean), by='t') %>%

  mutate(diff=mean-pct_obama_national) %>%

  group_by(state) %>%

  mutate(last_prob = last(prob)) %>%

  filter(state %in% ex_states) %>%

  ggplot(.,aes(x=t,y=diff,col=state)) +

  geom_hline(yintercept=0.0) +

  geom_line() +

  geom_label_repel(data = . %>% 

                     filter(t==max(t),

                            state %in% ex_states),

                   aes(label=state)) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2008-03-01','2008-11-03')),date_breaks='1 month',date_labels='%b') +

  scale_y_continuous(breaks=seq(-1,1,0.01)) +

  labs(subtitle = identifier)

# brier scores

# https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades

compare <- pct_obama %>% 

  filter(t==max(t),state!='--') %>% 

  select(state,obama_win = prob) %>% 

  mutate(obama_win_actual = ifelse(state %in% c('CA','NV','OR','WA','CO','NM','MN','IL','VA','DC','MD','DE','NJ','CT','RI','MA','NH','VT','NY','HI','ME','MI','IA','OH','PA','WI','FL','NC','IN'),1,0),

         diff = (obama_win_actual - obama_win )^2) %>% 

  left_join(enframe(ev_state) %>% set_names(.,c('state','ev'))) %>% 

  mutate(ev_weight = ev/(sum(ev))) 

briers.2008 <- tibble(outlet='economist (backtest)',

       ev_wtd_brier = weighted.mean(compare$diff, compare$ev_weight),

       unwtd_brier = mean(compare$diff),

       states_correct=sum(round(compare$obama_win) == round(compare$obama_win_actual)))

# compare model estiamte v actual results

model_v_actual.2008 <- pct_obama %>% 

  filter(t==max(t),state!='--') %>% 

  left_join(politicaldata::pres_results %>% filter(year == 2008) %>%

              mutate(actual = dem/(dem+rep)) %>%

              dplyr::select(state,actual)) %>%

  filter(state != 'DC')

model_v_actual.2008.rmse <- sqrt(mean(c((model_v_actual.2008$mean - model_v_actual.2008$actual)^2)))

model_v_actual.2008.gg <- ggplot(model_v_actual.2008,aes(x=mean,y=actual,label=state)) +

  geom_abline() + 

  geom_text() + 

  stat_smooth(method='lm') +

  theme_minimal()

# model vs final polls vs prior

predictcompare.2008.gg <- pct_obama %>%

  filter(t == max(t),state %in% ex_states) %>%

  mutate(se = (high - mean)/1.68) %>%

  select(state,model_mean=mean,model_se=se) %>%

  left_join(df %>%

              filter(t > (max(t)-14),

                     state %in% ex_states) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%

  left_join(enframe(mu_b_prior,'state','prior') %>%

              mutate(prior = inv.logit(prior))) %>%

  left_join(model_v_actual.2008 %>% dplyr::select(state,actual)) %>%

  ggplot(.,aes(y=state)) +

  geom_point(aes(x=poll,col='poll')) +

  geom_point(aes(x=model_mean,col='model')) +

  geom_point(aes(x=prior,col='prior')) +

  geom_point(aes(x=actual,col='result')) +

  theme_minimal()

# model better than polls?

 pct_obama %>%

  filter(t == max(t),state %in% ex_states) %>%

  mutate(se = (high - mean)/1.68) %>%

  select(state,model_mean=mean,model_se=se) %>%

  left_join(df %>%

              filter(t > (max(t)-14),

                     state %in% ex_states) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%

  left_join(enframe(mu_b_prior,'state','prior') %>%

              mutate(prior = inv.logit(prior))) %>%

  left_join(model_v_actual.2008 %>% dplyr::select(state,actual)) %>% 

   summarise(poll_error = mean(abs(poll-actual)),

             model_error = mean(abs(model_mean-actual)))

 

# expected EVs roughly match the stochastic simulations?

final_states <- enframe(mu_b_prior,'state','prior') %>%

  left_join(df %>%

              filter(t > (max(t)-14),) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%

  mutate(avg_poll_less_prior = mean(poll - inv.logit(prior), na.rm=T)) %>%

  mutate(poll = ifelse(is.na(poll),

                       inv.logit(prior) + avg_poll_less_prior,

                       poll)) %>%

  select(state,poll)

# final EV distribution

final_evs <- draws %>%

  left_join(state_elec_results %>% select(state,ev),by='state') %>%

  filter(t==max(t)) %>%

  group_by(draw) %>%

  summarise(dem_ev = sum(ev* (pct_obama > 0.5)))

final_evs.2008.gg <-  ggplot(final_evs,aes(x=dem_ev,

                       fill=ifelse(dem_ev>=270,'Democratic','Republican'))) +

    geom_vline(xintercept = 270) +

    geom_histogram(binwidth=1) +

    theme_minimal() +

    theme(legend.position = 'top',

          panel.grid.minor = element_blank()) +

    scale_fill_manual(name='Electoral College winner',values=c('Democratic'='#3A4EB1','Republican'='#E40A04')) +

    labs(x='Democratic electoral college votes',title='2008, Potential Electoral College outcomes',

         subtitle=sprintf("p(dem win) = %s | full stan model",round(mean(final_evs$dem_ev>=270),3)))

# get the p value of the actual results

p_values_2008 <- draws %>% 

  filter(state != '--',

         t == election_day) %>% 

  left_join(politicaldata::pres_results %>% filter(year == 2008) %>%

              mutate(actual = dem/(dem+rep)) %>%

              dplyr::select(state,actual)) %>%

  group_by(state) %>%

  summarise(high = quantile(pct_obama, 0.975),

            low = quantile(pct_obama, 0.025),

            mean = mean(pct_obama),

            prob = mean(pct_obama > 0.5),

            actual = unique(actual),

            p_value = 

           ( 2*sum(pct_obama < actual) + 1 ) /

           ( 2 * n() + 2) #,

           # p_value_2 = mean(ifelse(actual >= mean(pct_obama),

           #                         pct_obama >= actual,

           #                         pct_obama <= actual))

           ) %>%

  arrange(p_value) %>% 

  mutate(year = 2008,

         outside_ci = ifelse(actual > high | actual < low, TRUE,FALSE))

```

#### Map

```{r echo=F,message=F,warning=F}

map.2008.gg

```

#### Final electoral college histogram

```{r echo=F,message=F,warning=F}

final_evs.2008.gg

```

#### National and state polls and the electoral college over time

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .62}

grid.arrange(natl_polls.2008.gg, natl_evs.2008.gg, state_polls.2008.gg, 

             layout_matrix = rbind(c(1,1,3,3,3),

                                   c(2,2,3,3,3)),

             top = identifier

)

```

#### State vs national deltas over time

```{r echo=F,message=F,warning=F}

state_lean.2008.gg

```

#### Model results vs polls vs the prior

```{r echo=F,message=F,warning=F}

predictcompare.2008.gg

```

#### Performance

````{r echo=F,message=F,warning=F}

knitr::kable(briers.2008)

model_v_actual.2008.gg

model_v_actual.2008.rmse

```

#### Predictions for each state

```{r echo=F,message=F,warning=F}

pct_obama %>% 

  filter(t == max(t)) %>% 

  mutate(se = (high - mean)/1.68) %>%

  mutate(high = round(high,3),

         low = round(low,3),

         mean = round(mean,3),

         prob = round(prob,3),

         se = round(se,3)) %>%

  select(state,mean,low,high,prob,se) %>%

  arrange(abs(mean-0.5)) %>%

  knitr::kable()

            

```

### 2012 

```{r include=F}

## Master variables

RUN_DATE <- min(ymd('2012-11-06'),Sys.Date())

election_day <- ymd("2012-11-06")

start_date <- as.Date("2012-03-01") # Keeping all polls after March 1, 2016

  

evs_df <- read_csv('data/2012.csv')

# will read in the most recent back-test for 2012

out  <- read_rds(sprintf('models/backtest_2012/stan_model_%s_normal_final.rds',RUN_DATE))

# get all the polling data

#setwd(here("data/"))

all_polls <- read_csv('data/all_polls_2012.csv')

# select relevant columns from HufFPost polls

all_polls <- all_polls %>%

  dplyr::select(state, pollster, number.of.observations, mode,population,

                start.date, 

                end.date,

                obama, romney, undecided, other)%>%

  mutate(end.date = mdy(end.date),

         start.date = mdy(start.date))

# make sure we've got nothing from the futuree

all_polls <- all_polls %>%

  filter(ymd(end.date) <= RUN_DATE)

# basic mutations

df <- all_polls %>% 

  tbl_df %>%

  rename(n = number.of.observations) %>%

  mutate(begin = ymd(start.date),

         end   = ymd(end.date),

         t = end - (1 + as.numeric(end-begin)) %/% 2) %>%

  filter(t >= start_date & !is.na(t)

         & n > 1) 

# pollster mutations

df <- df %>%

  mutate(pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),

         pollster = replace(pollster, pollster == "Fox News", "FOX"), # Fixing inconsistencies in pollster names

         pollster = replace(pollster, pollster == "WashPost", "Washington Post"),

         pollster = replace(pollster, pollster == "ABC News", "ABC"),

         undecided = ifelse(is.na(undecided), 0, undecided),

         other = ifelse(is.na(other), 0, other))

# vote shares etc

df <- df %>%

  mutate(two_party_sum = obama + romney,

         polltype = as.integer(as.character(recode(population, 

                                                   "Likely Voters" = "0", 

                                                   "Registered Voters" = "1",

                                                   "Adults" = "2"))), 

         n_respondents = round(n),

         # obama

         n_obama = round(n * obama/100),

         pct_obama = obama/two_party_sum,

         # romney

         n_romney = round(n * romney/100),

         pct_romney = romney/two_party_sum,

         # third-party

         n_other = round(n * other/100),

         p_other = other/100)

# state info

#here("data")

state_data <- read.csv("data/potus_results_76_16.csv")

all_state_names <- unique(state_data$state)

# Numerical indices passed to Stan for states, days, weeks, pollsters

df <- df %>% 

  mutate(poll_day = t - min(t) + 1,

         # Factors are alphabetically sorted: 1 = --, 2 = AL, 3 = AK, 4 = AZ...

         index_s = as.numeric(factor(as.character(state),

                                     levels = c('--',all_state_names))),  # ensure levels are same as all 50 names in sate_correlation

         index_t = 1 + as.numeric(t) - min(as.numeric(t)),

         index_p = as.numeric(as.factor(as.character(pollster))))  

T <- as.integer(round(difftime(election_day, min(df$start.date))))

# selections

df <- df %>%

  arrange(state, t, polltype, two_party_sum) %>% 

  distinct(state, t, pollster, .keep_all = TRUE) %>%

  select(

    # poll information

    state, t, begin, end, pollster, polltype, method = mode, n_respondents, 

    # vote shares

    pct_obama, n_obama, 

    pct_romney, n_romney, 

    p_other, n_other, poll_day, index_s, index_p, index_t) %>%

  mutate(index_s = ifelse(index_s == 1, 52, index_s - 1)) # national index = 51

# state weights

#setwd(here("data/"))

state_elec_results <- read.csv("data/2012.csv", 

                       header = TRUE, stringsAsFactors = FALSE) %>% 

  mutate(score = obama_count / (obama_count + romney_count),

         national_score = sum(obama_count)/sum(obama_count + romney_count),

         delta = score - national_score,

         share_national_vote = (total_count)

         /sum(total_count)) %>%

  arrange(state) 

rownames(state_elec_results) <- state_elec_results$state

# get state incdices

all_states <- state_elec_results$state

state_name <- state_elec_results$state_name

names(state_name) <- state_elec_results$state

# set prior differences

prior_diff_score <- state_elec_results$delta

names(prior_diff_score) <- all_states

# set state weights

state_weights <- c(state_elec_results$share_national_vote / sum(state_elec_results$share_national_vote))

names(state_weights) <- c(state_elec_results$state)

# electoral votes, by state:

ev_state <- state_elec_results$ev

names(ev_state) <- state_elec_results$state

# read in prior

prior_in <- read_csv("data/state_priors_08_12_16.csv") %>%

  filter(date <= RUN_DATE) %>%

  group_by(state) %>%

  arrange(date) %>%

  filter(date == max(date)) %>%

  select(state,pred) %>%

  ungroup() %>%

  arrange(state)

mu_b_prior <- logit(prior_in$pred)

names(mu_b_prior) <- prior_in$state

names(mu_b_prior) == names(prior_diff_score) # correct order?

national_mu_prior <- weighted.mean(inv.logit(mu_b_prior), state_weights)

# extract predictions

predicted_score <- rstan::extract(out, pars = "predicted_score")[[1]]

# states

pct_obama <- pblapply(1:dim(predicted_score)[3],

                    function(x){

                      # pred is mu_a + mu_b for the past, just mu_b for the future

                      temp <- predicted_score[,,x]

                      

                      # put in tibble

                      tibble(low = apply(temp,2,function(x){(quantile(x,0.025))}),

                             high = apply(temp,2,function(x){(quantile(x,0.975))}),

                             mean = apply(temp,2,function(x){(mean(x))}),

                             prob = apply(temp,2,function(x){(mean(x>0.5))}),

                             state = x) 

                      

                    }) %>% do.call('bind_rows',.)

pct_obama$state = colnames(state_correlation)[pct_obama$state]

pct_obama <- pct_obama %>%

  group_by(state) %>%

  mutate(t = row_number() + min(df$begin)) %>%

  ungroup()

# national

pct_obama_natl <- pblapply(1:dim(predicted_score)[2],

                         function(x){

                           # each row is a day for a particular draw

                           temp <- predicted_score[x,,] %>% as.data.frame()

                           names(temp) <- colnames(state_correlation)

                           

                           # for each row, get weigted natl vote

                           tibble(natl_vote = apply(temp,MARGIN = 1,function(y){weighted.mean(y,state_weights)})) %>%

                             mutate(t = row_number() + min(df$begin)) %>%

                             mutate(draw = x)

                         }) %>% do.call('bind_rows',.)

pct_obama_natl <- pct_obama_natl %>%

  group_by(t) %>%

  summarise(low = quantile(natl_vote,0.025),

            high = quantile(natl_vote,0.975),

            mean = mean(natl_vote),

            prob = mean(natl_vote > 0.5)) %>%

  mutate(state = '--')

# bind state and national vote

pct_obama <- pct_obama %>%

  bind_rows(pct_obama_natl) %>%

  arrange(desc(mean))

# look

ex_states <- c('IA','FL','OH','WI','MI','PA','AZ','NC','NH','TX','GA','MN')

pct_obama %>% filter(t == RUN_DATE,state %in% c(ex_states,'--')) %>% mutate(se = (high - mean)/1.96) %>% dplyr::select(-t)

map.2012.gg <- urbnmapr::states %>%

  left_join(pct_obama %>% filter(t == max(t)) %>%

              select(state_abbv=state,prob)) %>%

  ggplot(aes(x=long,y=lat,group=group,fill=prob)) +

  geom_polygon(col='gray40')  + 

  coord_map("albers",lat0=39, lat1=45) +

  scale_fill_gradient2(name='Democratic win probability',high='blue',low='red',mid='gray90',midpoint=0.5) +

  theme_void() + 

  theme(legend.position = 'top')

# electoral college by simulation

draws <- pblapply(1:dim(predicted_score)[3],

                  function(x){

                    # pred is mu_a + mu_b for the past, just mu_b for the future

                    pct_obama <- predicted_score[,,x]

                    

                    pct_obama <- pct_obama %>%

                      as.data.frame() %>%

                      mutate(draw = row_number()) %>%

                      gather(t,pct_obama,1:(ncol(.)-1)) %>%

                      mutate(t = as.numeric(gsub('V','',t)) + min(df$begin),

                             state = all_state_names[x]) 

                    

                    

                  }) %>% do.call('bind_rows',.)

sim_evs <- draws %>%

  left_join(state_elec_results %>% select(state,ev),by='state') %>%

  group_by(t,draw) %>%

  summarise(dem_ev = sum(ev * (pct_obama > 0.5))) %>%

  group_by(t) %>%

  summarise(mean_dem_ev = mean(dem_ev),

            median_dem_ev = median(dem_ev),

            high_dem_ev = quantile(dem_ev,0.975),

            low_dem_ev = quantile(dem_ev,0.025),

            prob = mean(dem_ev >= 270)) %>%

  left_join(pct_obama[pct_obama$state != '--',] %>%

              left_join(state_elec_results %>% select(state,ev),by='state') %>%

              group_by(t) %>%

              summarise(sum_dem_ev = sum(ev * (prob > 0.5))) )

# add identifier

identifier <- paste0(Sys.Date()," || " , out@model_name)

natl_polls.2012.gg <- pct_obama %>%

  filter(state == '--') %>%

  left_join(df %>% select(state,t,pct_obama)) %>% # plot over time

  # plot

  ggplot(.,aes(x=t)) +

  geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +

  geom_hline(yintercept = 0.5) +

  geom_hline(yintercept = national_mu_prior,linetype=2) +

  geom_point(aes(y=pct_obama),alpha=0.3) +

  geom_line(aes(y=mean)) +

  facet_wrap(~state) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +

  scale_y_continuous(breaks=seq(0,1,0.02)) + 

  labs(subtitletitle=sprintf('obama natl pct | mean = %s | p(win) = %s',

                             round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$mean*100,1),

                             round(pct_obama[pct_obama$state=='--' & pct_obama$t==election_day,]$prob,2)))

natl_evs.2012.gg <-  ggplot(sim_evs, aes(x=t)) +

  geom_hline(yintercept = 270) +

  geom_line(aes(y=median_dem_ev)) +

  geom_ribbon(aes(ymin=low_dem_ev,ymax=high_dem_ev),alpha=0.2) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +

  labs(subtitletitle=sprintf('obama evs | median = %s | p(win) = %s',

                             round(sim_evs[sim_evs$t==election_day,]$median_dem_ev),

                             round(sim_evs[sim_evs$t==election_day,]$prob,2)))

state_polls.2012.gg <- pct_obama %>%

  filter(state %in% ex_states) %>%

  left_join(df %>% select(state,t,pct_obama)) %>% 

  left_join(tibble(state = names(mu_b_prior),

                   prior = inv.logit(mu_b_prior)) ) %>%

  ggplot(.,aes(x=t,col=state)) +

  geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +

  geom_hline(yintercept = 0.5) +

  geom_hline(aes(yintercept = prior),linetype=2) +

  geom_point(aes(y=pct_obama),alpha=0.3) +

  geom_line(aes(y=mean)) +

  facet_wrap(~state) +

  theme_minimal()  +

  theme(legend.position = 'top') +

  guides(color='none') +

  scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +

  labs(subtitle='pct_obama state')

# diff from national over time?

state_lean.2012.gg <- pct_obama[pct_obama$state != '--',] %>%

  left_join(pct_obama[pct_obama$state=='--',] %>%

              select(t,pct_obama_national=mean), by='t') %>%

  mutate(diff=mean-pct_obama_national) %>%

  group_by(state) %>%

  mutate(last_prob = last(prob)) %>%

  filter(state %in% ex_states) %>%

  ggplot(.,aes(x=t,y=diff,col=state)) +

  geom_hline(yintercept=0.0) +

  geom_line() +

  geom_label_repel(data = . %>% 

                     filter(t==max(t),

                            state %in% ex_states),

                   aes(label=state)) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2012-03-01','2012-11-06')),date_breaks='1 month',date_labels='%b') +

  scale_y_continuous(breaks=seq(-1,1,0.01)) +

  labs(subtitle = identifier)

# brier scores

# https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades

compare <- pct_obama %>% 

  filter(t==max(t),state!='--') %>% 

  select(state,obama_win = prob) %>% 

  mutate(obama_win_actual = ifelse(state %in% c('CA','NV','OR','WA','CO','NM','MN','IL','VA','DC','MD','DE','NJ','CT','RI','MA','NH','VT','NY','HI','ME','MI','IA','OH','PA','WI','FL'),1,0),

         diff = (obama_win_actual - obama_win )^2) %>% 

  left_join(enframe(ev_state) %>% set_names(.,c('state','ev'))) %>% 

  mutate(ev_weight = ev/(sum(ev))) 

briers.2012 <- tibble(outlet = c('Linzer','Wang/Ferguson','Silver/538','Jackman/Pollster','Desart/Holbrook',

                  'Intrade','Enten/Margin of Error'),

       ev_wtd_brier = rep(NA,7),

       unwtd_brier = c(0.0038,0.00761,0.00911,0.00971,0.01605,0.02812,0.05075),

) %>% 

  bind_rows(tibble(outlet='economist (backtest)',

                   ev_wtd_brier = weighted.mean(compare$diff, compare$ev_weight),

                   unwtd_brier = mean(compare$diff),

                   states_correct=sum(round(compare$obama_win) == round(compare$obama_win_actual)))) %>%

  arrange(unwtd_brier)

# final v result

model_v_actual.2012 <- pct_obama %>% 

    filter(t==max(t),state!='--') %>% 

    left_join(politicaldata::pres_results %>% filter(year == 2012) %>%

                  mutate(actual = dem/(dem+rep)) %>%

                  dplyr::select(state,actual)) %>%

    filter(state != 'DC')

model_v_actual.2012.rmse <- sqrt(mean(c((model_v_actual.2012$mean - model_v_actual.2012$actual)^2)))

model_v_actual.2012.gg <- ggplot(model_v_actual.2012,aes(x=mean,y=actual,label=state)) +

    geom_abline() + 

    geom_text() + 

    stat_smooth(method='lm') +

    theme_minimal()

# model vs final polls vs prior

predictcompare.2012.gg <- pct_obama %>%

  filter(t == max(t),state %in% ex_states) %>%

  mutate(se = (high - mean)/1.68) %>%

  select(state,model_mean=mean,model_se=se) %>%

  left_join(df %>%

              filter(t > (max(t)-14),

                     state %in% ex_states) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%

  left_join(enframe(mu_b_prior,'state','prior') %>%

              mutate(prior = inv.logit(prior))) %>%

  left_join(model_v_actual.2012 %>% dplyr::select(state,actual)) %>%

  ggplot(.,aes(y=state)) +

  geom_point(aes(x=poll,col='poll')) +

  geom_point(aes(x=model_mean,col='model')) +

  geom_point(aes(x=prior,col='prior')) +

  geom_point(aes(x=actual,col='result')) +

  theme_minimal()

# model better than polls?

 pct_obama %>%

  filter(t == max(t),state %in% ex_states) %>%

  mutate(se = (high - mean)/1.68) %>%

  select(state,model_mean=mean,model_se=se) %>%

  left_join(df %>%

              filter(t > (max(t)-14),

                     state %in% ex_states) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%

  left_join(enframe(mu_b_prior,'state','prior') %>%

              mutate(prior = inv.logit(prior))) %>%

  left_join(model_v_actual.2012 %>% dplyr::select(state,actual)) %>% 

   summarise(poll_error = mean(abs(poll-actual)),

             model_error = mean(abs(model_mean-actual)))

 

# expected EVs roughly match the stochastic simulations?

final_states <- enframe(mu_b_prior,'state','prior') %>%

  left_join(df %>%

              filter(t > (max(t)-14),) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_obama,n_respondents))) %>%

  mutate(avg_poll_less_prior = mean(poll - inv.logit(prior), na.rm=T)) %>%

  mutate(poll = ifelse(is.na(poll),

                       inv.logit(prior) + avg_poll_less_prior,

                       poll)) %>%

  select(state,poll)

# final EV distribution

final_evs <- draws %>%

  left_join(state_elec_results %>% select(state,ev),by='state') %>%

  filter(t==max(t)) %>%

  group_by(draw) %>%

  summarise(dem_ev = sum(ev* (pct_obama > 0.5)))

final_evs.2012.gg <-  ggplot(final_evs,aes(x=dem_ev,

                       fill=ifelse(dem_ev>=270,'Democratic','Republican'))) +

    geom_vline(xintercept = 270) +

    geom_histogram(binwidth=1) +

    theme_minimal() +

    theme(legend.position = 'top',

          panel.grid.minor = element_blank()) +

    scale_fill_manual(name='Electoral College winner',values=c('Democratic'='#3A4EB1','Republican'='#E40A04')) +

    labs(x='Democratic electoral college votes',title='2012, Potential Electoral College outcomes',

         subtitle=sprintf("p(dem win) = %s | full stan model",round(mean(final_evs$dem_ev>=270),3)))

# get the p value of the actual results

p_values_2012 <- draws %>% 

  filter(state != '--',

         t == election_day) %>% 

  left_join(politicaldata::pres_results %>% filter(year == 2012) %>%

              mutate(actual = dem/(dem+rep)) %>%

              dplyr::select(state,actual)) %>%

  group_by(state) %>%

  summarise(high = quantile(pct_obama, 0.975),

            low = quantile(pct_obama, 0.025),

            mean = mean(pct_obama),

            prob = mean(pct_obama > 0.5),

            actual = unique(actual),

            p_value = 

           ( 2*sum(pct_obama < actual) + 1 ) /

           ( 2 * n() + 2) #,

           # p_value_2 = mean(ifelse(actual >= mean(pct_obama),

           #                         pct_obama >= actual,

           #                         pct_obama <= actual))

           ) %>%

  arrange(p_value) %>% 

  mutate(year = 2012,

         outside_ci = ifelse(actual > high | actual < low, TRUE,FALSE))

```

#### Map

```{r echo=F,message=F,warning=F}

map.2012.gg

```

#### Final electoral college histogram

```{r echo=F,message=F,warning=F}

final_evs.2012.gg

```

#### National and state polls and the electoral college over time

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .62}

grid.arrange(natl_polls.2012.gg, natl_evs.2012.gg, state_polls.2012.gg, 

             layout_matrix = rbind(c(1,1,3,3,3),

                                   c(2,2,3,3,3)),

             top = identifier

)

```

#### State vs national deltas over time

```{r echo=F,message=F,warning=F}

state_lean.2012.gg

```

#### Model results vs polls vs the prior

```{r echo=F,message=F,warning=F}

predictcompare.2012.gg

```

#### Performance

````{r echo=F,message=F,warning=F}

knitr::kable(briers.2012)

model_v_actual.2012.gg

model_v_actual.2012.rmse

```

#### Predictions for each state

```{r echo=F,message=F,warning=F}

pct_obama %>% 

  filter(t == max(t)) %>% 

  mutate(se = (high - mean)/1.68) %>%

  mutate(high = round(high,3),

         low = round(low,3),

         mean = round(mean,3),

         prob = round(prob,3),

         se = round(se,3)) %>%

  select(state,mean,low,high,prob,se) %>%

  arrange(abs(mean-0.5)) %>%

  knitr::kable()

            

```

### 2016 

```{r include=F}

## Master variables

RUN_DATE <- min(ymd('2016-11-08'),Sys.Date())

election_day <- ymd("2016-11-08")

start_date <- as.Date("2016-03-01") # Keeping all polls after March 1, 2016

  

evs_df <- read_csv('data/2012.csv')

# will read in the most recent back-test for 2012

out  <- read_rds(sprintf('models/stan_model_%s_normal_final.rds',RUN_DATE))

# get all the polling data

#setwd(here("data/"))

all_polls <- read_csv('data/all_polls.csv')

# select relevant columns from HufFPost polls

all_polls <- all_polls %>%

  dplyr::select(state, pollster, number.of.observations, mode,population,

                start.date, 

                end.date,

                clinton, trump, undecided, other)%>%

  mutate(end.date = ymd(end.date),

         start.date = ymd(start.date))

# make sure we've got nothing from the futuree

all_polls <- all_polls %>%

  filter(ymd(end.date) <= RUN_DATE)

# basic mutations

df <- all_polls %>% 

  tbl_df %>%

  rename(n = number.of.observations) %>%

  mutate(begin = ymd(start.date),

         end   = ymd(end.date),

         t = end - (1 + as.numeric(end-begin)) %/% 2) %>%

  filter(t >= start_date & !is.na(t)

         & n > 1) 

# pollster mutations

df <- df %>%

  mutate(pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),

         pollster = replace(pollster, pollster == "Fox News", "FOX"), # Fixing inconsistencies in pollster names

         pollster = replace(pollster, pollster == "WashPost", "Washington Post"),

         pollster = replace(pollster, pollster == "ABC News", "ABC"),

         undecided = ifelse(is.na(undecided), 0, undecided),

         other = ifelse(is.na(other), 0, other))

# vote shares etc

df <- df %>%

  mutate(two_party_sum = clinton + trump,

         polltype = as.integer(as.character(recode(population, 

                                                   "Likely Voters" = "0", 

                                                   "Registered Voters" = "1",

                                                   "Adults" = "2"))), 

         n_respondents = round(n),

         # clinton

         n_clinton = round(n * clinton/100),

         pct_clinton = clinton/two_party_sum,

         # trump

         n_trump = round(n * trump/100),

         pct_trump = trump/two_party_sum,

         # third-party

         n_other = round(n * other/100),

         p_other = other/100)

# state info

state_correlation <- read.csv("data/potus_results_76_16.csv") %>% 

    select(year, state, dem) %>%

    group_by(state) %>%

    mutate(dem = dem )  %>% 

    spread(state,dem) %>% 

    dplyr::select(-year)

#here("data")

state_data <- read.csv("data/potus_results_76_16.csv")

all_state_names <- unique(state_data$state)

# Numerical indices passed to Stan for states, days, weeks, pollsters

df <- df %>% 

  mutate(poll_day = t - min(t) + 1,

         # Factors are alphabetically sorted: 1 = --, 2 = AL, 3 = AK, 4 = AZ...

         index_s = as.numeric(factor(as.character(state),

                                     levels = c('--',all_state_names))),  # ensure levels are same as all 50 names in sate_correlation

         index_t = 1 + as.numeric(t) - min(as.numeric(t)),

         index_p = as.numeric(as.factor(as.character(pollster))))  

T <- as.integer(round(difftime(election_day, min(df$start.date))))

# selections

df <- df %>%

  arrange(state, t, polltype, two_party_sum) %>% 

  distinct(state, t, pollster, .keep_all = TRUE) %>%

  select(

    # poll information

    state, t, begin, end, pollster, polltype, method = mode, n_respondents, 

    # vote shares

    pct_clinton, n_clinton, 

    pct_trump, n_trump, 

    p_other, n_other, poll_day, index_s, index_p, index_t) %>%

  mutate(index_s = ifelse(index_s == 1, 52, index_s - 1)) # national index = 51

# state weights

#setwd(here("data/"))

state_elec_results <- read.csv("data/2012.csv", 

                       header = TRUE, stringsAsFactors = FALSE) %>% 

  mutate(score = obama_count / (obama_count + romney_count),

         national_score = sum(obama_count)/sum(obama_count + romney_count),

         delta = score - national_score,

         share_national_vote = (total_count)

         /sum(total_count)) %>%

  arrange(state) 

rownames(state_elec_results) <- state_elec_results$state

# get state incdices

all_states <- state_elec_results$state

state_name <- state_elec_results$state_name

names(state_name) <- state_elec_results$state

# set prior differences

prior_diff_score <- state_elec_results$delta

names(prior_diff_score) <- all_states

# set state weights

state_weights <- c(state_elec_results$share_national_vote / sum(state_elec_results$share_national_vote))

names(state_weights) <- c(state_elec_results$state)

# electoral votes, by state:

ev_state <- state_elec_results$ev

names(ev_state) <- state_elec_results$state

# read in prior

prior_in <- read_csv("data/state_priors_08_12_16.csv") %>%

  filter(date <= RUN_DATE) %>%

  group_by(state) %>%

  arrange(date) %>%

  filter(date == max(date)) %>%

  select(state,pred) %>%

  ungroup() %>%

  arrange(state)

mu_b_prior <- logit(prior_in$pred)

names(mu_b_prior) <- prior_in$state

names(mu_b_prior) == names(prior_diff_score) # correct order?

national_mu_prior <- weighted.mean(inv.logit(mu_b_prior), state_weights)

# extract predictions

predicted_score <- rstan::extract(out, pars = "predicted_score")[[1]]

# states

pct_clinton <- pblapply(1:dim(predicted_score)[3],

                    function(x){

                      # pred is mu_a + mu_b for the past, just mu_b for the future

                      temp <- predicted_score[,,x]

                      

                      # put in tibble

                      tibble(low = apply(temp,2,function(x){(quantile(x,0.025))}),

                             high = apply(temp,2,function(x){(quantile(x,0.975))}),

                             mean = apply(temp,2,function(x){(mean(x))}),

                             prob = apply(temp,2,function(x){(mean(x>0.5))}),

                             state = x) 

                      

                    }) %>% do.call('bind_rows',.)

pct_clinton$state = colnames(state_correlation)[pct_clinton$state]

pct_clinton <- pct_clinton %>%

  group_by(state) %>%

  mutate(t = row_number() + min(df$begin)) %>%

  ungroup()

# national

pct_clinton_natl <- pblapply(1:dim(predicted_score)[2],

                         function(x){

                           # each row is a day for a particular draw

                           temp <- predicted_score[x,,] %>% as.data.frame()

                           names(temp) <- colnames(state_correlation)

                           

                           # for each row, get weigted natl vote

                           tibble(natl_vote = apply(temp,MARGIN = 1,function(y){weighted.mean(y,state_weights)})) %>%

                             mutate(t = row_number() + min(df$begin)) %>%

                             mutate(draw = x)

                         }) %>% do.call('bind_rows',.)

pct_clinton_natl <- pct_clinton_natl %>%

  group_by(t) %>%

  summarise(low = quantile(natl_vote,0.025),

            high = quantile(natl_vote,0.975),

            mean = mean(natl_vote),

            prob = mean(natl_vote > 0.5)) %>%

  mutate(state = '--')

# bind state and national vote

pct_clinton <- pct_clinton %>%

  bind_rows(pct_clinton_natl) %>%

  arrange(desc(mean))

# look

ex_states <- c('IA','FL','OH','WI','MI','PA','AZ','NC','NH','TX','GA','MN')

pct_clinton %>% filter(t == RUN_DATE,state %in% c(ex_states,'--')) %>% mutate(se = (high - mean)/1.96) %>% dplyr::select(-t)

map.2016.gg <- urbnmapr::states %>%

  left_join(pct_clinton %>% filter(t == max(t)) %>%

              select(state_abbv=state,prob)) %>%

  ggplot(aes(x=long,y=lat,group=group,fill=prob)) +

  geom_polygon(col='gray40')  + 

  coord_map("albers",lat0=39, lat1=45) +

  scale_fill_gradient2(name='Democratic win probability',high='blue',low='red',mid='gray90',midpoint=0.5) +

  theme_void() + 

  theme(legend.position = 'top')

# electoral college by simulation

draws <- pblapply(1:dim(predicted_score)[3],

                  function(x){

                    # pred is mu_a + mu_b for the past, just mu_b for the future

                    pct_clinton <- predicted_score[,,x]

                    

                    pct_clinton <- pct_clinton %>%

                      as.data.frame() %>%

                      mutate(draw = row_number()) %>%

                      gather(t,pct_clinton,1:(ncol(.)-1)) %>%

                      mutate(t = as.numeric(gsub('V','',t)) + min(df$begin),

                             state = all_state_names[x]) 

                    

                    

                  }) %>% do.call('bind_rows',.)

sim_evs <- draws %>%

  left_join(state_elec_results %>% select(state,ev),by='state') %>%

  group_by(t,draw) %>%

  summarise(dem_ev = sum(ev * (pct_clinton > 0.5))) %>%

  group_by(t) %>%

  summarise(mean_dem_ev = mean(dem_ev),

            median_dem_ev = median(dem_ev),

            high_dem_ev = quantile(dem_ev,0.975),

            low_dem_ev = quantile(dem_ev,0.025),

            prob = mean(dem_ev >= 270)) %>%

  left_join(pct_clinton[pct_clinton$state != '--',] %>%

              left_join(state_elec_results %>% select(state,ev),by='state') %>%

              group_by(t) %>%

              summarise(sum_dem_ev = sum(ev * (prob > 0.5))) )

# add identifier

identifier <- paste0(Sys.Date()," || " , out@model_name)

natl_polls.2016.gg <- pct_clinton %>%

  filter(state == '--') %>%

  left_join(df %>% select(state,t,pct_clinton,method)) %>% # plot over time

  # plot

  ggplot(.,aes(x=t)) +

  geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +

  geom_hline(yintercept = 0.5) +

  geom_hline(yintercept = national_mu_prior,linetype=2) +

  geom_point(aes(y=pct_clinton,shape=method),alpha=0.3) +

  geom_line(aes(y=mean)) +

  facet_wrap(~state) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +

  scale_y_continuous(breaks=seq(0,1,0.02)) + 

  labs(subtitletitle=sprintf('clinton natl pct | mean = %s | p(win) = %s',

                             round(pct_clinton[pct_clinton$state=='--' & pct_clinton$t==election_day,]$mean*100,1),

                             round(pct_clinton[pct_clinton$state=='--' & pct_clinton$t==election_day,]$prob,2)))

natl_evs.2016.gg <-  ggplot(sim_evs, aes(x=t)) +

  geom_hline(yintercept = 270) +

  geom_line(aes(y=median_dem_ev)) +

  geom_ribbon(aes(ymin=low_dem_ev,ymax=high_dem_ev),alpha=0.2) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +

  labs(subtitletitle=sprintf('clinton evs | median = %s | p(win) = %s',

                             round(sim_evs[sim_evs$t==election_day,]$median_dem_ev),

                             round(sim_evs[sim_evs$t==election_day,]$prob,2)))

state_polls.2016.gg <- pct_clinton %>%

  filter(state %in% ex_states) %>%

  left_join(df %>% select(state,t,pct_clinton,method)) %>% 

  left_join(tibble(state = names(mu_b_prior),

                   prior = inv.logit(mu_b_prior)) ) %>%

  ggplot(.,aes(x=t,col=state)) +

  geom_ribbon(aes(ymin=low,ymax=high),col=NA,alpha=0.2) +

  geom_hline(yintercept = 0.5) +

  geom_hline(aes(yintercept = prior),linetype=2) +

  geom_point(aes(y=pct_clinton,shape=method),alpha=0.3) +

  geom_line(aes(y=mean)) +

  facet_wrap(~state) +

  theme_minimal()  +

  theme(legend.position = 'top') +

  guides(color='none') +

  scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +

  labs(subtitle='pct_clinton state')

# diff from national over time?

state_lean.2016.gg <- pct_clinton[pct_clinton$state != '--',] %>%

  left_join(pct_clinton[pct_clinton$state=='--',] %>%

              select(t,pct_clinton_national=mean), by='t') %>%

  mutate(diff=mean-pct_clinton_national) %>%

  group_by(state) %>%

  mutate(last_prob = last(prob)) %>%

  filter(state %in% ex_states) %>%

  ggplot(.,aes(x=t,y=diff,col=state)) +

  geom_hline(yintercept=0.0) +

  geom_line() +

  geom_label_repel(data = . %>% 

                     filter(t==max(t),

                            state %in% ex_states),

                   aes(label=state)) +

  theme_minimal()  +

  theme(legend.position = 'none') +

  scale_x_date(limits=c(ymd('2016-03-01','2016-11-08')),date_breaks='1 month',date_labels='%b') +

  scale_y_continuous(breaks=seq(-1,1,0.01)) +

  labs(subtitle = identifier)

# brier scores

# https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades

compare <- pct_clinton %>% 

  filter(t==max(t),state!='--') %>% 

  select(state,clinton_win=prob) %>% 

  mutate(clinton_win_actual = ifelse(state %in% c('CA','NV','OR','WA','CO','NM','MN','IL','VA','DC','MD','DE','NJ','CT','RI','MA','NH','VT','NY','HI','ME'),1,0),

         diff = (clinton_win_actual - clinton_win )^2) %>% left_join(enframe(ev_state) %>% set_names(.,c('state','ev'))) %>% 

  mutate(ev_weight = ev/(sum(ev))) 

briers.2016 <- tibble(outlet = c('538 polls-plus','538 polls-only','princeton','nyt upshot','kremp/slate','pollsavvy','predictwise markets','predictwise overall','desart and holbrook','daily kos','huffpost'),

       ev_wtd_brier = c(0.0928,0.0936,0.1169,0.1208,0.121,0.1219,0.1272,0.1276,0.1279,0.1439,0.1505),

       unwtd_brier = c(0.0664,0.0672,0.0744,0.0801,0.0766,0.0794,0.0767,0.0783,0.0825,0.0864,0.0892),

       states_correct = c(46,46,47,46,46,46,46,46,44,46,46)) %>% 

  bind_rows(tibble(outlet='economist (backtest)',

                   ev_wtd_brier = weighted.mean(compare$diff, compare$ev_weight),

                   unwtd_brier = mean(compare$diff),

                   states_correct=sum(round(compare$clinton_win) == round(compare$clinton_win_actual)))) %>%

  arrange(ev_wtd_brier) 

# compare model estiamte v actual results

model_v_actual.2016 <- pct_clinton %>% 

  filter(t==max(t),state!='--') %>% 

  left_join(politicaldata::pres_results %>% filter(year == 2016) %>%

              mutate(actual = dem/(dem+rep)) %>%

              dplyr::select(state,actual)) %>%

  filter(state != 'DC') 

model_v_actual.2016.rmse <- sqrt(mean(c((model_v_actual.2016$mean - model_v_actual.2016$actual)^2)))

model_v_actual.2016.gg <- ggplot(model_v_actual.2016,aes(x=mean,y=actual,label=state)) +

  geom_abline() + 

  geom_text() + 

  stat_smooth(method='lm') +

  theme_minimal()

# model vs final polls vs prior

predictcompare.2016.gg <- pct_clinton %>%

  filter(t == max(t),state %in% ex_states) %>%

  mutate(se = (high - mean)/1.68) %>%

  select(state,model_mean=mean,model_se=se) %>%

  left_join(df %>%

              filter(t > (max(t)-14),

                     state %in% ex_states) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_clinton,n_respondents))) %>%

  left_join(enframe(mu_b_prior,'state','prior') %>%

              mutate(prior = inv.logit(prior))) %>%

  left_join(model_v_actual.2016 %>% dplyr::select(state,actual)) %>%

  ggplot(.,aes(y=state)) +

  geom_point(aes(x=poll,col='poll')) +

  geom_point(aes(x=model_mean,col='model')) +

  geom_point(aes(x=prior,col='prior')) +

  geom_point(aes(x=actual,col='result')) +

  theme_minimal()

# model better than polls?

 pct_clinton %>%

  filter(t == max(t),state %in% ex_states) %>%

  mutate(se = (high - mean)/1.68) %>%

  select(state,model_mean=mean,model_se=se) %>%

  left_join(df %>%

              filter(t > (max(t)-14),

                     state %in% ex_states) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_clinton,n_respondents))) %>%

  left_join(enframe(mu_b_prior,'state','prior') %>%

              mutate(prior = inv.logit(prior))) %>%

  left_join(model_v_actual.2016 %>% dplyr::select(state,actual)) %>% 

   summarise(poll_error = mean(abs(poll-actual)),

             model_error = mean(abs(model_mean-actual)))

# expected EVs roughly match the stochastic simulations?

final_states <- enframe(mu_b_prior,'state','prior') %>%

  left_join(df %>%

              filter(t > (max(t)-14),) %>%

              group_by(state) %>%

              summarise(poll = weighted.mean(pct_clinton,n_respondents))) %>%

  mutate(avg_poll_less_prior = mean(poll - inv.logit(prior), na.rm=T)) %>%

  mutate(poll = ifelse(is.na(poll),

                       inv.logit(prior) + avg_poll_less_prior,

                       poll)) %>%

  select(state,poll)

# final EV distribution

final_evs <- draws %>%

  left_join(state_elec_results %>% select(state,ev),by='state') %>%

  filter(t==max(t)) %>%

  group_by(draw) %>%

  summarise(dem_ev = sum(ev* (pct_clinton > 0.5)))

final_evs.2016.gg <-  ggplot(final_evs,aes(x=dem_ev,

                       fill=ifelse(dem_ev>=270,'Democratic','Republican'))) +

    geom_vline(xintercept = 270) +

    geom_histogram(binwidth=1) +

    theme_minimal() +

    theme(legend.position = 'top',

          panel.grid.minor = element_blank()) +

    scale_fill_manual(name='Electoral College winner',values=c('Democratic'='#3A4EB1','Republican'='#E40A04')) +

    labs(x='Democratic electoral college votes',title='2016, Potential Electoral College outcomes',

         subtitle=sprintf("p(dem win) = %s | full stan model",round(mean(final_evs$dem_ev>=270),3)))

# get the p value of the actual results

p_values_2016 <- draws %>% 

  filter(state != '--',

         t == election_day) %>% 

  left_join(politicaldata::pres_results %>% filter(year == 2016) %>%

              mutate(actual = dem/(dem+rep)) %>%

              dplyr::select(state,actual)) %>%

  group_by(state) %>%

  summarise(high = quantile(pct_clinton, 0.975),

            low = quantile(pct_clinton, 0.025),

            mean = mean(pct_clinton),

            prob = mean(pct_clinton > 0.5),

            actual = unique(actual),

            p_value = 

           ( 2*sum(pct_clinton < actual) + 1 ) /

           ( 2 * n() + 2) #,

           # p_value_2 = mean(ifelse(actual >= mean(pct_clinton),

           #                         pct_clinton >= actual,

           #                         pct_clinton <= actual))

           ) %>%

  arrange(p_value) %>% 

  mutate(year = 2016,

         outside_ci = ifelse(actual > high | actual < low, TRUE,FALSE))

```

#### Map

```{r echo=F,message=F,warning=F}

map.2016.gg

```

#### Final electoral college histogram

```{r echo=F,message=F,warning=F}

final_evs.2016.gg

```

#### National and state polls and the electoral college over time

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .62}

grid.arrange(natl_polls.2016.gg, natl_evs.2016.gg, state_polls.2016.gg, 

             layout_matrix = rbind(c(1,1,3,3,3),

                                   c(2,2,3,3,3)),

             top = identifier

)

```

#### State vs national deltas over time

```{r echo=F,message=F,warning=F}

state_lean.2016.gg

```

#### Model results vs polls vs the prior

```{r echo=F,message=F,warning=F}

predictcompare.2016.gg

```

#### Performance

````{r echo=F,message=F,warning=F}

knitr::kable(briers.2016)

model_v_actual.2016.gg

model_v_actual.2016.rmse

```

#### Predictions for each state

```{r echo=F,message=F,warning=F}

pct_clinton %>% 

  filter(t == max(t)) %>% 

  mutate(se = (high - mean)/1.68) %>%

  mutate(high = round(high,3),

         low = round(low,3),

         mean = round(mean,3),

         prob = round(prob,3),

         se = round(se,3)) %>%

  select(state,mean,low,high,prob,se) %>%

  arrange(abs(mean-0.5)) %>%

  knitr::kable()

            

```

## Cumulative charts

### Probability calibration plot

```{r echo=F,message=F,warning=F}

calibration_data <- model_v_actual.2016 %>% mutate(year = 2016) %>%

  bind_rows(model_v_actual.2012 %>% mutate(year = 2012)) %>%

  bind_rows(model_v_actual.2008 %>% mutate(year = 2008))

calibration.gg <- ggplot(calibration_data, aes(x=prob,y=ifelse(actual>0.5,1,0),

                                               label = paste0(state,'-',year))) + 

  geom_point() + 

  geom_text_repel(data = calibration_data %>%

                    filter(prob < 0.6 & actual>0.5 | 

                                    prob > 0.4 & actual<0.5 )) + 

  geom_smooth() +

  geom_abline()  +

  scale_x_continuous(breaks=seq(0,1,0.1)) +

  scale_y_continuous(breaks=seq(0,1,0.1)) +

  theme_minimal() + 

  theme(panel.grid.minor = element_blank()) +

  labs(x='Predicted Democratic win probability on election-day',

       y='Result (1 = Dem win, 0 = Rep)')

print(calibration.gg)

```

### Confidence interval coverage

```{r echo=F, message=FALSE, warning=FALSE}

calibration_data <-  p_values_2016 %>% mutate(year = 2016) %>%

        bind_rows(p_values_2012 %>% mutate(year = 2012) ) %>%

        bind_rows(p_values_2008 %>% mutate(year = 2008)) %>%

  filter(state != 'DC')

# sum(calibration_data$outside_ci) / nrow(calibration_data)

p_value_hist.gg <- ggplot(calibration_data, aes(x=p_value)) +

  geom_histogram(binwidth = 0.01,aes(group=paste0(state,year))) +

  geom_vline(xintercept = 0.025,col='red',linetype=2) +

  geom_vline(xintercept = 0.975,col='red',linetype=2) +

  scale_x_continuous(breaks=seq(0,1,0.05))  +

  scale_y_continuous(breaks=seq(0,100,2))  +

  theme_minimal() +

  theme(panel.grid.minor = element_blank())

preds_p_value_scatter.gg <- ggplot(calibration_data, aes(x=mean*100,y=actual*100)) +

  geom_abline() +

  # states inside the ci

  geom_point(data =. %>% filter(!outside_ci),col='black') +

  # states outside

  geom_point(data = . %>% filter(outside_ci),col='red') +

  geom_text_repel(data = . %>% filter(outside_ci),

              aes(label=state),col='red',min.segment.length = 0) +

  geom_segment(data = . %>% filter(outside_ci),

               aes(x=mean*100,xend=mean*100,y=low*100,yend=high*100),col='red') +

  facet_wrap(~year,ncol=3) +

  theme_minimal() +

  theme(panel.grid.minor = element_blank()) +

  labs(

    # title='Model with normal(0, sigma) distributions for polling error and state priors',

    subtitle='Results in highlighted states fall outside our uncertainty intervals',

    x='Predicted Democratic share of the two-party vote',

    y='Actual Democratic share'

    ) +

  coord_cartesian(xlim=c(25,75),ylim=c(25,75))

```

```{r echo=F,message=F,warning=F,fig.width = 9, fig.asp = .42}

print(preds_p_value_scatter.gg)

```

```{r echo=F,message=F,warning=F}

print(p_value_hist.gg)

```

# Licence

This software is published by _[The Economist](https://www.economist.com)_ under the [MIT licence](https://opensource.org/licenses/MIT). The data generated by _The Economist_ are available under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

The licences include only the data and the software authored by _The Economist_, and do not cover any _Economist_ content or third-party data or content made available using the software. More information about licensing, syndication and the copyright of _Economist_ content can be found [here](https://www.economist.com/rights/).