https://github.com/patrickm663/bnn-sparse-mortality
Some code applying BNNs to sparse mortality data
- Host: GitHub
- URL: https://github.com/patrickm663/bnn-sparse-mortality
- Owner: patrickm663
- License: mit
- Created: 2024-09-18T08:07:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-28T09:22:55.000Z (over 1 year ago)
- Last Synced: 2024-10-28T12:12:46.436Z (over 1 year ago)
- Language: Julia
- Size: 126 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# BNNs for Sparse Mortality
# Methodology
## Aim
The aim of the experiments is to benchmark Bayesian neural networks (BNNs) against networks trained by traditional means (e.g. stochastic gradient descent). We maintain a 'classical' model as a further benchmark.
## Hypothesis
We hypothesise that Bayesian neural networks will have an advantage over their traditional counterparts on smaller datasets. However, the inference process will be too onerous for larger datasets, making the approach infeasible. As a follow-on, what can be done to improve run-time? It can be argued that, in practice, the model is only expected to be retrained yearly as new data come through. Even so, a very long training time makes it difficult to tune the model, since we must choose both the model architecture (e.g. how many layers to use) and the inference set-up (e.g. which priors to use), and either choice may have a material impact on results.
## Approach
### Data
For consistency, we fix the dataset as the USA 1x1 mortality table for males and females from the Human Mortality Database (mortality.org). The full training set comprises male and female mortality rates from the years 1950-2000, for ages 0-100. This amounts to 2 × 51 × 101 = 10 302 samples. The testing set comprises the years 2001-2021. This is referred to in the literature as 'single-population' modelling. The alternative is multi-population modelling, where data from multiple countries are combined into a single dataset (around 54 countries in the case of Korn et al.).
To make comparisons with other research easier, it may be worth adjusting the data to match Korn et al. by introducing a validation set covering the years 1997-2006 for ages 60-89. This set is used for all in-sample results reported in Korn et al. The models themselves were trained on all available data before 2007 for 54 populations. In the case of a country like Sweden, this would date back to 1751 on HMD! The out-of-sample results are based on the years 2007-2017, for ages 60-89. Notably, log-mortality for ages 50+ exhibits a linear trend, suggesting it is the 'easiest' portion of the data to predict correctly.
The feature set comprises year (t), age (x), and an indicator variable for gender (g); the response variable is the log-transformed mortality rate, $\log(\mu)$. Year and age are standardised separately using a Z-transformation calibrated on the training set, and the same calibration parameters are applied to the test set (the age range is unchanged, but the test years fall beyond the upper end of the standardised training range).
$$\log(\mu) = f(t, x, g)$$
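As a minimal sketch of the standardisation step (not the repository's own code, which is written in Julia), the Z-transformation can be calibrated on the training years and ages and then reused unchanged on the test years. The function names `fit_z` and `apply_z` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def fit_z(values):
    """Return the (mean, standard deviation) used to standardise a feature."""
    return mean(values), pstdev(values)  # population sd: an assumption

def apply_z(values, params):
    """Standardise values with previously fitted parameters."""
    mu, sd = params
    return [(v - mu) / sd for v in values]

train_years = list(range(1950, 2001))  # 1950-2000 inclusive
test_years = list(range(2001, 2022))   # 2001-2021 inclusive
ages = list(range(0, 101))             # 0-100 inclusive

# Calibrate on the training set only.
year_params = fit_z(train_years)
age_params = fit_z(ages)

z_train_years = apply_z(train_years, year_params)
z_test_years = apply_z(test_years, year_params)  # same parameters, no refit
z_ages = apply_z(ages, age_params)
```

Because the parameters are not refitted, every standardised test year lies above the largest standardised training year, so forecasting is an extrapolation in the year input.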
For the first experiment, we test performance on smaller subsets of the training data. Each subset represents the kind of sparse dataset we may encounter when data volumes are very low. Note, however, that subsetting does not add further noise to the data, which we would expect to see when observed deaths are abnormal relative to the available exposure. Since HMD provides Exposure and Deaths in a similar format to Mortality, one idea is to generate the sparse data from these subcomponents, which could then introduce noise as well.
The subsets are as follows: 0.5%, 1%, 5%, 10%, 25%, 50%, 100%. Subsets are drawn randomly (with a constant seed) from the training data by keeping each observation according to a Bernoulli distribution with p equal to the subset size. Each subset is then used to train the models. The Lee-Carter model only handles data on a regular grid unless methods other than the standard SVD approach are applied. For that reason, it may be worth fitting it on the full data only and treating it as a 'theoretical' minimum, to show whether the other models can outperform Lee-Carter with only a fraction of the data.
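The Bernoulli subsetting could be sketched as follows. This is an illustration under assumptions: the seed value 42 and the helper name `bernoulli_subset` are hypothetical, and the rows here hold only the features, not the mortality rates.

```python
import random

def bernoulli_subset(rows, p, seed=42):
    """Keep each row independently with probability p (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]

# One (year, age, gender) row per training observation: 2 x 51 x 101 rows.
rows = [(year, age, gender)
        for gender in (0, 1)
        for year in range(1950, 2001)
        for age in range(0, 101)]

subset_sizes = [0.005, 0.01, 0.05, 0.10, 0.25, 0.50, 1.00]
subsets = {p: bernoulli_subset(rows, p) for p in subset_sizes}

for p, subset in subsets.items():
    print(f"p = {p:.3f}: {len(subset)} rows")
```

Two properties of this sketch are worth noting. The realised subset size is itself random, so it only approximates p times the number of rows. And because the same seed is reused for every p, the subsets are nested: any row kept at a smaller p is also kept at every larger p.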
### Models
In all cases, we take a given architecture (e.g. a feed-forward neural network with a (3 => 8 => 8 => 8 => 1) architecture and swish hidden activation functions), and train one version using stochastic gradient descent (e.g. Adam) and the other using MCMC (e.g. NUTS). Holding the architecture fixed keeps the comparison fair. Note that the NN architecture is subject to change and is likely not the best choice but, anecdotally, a 'square' architecture seems to generalise well.
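To make the example architecture concrete, here is a small pure-Python sketch of a (3 => 8 => 8 => 8 => 1) feed-forward pass with swish hidden activations, together with its parameter count. It is not the repository's implementation; the standard-normal weight initialisation and all function names are assumptions made for illustration.

```python
import math
import random

LAYER_SIZES = [3, 8, 8, 8, 1]  # inputs (t, x, g) -> three hidden layers -> log-mortality

def swish(z):
    """swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def n_parameters(sizes):
    """Weights plus biases over all dense layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def init_params(sizes, rng):
    """Illustrative initialisation: N(0, 1) weights, zero biases (an assumption)."""
    return [([[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(params, x):
    """Dense layers with swish on hidden layers and a linear output."""
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [swish(z) for z in x]
    return x[0]

params = init_params(LAYER_SIZES, random.Random(0))
prediction = forward(params, [0.5, -1.0, 1.0])  # standardised year, age, gender flag
print(n_parameters(LAYER_SIZES))  # 185
```

The count of 185 weights and biases is the dimension of the parameter vector that MCMC has to sample for this architecture, before any variance term is added.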
The following models are being compared:
- Feed-forward neural network. Forecasts are generated by extending the year range as input
- Recurrent neural network. Forecasts are generated by forecasting one year ahead and applying recursion to produce forecasts for later years (i.e. the output one-step-ahead becomes part of the input for the next-step-ahead). This is consistent with Korn et al.
- Lee-Carter. Forecasts are generated by fitting an AR(1) model to the Kappa(t) terms and reapplying the Lee-Carter formula. It has been shown in the literature that forecasting Kappa(t) could be done by a neural network instead. In addition, since Lee-Carter cannot take gender as an input variable, the model would need to be calibrated for males and females separately.
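The recursion described for the recurrent network can be sketched independently of any particular model. In the snippet below, `one_step` stands in for a trained one-year-ahead predictor; the linear extrapolation used here is only a placeholder, not an RNN, and the input values are made up.

```python
def recursive_forecast(one_step, history, horizon):
    """Feed each one-step-ahead forecast back in as input for the next step."""
    window = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = one_step(window)
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward by one year
    return forecasts

def linear_extrapolation(window):
    """Placeholder predictor: continue the last observed one-year change."""
    return window[-1] + (window[-1] - window[-2])

# Hypothetical log-mortality values for one age over three consecutive years.
history = [-4.0, -4.1, -4.2]
forecasts = recursive_forecast(linear_extrapolation, history, horizon=3)
```

With this scheme, any error in an early forecast is carried into every later one, which is one reason to inspect how performance changes with the forecast horizon.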
The RNN, however, still needs to be trained. It remains to be seen how it handles 'gaps' in the dataset, since it may expect regularly spaced data, as Lee-Carter does. This may require further discussion.
All BNNs are trained using the NUTS algorithm with a target acceptance rate of 0.9. A high target is chosen because it makes the sampler take smaller steps, which should yield better-quality samples and a more accurate model, and NUTS is a state-of-the-art HMC variant. A downside is run-time. Since the number of samples required appears to scale with the size of the data, the following step-up schedule is proposed: 0.5% = 2 500, 1% = 5 000, 5% = 7 500, 10% = 10 000, 25%+ = 15 000. Beyond 25%, however, the gains from using a BNN will likely be heavily outweighed by its run-time, so fitting a traditional NN and creating prediction intervals may be better suited. For larger data, this could call for a separate discussion of alternative samplers such as SGHMC or SGLD, or of variational inference, in place of NUTS. As a further alternative, a 'pseudo-BNN' could be fitted by taking a pretrained NN and making only some of its parameters trainable using Bayesian methods. For example, the first layer could be 'Bayesian' and the rest fixed, so that randomness propagates from the first layer onwards.
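The step-up schedule can be written down as a simple lookup. The snippet below also prints the expected number of training rows per subset (the subset fraction times 2 × 51 × 101) and the resulting ratio of samples to rows; the names `N_TRAIN` and `NUTS_SAMPLES` are hypothetical.

```python
N_TRAIN = 2 * 51 * 101  # full training set size

# Subset fraction -> number of NUTS samples, as proposed in the text.
NUTS_SAMPLES = {
    0.005: 2_500,
    0.01: 5_000,
    0.05: 7_500,
    0.10: 10_000,
    0.25: 15_000,
    0.50: 15_000,
    1.00: 15_000,
}

for fraction, n_samples in NUTS_SAMPLES.items():
    expected_rows = fraction * N_TRAIN
    print(f"{fraction:>6.1%}  rows ~ {expected_rows:7.0f}  "
          f"samples = {n_samples:6d}  samples/row = {n_samples / expected_rows:5.1f}")
```

The schedule grows much more slowly than the data: the number of samples per training row falls steadily as the subset size increases.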
### Prediction Intervals
Once trained, the Bayesian models have posterior samples for each parameter. We first calculate the MAP estimate by finding the sample with the maximum log-posterior over all samples. This serves as the primary comparison to the traditional NN. Then, we generate 100 000 samples from the posterior distribution of the parameters and reconstruct an NN per sample (i.e. 100 000 NNs). For each NN, we generate an output for the given data (in-sample and out-of-sample), resulting in 100 000 outputs per observation. We then take the mean, median, and 5% and 95% quantiles of these outputs. The mean and median (along with the MAP) are compared to the traditional NN on the error measures outlined below. The 5-95% prediction interval is used to calculate the PI measures outlined below, and for plotting purposes. Interestingly, this step is very quick and is not a bottleneck.
Note: when fitting the BNN, we also estimate the variance term, since the likelihood is assumed to be Normally distributed around the model's output. We can possibly combine the two variances (as done below) to produce a total variance, which should widen the prediction intervals. As a proxy for applying total variance, we could take the MAP estimate of the variance term and add noise with that variance to each output generated above. The MAP of the variance term appears to be about 0.05^2 and is fairly consistent across subset sizes. The current experimental set-up treats the variance of the likelihood as a constant. This could be adapted if we want to, e.g., increase the variance with age.
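The summarisation step can be sketched for a single observation as follows. To stay self-contained, the snippet uses a two-parameter linear function as a stand-in for the NN, made-up posterior draws, and 1 000 draws rather than 100 000; `map_index` and `summarise` are hypothetical helper names.

```python
import random
from statistics import mean, median, quantiles

def map_index(log_posteriors):
    """Index of the draw with the highest log-posterior."""
    return max(range(len(log_posteriors)), key=log_posteriors.__getitem__)

def summarise(outputs):
    """Mean, median and 5%/95% quantiles of one observation's outputs across draws."""
    cuts = quantiles(outputs, n=20, method="inclusive")  # cut points at 5%, 10%, ..., 95%
    return {"mean": mean(outputs), "median": median(outputs),
            "q05": cuts[0], "q95": cuts[-1]}

def model(theta, x):
    """Stand-in for a reconstructed NN: a line with parameters (a, b)."""
    a, b = theta
    return a + b * x

# Made-up posterior draws and matching (unnormalised) log-posterior values.
rng = random.Random(0)
draws = [(rng.gauss(-4.0, 0.1), rng.gauss(0.5, 0.05)) for _ in range(1_000)]
log_post = [-((a + 4.0) ** 2) / 0.02 - ((b - 0.5) ** 2) / 0.005 for a, b in draws]

x_new = 1.2
outputs = [model(theta, x_new) for theta in draws]  # one output per draw
summary = summarise(outputs)
map_prediction = model(draws[map_index(log_post)], x_new)
```

In the full procedure the same summary would be computed for every observation, giving one mean, median and interval per (year, age, gender) combination.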
The traditional NNs get their prediction intervals using the bootstrap method to estimate model uncertainty and the mean-variance estimator (MVE) technique to estimate noise variance. MVE uses a secondary FNN trained on the residuals via a particular cost function, which also takes the original NN as input. Together, the two components give the total variance. The prediction interval is then taken from a Normal distribution whose mean is the original NN's output and whose variance is this total variance. An alternative is to use MCMC to estimate the model uncertainty around a fixed NN. This could be viewed as a pseudo-BNN since only the variance around the outputs is estimated.
The Lee-Carter prediction intervals are generated in the forecasting step via the AR(1) model. Forecast paths are drawn repeatedly, the Lee-Carter model is reconstructed for each draw, and the prediction interval is taken from the resulting sample. A further noise variance can also be obtained from the residuals on the in-sample data. In the literature this is generally a constant and is usually omitted when analysing in-sample results (since they are done on an expectation basis), but we can include it in our analysis for consistency.
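For a single observation, combining the two variance sources into a Normal prediction interval could look like the sketch below. It assumes the two variances are simply added, uses the sample variance of the bootstrap predictions as the model variance, and takes the noise variance as a given number rather than the output of a trained MVE network. All input values are made up.

```python
from statistics import NormalDist, variance

def total_variance_interval(nn_prediction, bootstrap_preds, noise_variance,
                            lower=0.05, upper=0.95):
    """Normal interval centred on the original NN output with total variance."""
    model_variance = variance(bootstrap_preds)  # spread across bootstrap refits
    total_sd = (model_variance + noise_variance) ** 0.5
    dist = NormalDist(nn_prediction, total_sd)
    return dist.inv_cdf(lower), dist.inv_cdf(upper)

# Hypothetical log-mortality predictions for one (year, age, gender) observation.
nn_prediction = -4.00
bootstrap_preds = [-4.02, -3.98, -4.05, -3.95, -4.00]
noise_variance = 0.05 ** 2  # illustrative value only

model_only = total_variance_interval(nn_prediction, bootstrap_preds, 0.0)
with_noise = total_variance_interval(nn_prediction, bootstrap_preds, noise_variance)
```

Comparing `model_only` with `with_noise` shows how much of the interval width comes from the noise term rather than from uncertainty in the fitted network.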
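A rough sketch of the simulation-based interval for one age is given below. It assumes the usual Lee-Carter form in which the log-rate at age x is a_x + b_x × kappa_t, an AR(1) with intercept fitted by least squares, and made-up values for kappa, a_x and b_x. It leaves out the SVD fit and the additional noise variance.

```python
import random
from statistics import quantiles

def fit_ar1(kappa):
    """Least-squares fit of kappa[t] = c + phi * kappa[t-1] + noise."""
    x, y = kappa[:-1], kappa[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    residuals = [b - (c + phi * a) for a, b in zip(x, y)]
    sigma = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return c, phi, sigma

def simulate_kappa(kappa_last, c, phi, sigma, horizon, n_paths, seed=0):
    """Draw future kappa paths by iterating the fitted AR(1) with Normal shocks."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        k, path = kappa_last, []
        for _ in range(horizon):
            k = c + phi * k + rng.gauss(0.0, sigma)
            path.append(k)
        paths.append(path)
    return paths

def lee_carter_interval(a_x, b_x, paths, step):
    """5%-95% interval of the log-rate a_x + b_x * kappa at a given forecast step."""
    log_rates = [a_x + b_x * path[step] for path in paths]
    cuts = quantiles(log_rates, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Illustrative inputs only: not fitted to HMD data.
kappa = [10.0, 8.7, 7.9, 6.1, 5.2, 3.8, 3.1, 1.6, 0.4, -0.9]
c, phi, sigma = fit_ar1(kappa)
paths = simulate_kappa(kappa[-1], c, phi, sigma, horizon=21, n_paths=2_000)
first_year = lee_carter_interval(a_x=-4.0, b_x=0.02, paths=paths, step=0)
last_year = lee_carter_interval(a_x=-4.0, b_x=0.02, paths=paths, step=20)
```

In this sketch the interval for the last forecast year is wider than for the first, because shocks accumulate along each simulated path.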
### Performance
Models are evaluated in-sample and out-of-sample. For in-sample data, the full training data is used to evaluate how well the model interpolates.
The following metrics are used:
- Error measures:
  - MSE
  - RMSE
- Training run-time (mainly for MCMC)
- Prediction intervals (PI):
  - PICP (PI coverage probability)
  - MPIW (mean PI width)
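The error and interval measures listed above can be computed as in this short sketch, using their standard definitions. The numbers are toy values chosen so the results are easy to check by hand.

```python
def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error."""
    return mse(y, y_hat) ** 0.5

def picp(y, lower, upper):
    """Share of observations that fall inside their prediction interval."""
    inside = sum(1 for a, lo, hi in zip(y, lower, upper) if lo <= a <= hi)
    return inside / len(y)

def mpiw(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 6.0]
lower = [0.0, 1.5, 3.5, 3.0]
upper = [2.0, 2.5, 4.5, 5.0]

print(mse(y, y_hat))          # 1.0
print(rmse(y, y_hat))         # 1.0
print(picp(y, lower, upper))  # 0.75 (the third observation falls outside)
print(mpiw(lower, upper))     # 1.5
```

For a 5-95% interval, a PICP close to 0.90 is the natural target, and PICP and MPIW are best read together: a wide interval achieves high coverage trivially.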
Further measures worth analysing for consistency with Korn et al. are the MAE, MAPE, MdAPE (median absolute percentage error), and mean Poisson deviance. The latter requires the corresponding death counts.
The overall results will be examined across subset sizes, comparing in-sample vs out-of-sample performance and the approach to training (traditional vs BNN). It is expected that, for large subsets of the data, run-time will become a bottleneck that makes a BNN infeasible. Also, with more data available, the benefit of the BNN diminishes further.
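These additional measures could be computed along the following lines. The sketch uses the standard definitions and assumes the expected deaths are exposure times the predicted mortality rate. Whether Korn et al. apply the percentage errors to rates or to log-rates is not stated here, so the inputs are kept generic, and all values are made up.

```python
import math
from statistics import mean, median

def mae(y, y_hat):
    """Mean absolute error."""
    return mean([abs(a - b) for a, b in zip(y, y_hat)])

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return 100.0 * mean([abs((a - b) / a) for a, b in zip(y, y_hat)])

def mdape(y, y_hat):
    """Median absolute percentage error, in percent."""
    return 100.0 * median([abs((a - b) / a) for a, b in zip(y, y_hat)])

def mean_poisson_deviance(deaths, exposure, mu_hat):
    """Mean of 2 * (d * log(d / d_hat) - (d - d_hat)), with d_hat = exposure * mu_hat."""
    total = 0.0
    for d, e, m in zip(deaths, exposure, mu_hat):
        d_hat = e * m
        log_term = d * math.log(d / d_hat) if d > 0 else 0.0  # limit as d -> 0 is 0
        total += 2.0 * (log_term - (d - d_hat))
    return total / len(deaths)

# Toy inputs.
y = [100.0, 200.0]
y_hat = [110.0, 180.0]
deaths = [0.0, 10.0]
exposure = [100.0, 1000.0]
mu_hat = [0.01, 0.01]

errors = (mae(y, y_hat), mape(y, y_hat), mdape(y, y_hat))
deviance = mean_poisson_deviance(deaths, exposure, mu_hat)
```

The deviance is zero when predicted and observed deaths agree exactly, and the guard for zero deaths matters on sparse data, where many cells may record no deaths at all.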