{"id":18444897,"url":"https://github.com/spsanderson/healthyverse_tsa","last_synced_at":"2025-04-07T23:33:37.072Z","repository":{"id":39117269,"uuid":"432735245","full_name":"spsanderson/healthyverse_tsa","owner":"spsanderson","description":"A Time Series Analysis of the healthyverse R pacakges","archived":false,"fork":false,"pushed_at":"2024-09-16T13:46:28.000Z","size":207725,"stargazers_count":5,"open_issues_count":1,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-09-16T16:11:25.813Z","etag":null,"topics":["ai","data-analysis","data-analytics","data-science","forecast-on-demand","forecasting","forecasting-models","ml","r","time-series","time-series-analysis"],"latest_commit_sha":null,"homepage":"https://www.spsanderson.com/healthyverse_tsa","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spsanderson.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-28T14:29:54.000Z","updated_at":"2024-09-16T13:46:33.000Z","dependencies_parsed_at":"2024-01-15T08:53:40.721Z","dependency_job_id":"b1b349bc-b1a0-4c0e-81b7-39b142d67ffb","html_url":"https://github.com/spsanderson/healthyverse_tsa","commit_stats":{"total_commits":425,"total_committers":1,"mean_commits":425.0,"dds":0.0,"last_synced_commit":"3a5a23a005f9955067a8affea7aebb55fa7106b3"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spsanderson%2Fhealthyverse_tsa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spsanderson%2Fhealthyverse_tsa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spsanderson%2Fhealthyverse_tsa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spsanderson%2Fhealthyverse_tsa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spsanderson","download_url":"https://codeload.github.com/spsanderson/healthyverse_tsa/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223297239,"owners_count":17121977,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-analysis","data-analytics","data-science","forecast-on-demand","forecasting","forecasting-models","ml","r","time-series","time-series-analysis"],"created_at":"2024-11-06T07:04:10.289Z","updated_at":"2025-04-07T23:33:37.064Z","avatar_url":"https://github.com/spsanderson.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"Time Series Analysis and Nested Modeling of the Healthyverse Packages\"\noutput: github_document\nalways_allow_html: true\nauthor: \"Steven P. Sanderson II, MPH - Date: \"\ndate: \"`r format(Sys.time(), '%d %B, %Y')`\"\n---\n\n```{r setup, include=FALSE}\nknitr::opts_chunk$set(\n    echo = TRUE,\n    message = FALSE,\n    warning = FALSE,\n    fig.path = \"man/figures/README-\",\n    fig.width = 12,\n    fig.height = 10\n)\nsource(\"00_scripts/load_libraries.R\")\nsource(\"00_scripts/get_data_functions.R\")\nsource(\"00_scripts/helper_functions.R\")\nsource(\"00_scripts/data_manipulation_functions.R\")\nsource(\"00_scripts/ts_decomp.R\")\n\nn_cores = 7\n```\n\nThis analysis follows a _Nested Modeltime Workflow_.\n\n## Get Data\n\n```{r get_data, echo=FALSE}\nget_cran_data()\nget_package_release_data()\ncsv_to_rds()\ndownloads_tbl \u003c- downloads_processed_tbl()\npkg_tbl \u003c- readRDS(\"01_data/pkg_release_tbl.rds\") %\u003e%\n    mutate(date = as.Date(date))\n```\n\n```{r glimpse_data}\nglimpse(downloads_tbl)\n```\n\nThe last day in the data set is `r max_cran_datetime()`, the file was birthed on:\n`r dl_birth_datetime()`, and at report knit time is `r hours_since_cran_log_update()` \nhours old. `r update_log_message()`\n\nNow that we have our data lets take a look at it using the `skimr` package.\n\n```{r skim_data}\nskim(downloads_tbl)\n```\n\nWe can see that the following columns are missing a lot of data and for us are most\nlikely not useful anyways, so we will drop them `c(r_version, r_arch, r_os)`\n\n```{r data_trimmed, echo=FALSE}\ndata_tbl \u003c- downloads_tbl %\u003e%\n    select(-r_version, -r_arch, -r_os)\n```\n\n## Plots\n\nNow lets take a look at a time-series plot of the total daily downloads by package.\nWe will use a log scale and place a vertical line at each version release for each\npackage.\n\n```{r initial_ts_plot, echo=FALSE}\nmd \u003c- ts_downloads_tbl(data_tbl, \"day\", package) %\u003e% \n  ungroup() %\u003e% \n  select(date) %\u003e% \n  distinct() %\u003e% \n  filter(date == max(date)) %\u003e%\n  pull(date)\n\nlast_values \u003c- ts_downloads_tbl(data_tbl, \"day\", package) %\u003e% \n  ungroup() %\u003e%\n  filter(date == md)\n\nts_downloads_tbl(\n    .data = data_tbl,\n    .by_time = \"day\",\n    package\n) %\u003e%\n  #filter(date \u003e= subtract_time(md, \"1 year\")) %\u003e%\n  ggplot(aes(date, log1p(value))) +\n  theme_bw() +\n  geom_point(aes(group = package, color = package), size = 1) +\n  ggtitle(paste(\"Package Downloads: {healthyverse}\")) +\n  geom_smooth(method = \"loess\", color = \"black\",  se = FALSE) +\n  geom_vline(\n    data = pkg_tbl\n    , aes(xintercept = as.numeric(date))\n    , color = \"red\"\n    , lwd = 1\n    , lty = \"solid\"\n  ) +\n  geom_point(\n    shape = 21, size = 5, color = \"red\",\n    data  = last_values,\n    mapping = aes(x = date, y = log1p(value))\n  ) +\n  facet_wrap(package ~., ncol = 2, scales = \"free_x\") +\n  theme_minimal() +\n  labs(\n    subtitle = \"Vertical lines represent release dates\",\n    x = \"Date\",\n    y = \"log1p(Counts)\",\n    color = \"Package\"\n  ) +\n  theme(legend.position = \"bottom\")\n\nts_downloads_tbl(\n    .data = data_tbl,\n    .by_time = \"day\"\n) %\u003e%\n  select(date, value) %\u003e%\n  summarise_by_time(\n    .date_var = date,\n    .by = \"week\",\n    Actual = (sum(value, na.rm = TRUE))\n  ) %\u003e%\n    mutate(Actual = cumsum(Actual)) %\u003e%\n  tk_augment_differences(.value = Actual, .lags = c(1, 2)) %\u003e%\n  rename(velocity = contains(\"_lag1\")) %\u003e%\n  rename(acceleration = contains(\"_lag2\")) %\u003e%\n  pivot_longer(-date) %\u003e%\n#  filter(date \u003e= subtract_time(md, \"1 year\")) %\u003e%\n  mutate(name = str_to_title(name)) %\u003e%\n  mutate(name = as_factor(name)) %\u003e%\n  ggplot(aes(x = date, y = log1p(value), group = name)) +\n  geom_point(alpha = .2) +\n  geom_line() +\n  geom_vline(\n    data = pkg_tbl\n    , aes(xintercept = date, color = package)\n    , lwd = 1\n    , lty = \"solid\"\n  ) +\n  facet_wrap(name ~ ., ncol = 1, scale = \"free\") +\n  theme_minimal() +\n  labs(\n    title = \"Total Downloads: Trend, Velocity, and Accelertion\",\n    subtitle = \"Vertical Lines Indicate a CRAN Release date for a package.\",\n    x = \"Date\",\n    y = \"Values\",\n    color = \"\"\n  ) +\n  theme(legend.position = \"bottom\")\n```\n\nNow lets take a look at some time series decomposition graphs.\n\n```{r ts_decomp_plt, echo=FALSE}\nplot_stl_diagnostics(\n  .data = ts_downloads_tbl(\n    .data = data_tbl,\n    .by_time = \"week\"\n    ),\n  .date_var = date,\n  .value = log1p(value),\n  .interactive = FALSE\n) +\n  labs(\n    title = \"STL Diagnositcs: log1p(Values) - Weekly Aggregation\"\n  ) +\n  theme_minimal()\n\nplot_stl_diagnostics(\n  .data = ts_downloads_tbl(\n    .data = data_tbl,\n    .by_time = \"month\"\n    ),\n  .date_var = date,\n  .value = log1p(value),\n  .interactive = FALSE\n) +\n  labs(\n    title = \"STL Diagnositcs: log1p(Values) - Monthly Aggregation\"\n  ) +\n  theme_minimal()\n\nplot_seasonal_diagnostics(\n    .data = ts_downloads_tbl(\n    .data = data_tbl,\n    .by_time = \"day\"\n    ),\n  .date_var = date,\n  .value = log1p(value),\n  .interactive = FALSE\n) +\n  labs(\n    title = \"Seasonal Diagnostics: log1p(Values)\"\n  ) +\n  theme_minimal()\n\nplot_acf_diagnostics(\n    .data = ts_downloads_tbl(\n    .data = data_tbl,\n    .by_time = \"day\"\n    ),\n  .date_var = date,\n  .value = log1p(value),\n  .interactive = FALSE\n) +\n  labs(\n    title = \"Lag Diagnostics: log1p(Values)\"\n  ) +\n  theme_minimal()\n```\n\n## Feature Engineering\n\nNow that we have our basic data and a shot of what it looks like, let's add some\nfeatures to our data which can be very helpful in modeling. Lets start by making\na `tibble` that is aggregated by the day and package, as we are going to be interested\nin forecasting the next 4 weeks or 28 days for each package. First lets get our base data.\n\n```{r base_data_frame, echo=FALSE}\nts_downloads_tbl(data_tbl) |\u003e \n    plot_time_series_regression(\n        date, \n        (value) ~ date + lag(value, 1) + lag(value, 7) + lag(value, 14) \n        + lag(value, 21) + lag(value, 28) + lag(value, 35) \n        + lag(value, 42) + lag(value, 49) + month(date, label = TRUE) \n        + fourier_vec(date, type = \"sin\", K = 1, period = 7) \n        + fourier_vec(date, type = \"cos\", K = 1, period = 7), \n        .show_summary = TRUE\n    )\n\nbase_data \u003c- ts_downloads_tbl(\n  .data    = data_tbl,\n  .by_time = \"day\",\n  package\n) %\u003e%\n    #filter(date \u003e= subtract_time(date, \"18 months\")) %\u003e%\n    mutate(package = factor(package)) %\u003e%\n    select(package, date, value)\n```\n\n## NNS Forecasting\n\nThis is something I have been wanting to try for a while. The `NNS` package is a\ngreat package for forecasting time series data.\n\n[NNS GitHub](https://github.com/OVVO-Financial/NNS)\n```{r nns_forecasting, message=FALSE, warning=FALSE}\nlibrary(NNS)\n\ndata_list \u003c- base_data |\u003e\n    select(package, value) |\u003e\n    group_split(package)\n\ndata_list |\u003e\n    imap(\n        \\(x, idx) {\n            obj \u003c- x\n            x \u003c- obj |\u003e pull(value) |\u003e tail(7*52)\n            train_set_size \u003c- length(x) - 56\n            pkg \u003c- obj |\u003e pluck(1) |\u003e unique()\n            sf \u003c- NNS.seas(x, modulo = 7, plot = FALSE)$periods\n            \n            cat(paste0(\"Package: \", pkg, \"\\n\"))\n            NNS.ARMA.optim(\n                variable = x,\n                h = 28,\n                training.set = train_set_size,\n                #seasonal.factor = seq(12, 60, 7),\n                seasonal.factor = sf,\n                pred.int = 0.95,\n                plot = TRUE\n            )\n            title(\n                sub = paste0(\"\\n\",\n                             \"Package: \", pkg, \" - NNS Optimization\")\n            )\n        }\n    )\n```\n\n## Pre-Processing\n\nNow we are going to do some basic pre-processing.\n\n```{r preprocess, message=FALSE}\ndata_padded_tbl \u003c- base_data %\u003e%\n  pad_by_time(\n    .date_var  = date,\n    .pad_value = 0\n  )\n\n# Get log interval and standardization parameters\nlog_params  \u003c- liv(data_padded_tbl$value, limit_lower = 0, offset = 1, silent = TRUE)\nlimit_lower \u003c- log_params$limit_lower\nlimit_upper \u003c- log_params$limit_upper\noffset      \u003c- log_params$offset\n\ndata_liv_tbl \u003c- data_padded_tbl %\u003e%\n  # Get log interval transform\n  mutate(value_trans = liv(value, limit_lower = 0, offset = 1, silent = TRUE)$log_scaled)\n\n# Get Standardization Params\nstd_params \u003c- standard_vec(data_liv_tbl$value_trans, silent = TRUE)\nstd_mean   \u003c- std_params$mean\nstd_sd     \u003c- std_params$sd\n\ndata_transformed_tbl \u003c- data_liv_tbl %\u003e%\n  # get standardization\n  mutate(value_trans = standard_vec(value_trans, silent = TRUE)$standard_scaled) %\u003e%\n  select(-value)\n```\n\n\nSince this is panel data we can follow one of two different modeling strategies. \nWe can search for a global model in the panel data or we can use nested forecasting\nfinding the best model for each of the time series. Since we only have 5 panels, we\nwill use nested forecasting.\n\nTo do this we will use the `nest_timeseries` and `split_nested_timeseries` functions\nto create a nested `tibble`.\n\n```{r nested_data}\nhorizon \u003c- 4*7\n\nnested_data_tbl \u003c- data_transformed_tbl %\u003e%\n    \n    # 1. Extending: We'll predict n days into the future.\n    extend_timeseries(\n        .id_var        = package,\n        .date_var      = date,\n        .length_future = horizon\n    ) %\u003e%\n    \n    # 2. Nesting: We'll group by id, and create a future dataset\n    #    that forecasts n days of extended data and\n    #    an actual dataset that contains n*2 days\n    nest_timeseries(\n        .id_var        = package,\n        .length_future = horizon\n        #.length_actual = horizon*2\n    ) %\u003e%\n    \n   # 3. Splitting: We'll take the actual data and create splits\n   #    for accuracy and confidence interval estimation of n das (test)\n   #    and the rest is training data\n    split_nested_timeseries(\n        .length_test = horizon\n    )\n\nnested_data_tbl\n```\n\nNow it is time to make some recipes and models using the modeltime workflow.\n\n## Modeltime Workflow\n\n### Recipe Object\n\n```{r rec_obj}\nrecipe_base \u003c- recipe(\n  value_trans ~ date\n  , data = extract_nested_test_split(nested_data_tbl)\n  )\n\nrecipe_base\n\nrecipe_date \u003c- recipe_base %\u003e%\n    step_mutate(date = as.numeric(date))\n\n```\n\n### Models\n```{r time_series_models}\n# Models ------------------------------------------------------------------\n\n# Auto ARIMA --------------------------------------------------------------\n\nmodel_spec_arima_no_boost \u003c- arima_reg() %\u003e%\n  set_engine(engine = \"auto_arima\")\n\nwflw_auto_arima \u003c- workflow() %\u003e%\n  add_recipe(recipe = recipe_base) %\u003e%\n  add_model(model_spec_arima_no_boost)\n\n# NNETAR ------------------------------------------------------------------\n\nmodel_spec_nnetar \u003c- nnetar_reg(\n  mode              = \"regression\"\n  , seasonal_period = \"auto\"\n) %\u003e%\n  set_engine(\"nnetar\")\n\nwflw_nnetar \u003c- workflow() %\u003e%\n  add_recipe(recipe = recipe_base) %\u003e%\n  add_model(model_spec_nnetar)\n\n# TSLM --------------------------------------------------------------------\n\nmodel_spec_lm \u003c- linear_reg() %\u003e%\n  set_engine(\"lm\")\n\nwflw_lm \u003c- workflow() %\u003e%\n  add_recipe(recipe = recipe_base) %\u003e%\n  add_model(model_spec_lm)\n\n# MARS --------------------------------------------------------------------\n\nmodel_spec_mars \u003c- mars(mode = \"regression\") %\u003e%\n  set_engine(\"earth\")\n\nwflw_mars \u003c- workflow() %\u003e%\n  add_recipe(recipe = recipe_base) %\u003e%\n  add_model(model_spec_mars)\n\n```\n\n\n### Nested Modeltime Tables\n```{r nested_modeltime_tables}\nnested_modeltime_tbl \u003c- modeltime_nested_fit(\n  # Nested Data\n  nested_data = nested_data_tbl,\n   control = control_nested_fit(\n     verbose = TRUE,\n     allow_par = FALSE\n   ),\n  # Add workflows\n  wflw_auto_arima,\n  wflw_lm,\n  wflw_mars,\n  wflw_nnetar\n)\n\nnested_modeltime_tbl \u003c- nested_modeltime_tbl[!is.na(nested_modeltime_tbl$package),]\n```\n\n### Model Accuracy\n```{r accuracy}\nnested_modeltime_tbl %\u003e%\n  extract_nested_test_accuracy() %\u003e%\n  filter(!is.na(package)) %\u003e%\n  knitr::kable()\n```\n\n### Plot Models\n```{r model_plot}\nnested_modeltime_tbl %\u003e%\n  extract_nested_test_forecast() %\u003e%\n  group_by(package) %\u003e%\n  plot_modeltime_forecast(\n    .interactive = FALSE,\n    .conf_interval_show  = FALSE,\n    .facet_scales = \"free\"\n  ) +\n  theme_minimal() +\n  theme(legend.position = \"bottom\")\n```\n\n### Best Model\n```{r best_model}\nbest_nested_modeltime_tbl \u003c- nested_modeltime_tbl %\u003e%\n  modeltime_nested_select_best(\n    metric = \"rmse\",\n    minimize = TRUE,\n    filter_test_forecasts = TRUE\n  )\n\nbest_nested_modeltime_tbl %\u003e%\n  extract_nested_best_model_report()\n\nbest_nested_modeltime_tbl %\u003e%\n  extract_nested_test_forecast() %\u003e%\n  #filter(!is.na(.model_id)) %\u003e%\n  group_by(package) %\u003e%\n  plot_modeltime_forecast(\n    .interactive = FALSE,\n    .conf_interval_alpha = 0.2,\n    .facet_scales = \"free\"\n  ) +\n  theme_minimal() +\n  theme(legend.position = \"bottom\")\n```\n\n## Refitting and Future Forecast\n\nNow that we have the best models, we can make our future forecasts.\n\n```{r refit}\n\nnested_modeltime_refit_tbl \u003c- best_nested_modeltime_tbl %\u003e%\n    modeltime_nested_refit(\n        control = control_nested_refit(verbose = TRUE)\n    )\n\n\nnested_modeltime_refit_tbl\n\nnested_modeltime_refit_tbl %\u003e%\n  extract_nested_future_forecast() %\u003e%\n  mutate(across(.value:.conf_hi, .fns = ~ standard_inv_vec(\n    x    = .,\n    mean = std_mean,\n    sd   = std_sd\n  )$standard_inverse_value)) %\u003e%\n  mutate(across(.value:.conf_hi, .fns = ~ liiv(\n    x = .,\n    limit_lower = limit_lower,\n    limit_upper = limit_upper,\n    offset      = offset\n  )$rescaled_v)) %\u003e%\n  group_by(package) %\u003e%\n  plot_modeltime_forecast(\n    .interactive = FALSE,\n    .conf_interval_alpha = 0.2,\n    .facet_scales = \"free\"\n  ) +\n  theme_minimal() +\n  theme(legend.position = \"bottom\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspsanderson%2Fhealthyverse_tsa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspsanderson%2Fhealthyverse_tsa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspsanderson%2Fhealthyverse_tsa/lists"}