{"id":27072505,"url":"https://github.com/tdaverse/tdarec","last_synced_at":"2025-07-08T20:34:19.078Z","repository":{"id":259553018,"uuid":"869159234","full_name":"tdaverse/tdarec","owner":"tdaverse","description":"recipes + dials extension for persistent homology and vectorizations thereof","archived":false,"fork":false,"pushed_at":"2025-04-03T01:21:08.000Z","size":1209,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-03T02:26:01.838Z","etag":null,"topics":["machine-learning","persistent-homology","recipes","tidymodels","topological-data-analysis","vectorization"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tdaverse.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-07T20:20:14.000Z","updated_at":"2025-04-03T01:21:12.000Z","dependencies_parsed_at":"2024-10-26T13:10:18.765Z","dependency_job_id":"d2ccde3f-8498-4a92-be7e-cc30dd5aa4fa","html_url":"https://github.com/tdaverse/tdarec","commit_stats":null,"previous_names":["corybrunson/tdarec","tdaverse/tdarec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdaverse%2Ftdarec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdaverse%2Ftdarec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdaverse%2Ftdarec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/tdaverse%2Ftdarec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tdaverse","download_url":"https://codeload.github.com/tdaverse/tdarec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247411232,"owners_count":20934654,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","persistent-homology","recipes","tidymodels","topological-data-analysis","vectorization"],"created_at":"2025-04-05T23:16:59.711Z","updated_at":"2025-07-08T20:34:19.072Z","avatar_url":"https://github.com/tdaverse.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n\n# tdarec\n\n\u003c!-- badges: start --\u003e\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n[![CRAN status](https://www.r-pkg.org/badges/version/tdarec)](https://CRAN.R-project.org/package=tdarec)\n\u003c!-- badges: end --\u003e\n\nThe goal of {tdarec} is to provide [{recipes}](https://cran.r-project.org/package=recipes)-style preprocessing steps to compute persistent homology (PH) and calculate vectorizations of persistence diagrams (PDs), and to provide [{dials}](https://cran.r-project.org/package=dials)-style hyperparameter tuners to optimize these steps in ML workflows.\n\nYou can install the development version of tdarec from [GitHub](https://github.com/) with:\n\n``` r\n# install.packages(\"pak\")\npak::pak(\"tdaverse/tdarec\")\n```\n\n## Design\n\n### Recipe steps\n\nThe current version provides two engines to compute PH (more will be implemented; see [this issue](https://github.com/tdaverse/tdarec/issues/2) for plans):\n\n* **Vietoris--Rips** filtrations of point clouds (distance matrices or coordinate matrices) using [{ripserr}](https://github.com/tdaverse/ripserr)\n* **cubical** filtrations of rasters (pixelated or voxelated data) using {ripserr}\n\nAlso included are a pre-processing step to introduce **Gaussian blur** to rasters and a post-processing step to select PDs for **specific homological degrees**.\n\nFinally, this version provides steps that deploy the highly efficient **vectorizations** implemented in [{TDAvec}](https://github.com/uislambekov/TDAvec).\nThese were written with {Rcpp} specifically for ML applications.\n\n### Tunable parameters\n\nMost steps come with new tunable parameters, for example the maximum homological degree of the 
VR filtration and the number of levels in persistence landscapes.\n\nOne conspicuously untunable set of parameters is the \"scale sequences\"---the values at (or intervals over) which each transformed PD is vectorized.\nAn implementation of tunable scale sequences is underway.\n\n### Data formats and sets\n\nWhile the most common {recipes} are designed for structured tabular data, i.e. columns with numeric or categorical entries, almost all data subjected to machine learning with persistent homology has been in forms like point clouds or greyscale images that must be stored in list-columns.\nAll {tdarec} examples use data in this form, and the data installed with the package is pre-processed for such use.\n\n## Example\n\nThis example uses existing engines to optimize a simple classification model for point clouds sampled from different embeddings of the Klein bottle.\nNote that [{glmnet}](https://cran.r-project.org/package=glmnet) and [{tdaunif}](https://cran.r-project.org/package=tdaunif) must be installed.\n\n### Setup\n\nWhile not required, we attach Tidyverse and Tidymodels for convenience (with messages suppressed):\n\n```{r example packages, message=FALSE}\n# prepare a Tidymodels session and attach {tdarec}\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(tdarec)\n```\n\nThe points are sampled uniformly from one of two Klein bottle embeddings, determined by a coin toss between the flat and the tube.\n\n```{r example samples}\n# generate samples from two embeddings\nset.seed(20024L)\ntibble(embedding = sample(c(\"flat\", \"tube\"), size = 48, replace = TRUE)) |\u003e \n  mutate(sample = lapply(embedding, function(emb) {\n    switch(\n      emb,\n      flat = tdaunif::sample_klein_flat(60, sd = .5),\n      tube = tdaunif::sample_klein_tube(60, sd = .5)\n    )\n  })) |\u003e \n  mutate(embedding = factor(embedding)) |\u003e\n  print() -\u003e klein_data\n```\n\nWe apply a classical partition into 80% training and 20% testing sets and prepare to perform 3-fold 
cross-validation on the training set.\n\n```{r example partition}\n# partition the data\nklein_split \u003c- initial_split(klein_data, prop = .8)\nklein_train \u003c- training(klein_split)\nklein_test \u003c- testing(klein_split)\nklein_folds \u003c- vfold_cv(klein_train, v = 3L)\n```\n\nIn this example, we adopt a common transformation of persistence diagrams, Euler characteristic curves.\nFor their vectorization, we need a scale sequence that spans the birth and death times of any persistent features, and for this we choose a round number larger than the diameters of both point clouds (based on the sampler documentation) as an upper bound.\nRather than choose _a priori_ to use homology up to degree 0, 1, 2, or 3, we prepare to tune the maximum degree during optimization.\n\n### Specifications\n\nTo prevent the model from using the data set column as a predictor, we assign it a new role, which is preserved by the persistent homology step and ignored by the vectorization step (which outputs new predictor columns).\n\n```{r example recipe}\n# specify a pre-processing recipe\nscale_seq \u003c- seq(0, 3, by = .05)\nrecipe(embedding ~ sample, data = klein_train) |\u003e \n  update_role(sample, new_role = \"data set\") |\u003e \n  step_pd_point_cloud(sample, max_hom_degree = tune(\"vr_degree\")) |\u003e \n  step_vpd_euler_characteristic_curve(sample, xseq = scale_seq) |\u003e \n  print() -\u003e klein_rec\n```\n\nFor simplicity, we choose a common model for ML classification, penalized logistic regression.\nWe fix the mixture coefficient to use LASSO rather than ridge regression but prepare the penalty parameter for tuning. 
\n\n```{r example model}\n# specify a classification model\nlogistic_reg(penalty = tune(), mixture = 1) |\u003e \n  set_mode(\"classification\") |\u003e \n  set_engine(\"glmnet\") |\u003e \n  print() -\u003e klein_lm\n```\n\nWe then generate a complete hyperparameter tuning grid by crossing the grids generated for the two unspecified parameters:\n\n```{r example grid}\n# generate a hyperparameter tuning grid\nklein_rec_grid \u003c- grid_regular(\n  extract_parameter_set_dials(klein_rec), levels = 3,\n  filter = c(vr_degree \u003e 0)\n)\nklein_lm_grid \u003c- grid_regular(\n  extract_parameter_set_dials(klein_lm), levels = 5\n)\nklein_grid \u003c- merge(klein_rec_grid, klein_lm_grid)\n```\n\n### Optimization\n\nWe evaluate the model across the hyperparameter grid by cross-validation, scoring each setting by the area under the sensitivity--specificity (ROC) curve:\n\n```{r example tune}\n# optimize the model performance\nklein_res \u003c- tune_grid(\n  klein_lm,\n  preprocessor = klein_rec,\n  resamples = klein_folds,\n  grid = klein_grid,\n  metrics = metric_set(roc_auc)\n)\n```\n\nFrom the results, we obtain the best-performing parameter setting:\n\n```{r example metric}\nklein_res |\u003e \n  select_best(metric = \"roc_auc\") |\u003e \n  print() -\u003e klein_best\n```\n\n### Evaluation\n\nThis optimal setting includes both the VR homology degree and the GLM penalty, so both the pre-processing recipe and the predictive model must be finalized in order to fit the final model to the full training set:\n\n```{r example fit}\nklein_rec_fin \u003c- klein_rec |\u003e finalize_recipe(klein_best) |\u003e prep()\nklein_lm_fin \u003c- klein_lm |\u003e finalize_model(klein_best)\nklein_fit \u003c- fit(\n  klein_lm_fin,\n  formula(klein_rec_fin),\n  data = bake(klein_rec_fin, new_data = klein_train)\n)\n```\n\nFinally, we evaluate the fitted model on the testing set:\n\n```{r example evaluate}\nklein_fit |\u003e \n  predict(\n    new_data = bake(klein_rec_fin, new_data = klein_test),\n    
type = \"prob\"\n  ) |\u003e \n  bind_cols(select(klein_test, embedding)) |\u003e \n  roc_auc(truth = embedding, .pred_flat)\n```\n\n## Contributions\n\nPlease note that the tdarec project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.\n\n### Generated code\n\nMuch of the code exposing {TDAvec} tools to Tidymodels is generated by elaborate scripts rather than written manually.\nWhile maintenance of these scripts takes effort, it prevents (or at least flags) errors arising from cascading implications of changes to the original functions, and it allows simple and rapid package-wide adjustments. If you see an issue with generated code, please raise an issue to discuss it before submitting a pull request.\n\n### Acknowledgments\n\nThis project was funded by [an ISC grant from the R Consortium](https://r-consortium.org/all-projects/2024-group-1.html#modular-interoperable-and-extensible-topological-data-analysis-in-r) and done in coordination with Aymeric Stamm and with guidance from Bertrand Michel and Paul Rosen.\nIt builds upon the work of and conversations with Umar Islambekov and Aleksei Luchinsky, authors of [{TDAvec}](https://github.com/uislambekov/TDAvec).\nPackage development also benefitted from the support of colleagues in [the Laboratory for Systems Medicine](https://systemsmedicine.pulmonary.medicine.ufl.edu/) and [the TDA Seminar](https://tda.math.ufl.edu/) and the use of equipment at [the University of Florida](https://www.ufl.edu/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdaverse%2Ftdarec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftdaverse%2Ftdarec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdaverse%2Ftdarec/lists"}