{"id":34743189,"url":"https://github.com/donishadsmith/vswift","last_synced_at":"2026-04-21T12:33:19.625Z","repository":{"id":154426364,"uuid":"621923152","full_name":"donishadsmith/vswift","owner":"donishadsmith","description":"A R package for evaluating ML classification models.","archived":false,"fork":false,"pushed_at":"2026-04-06T00:40:43.000Z","size":1239,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-06T02:28:31.132Z","etag":null,"topics":["classification","cross-validation","data-science","machine-learning","model-evaluation","r"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/donishadsmith.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-03-31T17:25:02.000Z","updated_at":"2026-04-06T00:40:48.000Z","dependencies_parsed_at":"2023-12-21T06:23:06.290Z","dependency_job_id":"c62834cb-d07b-48c7-ad0f-44bcbdab4354","html_url":"https://github.com/donishadsmith/vswift","commit_stats":null,"previous_names":[],"tags_count":36,"template":false,"template_full_name":null,"purl":"pkg:github/donishadsmith/vswift","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/donishadsmith%2Fvswift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/donishadsmith%2Fvswift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/donish
adsmith%2Fvswift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/donishadsmith%2Fvswift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/donishadsmith","download_url":"https://codeload.github.com/donishadsmith/vswift/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/donishadsmith%2Fvswift/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32091889,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-21T11:25:29.218Z","status":"ssl_error","status_checked_at":"2026-04-21T11:25:28.499Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","cross-validation","data-science","machine-learning","model-evaluation","r"],"created_at":"2025-12-25T04:27:05.716Z","updated_at":"2026-04-21T12:33:19.411Z","avatar_url":"https://github.com/donishadsmith.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vswift\r\n![R Versions](https://img.shields.io/badge/R-4.3%20%7C%204.4%20%7C%204.5-blue)\r\n[![Test 
Status](https://github.com/donishadsmith/vswift/actions/workflows/testing.yaml/badge.svg)](https://github.com/donishadsmith/vswift/actions/workflows/testing.yaml)\r\n[![Codecov](https://codecov.io/github/donishadsmith/vswift/graph/badge.svg?token=7DYAPU2M0G)](https://codecov.io/github/donishadsmith/vswift)\r\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\r\n\r\nvswift provides a unified interface to multiple classification algorithms from \r\npopular R packages for performing model evaluation on classification tasks\r\n(binary and multi-class).\r\n\r\n## Supported Classification Algorithms\r\nThe following classification algorithms are available through their respective\r\nR packages:\r\n\r\n  - `lda` from MASS package for Linear Discriminant Analysis\r\n  - `qda` from MASS package for Quadratic Discriminant Analysis\r\n  - `glm` from stats package with `family = \"binomial\"` for Unregularized\r\n  Logistic Regression\r\n  - `glmnet` from glmnet package with `family = \"binomial\"` or\r\n  `family = \"multinomial\"` and using `cv.glmnet` to select the optimal lambda for\r\n  Regularized Logistic Regression and Regularized Multinomial Logistic Regression\r\n  - `svm` from e1071 package for Support Vector Machine\r\n  - `naive_bayes` from naivebayes package for Naive Bayes\r\n  - `nnet` from nnet package for Neural Network\r\n  - `train.kknn` from kknn package for K-Nearest Neighbors\r\n  - `rpart` from rpart package for Decision Trees\r\n  - `randomForest` from randomForest package for Random Forest\r\n  - `multinom` from nnet package for Unregularized Multinomial Logistic\r\n  Regression\r\n  - `xgb.train` from xgboost package for Extreme Gradient Boosting\r\n\r\n## Features\r\n\r\n### Data Handling\r\n- **Versatile Data Splitting**: Perform train-test splits or cross-validation\r\non your classification data.\r\n- **Stratified Sampling Option**: Ensure representative class distribution\r\nusing stratified 
sampling based on class proportions.\r\n- **Handling Unseen Categorical Levels**: Automatically exclude observations\r\nfrom the validation/test set with categories not seen during model training.\r\n\r\n### Model Configuration\r\n- **Support for Popular Algorithms**: Choose from a wide range of classification\r\nalgorithms. Multiple algorithms can be specified in a single function call.\r\n- **Model Saving Capabilities**: Save all models utilized for training and\r\ntesting for both train-test splitting and cross-validation.\r\n- **Final Model Creation**: Easily create and save final models for future use.\r\n- **Dataset Saving Options**: Preserve split datasets and folds for\r\nreproducibility.\r\n- **Parallel Processing**: Utilize multi-core processing for cross-validation\r\nthrough the future package, configurable via `n_cores` and `future.seed` keys\r\nin the `parallel_configs` parameter.\r\n\r\n### Data Preprocessing\r\n- **Missing Data Imputation**: Select either Bagged Tree Imputation or KNN\r\nImputation, implemented using the recipes package. Imputation only uses feature\r\ndata (specifically observations where not all features are missing) from the\r\ntraining set to prevent leakage.\r\n- **Automatic Numerical Encoding**: Target variable classes are automatically\r\nencoded numerically for algorithms requiring numerical inputs.\r\n\r\n### Model Evaluation\r\n- **Comprehensive Metrics**: Generate and save performance metrics including\r\nclassification accuracy, precision, recall, and F1 for each class. 
For binary\r\nclassification tasks, produce ROC (Receiver Operating Characteristic) and PR\r\n(Precision-Recall) curves and calculate AUC (Area Under Curve) scores.\r\n\r\n## Installation\r\n\r\n### From the \"main\" branch\r\n\r\n```R\r\n# Install 'devtools' to install packages from GitHub\r\ninstall.packages(\"devtools\")\r\n\r\n# Install 'vswift' package\r\ndevtools::install_github(\"donishadsmith/vswift\", build_manual = TRUE, build_vignettes = TRUE)\r\n \r\n# Display documentation for the 'vswift' package\r\nhelp(package = \"vswift\")\r\n```\r\n\r\n### GitHub release\r\n\r\n```R\r\n# Install 'vswift' package\r\ninstall.packages(\r\n  \"https://github.com/donishadsmith/vswift/releases/download/0.6.2/vswift_0.6.2.tar.gz\",\r\n  repos = NULL,\r\n  type = \"source\"\r\n)\r\n\r\n# Display documentation for the 'vswift' package\r\nhelp(package = \"vswift\")\r\n```\r\n\r\n## Usage\r\n\r\nThe type of classification algorithm is specified using the `models` parameter in the `class_cv` function.\r\n\r\nAcceptable inputs for the `models` parameter include:\r\n\r\n  - \"lda\" for Linear Discriminant Analysis\r\n  - \"qda\" for Quadratic Discriminant Analysis\r\n  - \"logistic\" for Unregularized Logistic Regression\r\n  - \"regularized_logistic\" for Regularized Logistic Regression\r\n  - \"svm\" for Support Vector Machine\r\n  - \"naivebayes\" for Naive Bayes\r\n  - \"nnet\" for Neural Network \r\n  - \"knn\" for K-Nearest Neighbors\r\n  - \"decisiontree\" for Decision Trees\r\n  - \"randomforest\" for Random Forest\r\n  - \"multinom\" for Unregularized Multinomial Logistic Regression\r\n  - \"regularized_multinomial\" for Regularized Multinomial Logistic Regression\r\n  - \"xgboost\" for Extreme Gradient Boosting\r\n\r\n### Using a single model\r\n\r\n*Note*: This example uses the [Differentiated Thyroid Cancer Recurrence data from the UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/915/differentiated+thyroid+cancer+recurrence). 
Additionally,\r\nif stratification is requested and one of the regularized models is used, then stratification will also be performed\r\non the training data used for `cv.glmnet`. In this case, the `foldid` parameter in `cv.glmnet` will be used to retain\r\nthe relative proportions in the target variable.\r\n\r\n```R\r\n# Set url for Thyroid Recurrence data from UCI Machine Learning Repository. This data has 383 instances and 16 features\r\nurl \u003c- \"https://archive.ics.uci.edu/static/public/915/differentiated+thyroid+cancer+recurrence.zip\"\r\n\r\n# Set file destination\r\ndest_file \u003c- file.path(getwd(), \"thyroid.zip\")\r\n\r\n# Download zip file\r\ndownload.file(url, dest_file)\r\n\r\n# Unzip file\r\nunzip(zipfile = dest_file, files = \"Thyroid_Diff.csv\")\r\n\r\nthyroid_data \u003c- read.csv(\"Thyroid_Diff.csv\")\r\n\r\n# Load the package\r\nlibrary(vswift)\r\n\r\n# Model arguments; nfolds is the number of folds for `cv.glmnet`\r\nmap_args \u003c- list(regularized_logistic = list(alpha = 1, nfolds = 3))\r\n\r\n# Perform train-test split and cross-validation with stratified sampling\r\nresults \u003c- class_cv(\r\n  data = thyroid_data,\r\n  formula = Recurred ~ .,\r\n  models = \"regularized_logistic\",\r\n  model_params = list(\r\n    map_args = map_args,\r\n    rule = \"1se\", # rule can be \"min\" or \"1se\"\r\n    verbose = TRUE\r\n  ),\r\n  train_params = list(\r\n    split = 0.8,\r\n    n_folds = 5,\r\n    standardize = TRUE,\r\n    stratified = TRUE,\r\n    random_seed = 123\r\n  ),\r\n  save = list(models = TRUE) # Saves both `cv.glmnet` and `glmnet` model\r\n)\r\n```\r\n\r\n\u003cdetails\u003e\r\n\r\n\u003csummary\u003e\u003cstrong\u003eOutput Message\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n```\r\nModel: regularized_logistic | Partition: Train-Test Split | Optimal lambda: 0.09459 (nested 3-fold cross-validation using '1se' rule) \r\nModel: regularized_logistic | Partition: Fold 1 | Optimal lambda: 0.00983 (nested 3-fold cross-validation 
using '1se' rule) \r\nModel: regularized_logistic | Partition: Fold 2 | Optimal lambda: 0.07949 (nested 3-fold cross-validation using '1se' rule) \r\nModel: regularized_logistic | Partition: Fold 3 | Optimal lambda: 0.01376 (nested 3-fold cross-validation using '1se' rule) \r\nModel: regularized_logistic | Partition: Fold 4 | Optimal lambda: 0.00565 (nested 3-fold cross-validation using '1se' rule) \r\nModel: regularized_logistic | Partition: Fold 5 | Optimal lambda: 0.01253 (nested 3-fold cross-validation using '1se' rule)\r\n```\r\n\r\n\u003c/details\u003e\r\n\r\nPrint optimal lambda values.\r\n```R\r\nresults$metrics(\"regularized_logistic\", \"optimal_lambdas\")\r\n```\r\n\r\n**Output**\r\n```\r\n      split       fold1       fold2       fold3       fold4       fold5 \r\n0.094590537 0.009834647 0.079494739 0.013763132 0.005649260 0.012525544 \r\n```\r\n\r\n```R\r\n# Quick summary\r\nresults$summary()\r\n```\r\n\r\n**Output**\r\n```\r\nClassification Results\r\n-----------------------------\r\n  Models:   regularized_logistic \r\n  Classes:  No, Yes \r\n  Split:    0.8 (Training), 0.2 (Test) \r\n  Folds:    5 \r\n\r\n  Mean Classification Accuracy (Train-Test Split):\r\n    Regularized Logistic Regression 0.928 (Training),  0.910 (Test)\r\n\r\n  Mean Classification Accuracy (CV):\r\n    Regularized Logistic Regression 0.948\r\n```\r\n\r\n```R\r\n# Print parameter information and model evaluation metrics\r\nresults$print(configs = TRUE, metrics = TRUE)\r\n```\r\n\r\n**Output**\r\n\r\n```\r\n - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n\r\n\r\nModel: Regularized Logistic Regression \r\n\r\nFormula: Recurred ~ .\r\n\r\nNumber of Features: 16\r\n\r\nClasses: No, Yes\r\n\r\nTraining Parameters: list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 123, standardize = TRUE, remove_obs = FALSE)\r\n\r\nModel Parameters: list(map_args = list(regularized_logistic = list(alpha = 1, nfolds = 3)), threshold = NULL, rule = 
\"1se\", final_model = FALSE, verbose = TRUE)\r\n\r\nUnlabeled Observations: 0\r\n\r\nIncomplete Labeled Observations: 0\r\n\r\nObservations Missing All Features: 0\r\n\r\nSample Size (Complete Observations): 383\r\n\r\nImputation Parameters: list(method = NULL, args = NULL)\r\n\r\nParallel Configs: list(n_cores = NULL, future.seed = NULL)\r\n\r\n\r\n\r\nTraining\r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.93 \r\n\r\nClass:   Precision:  Recall:       F1:\r\n\r\nNo             0.91     1.00      0.95 \r\nYes            0.98     0.76      0.86 \r\n\r\n\r\nTest \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.91 \r\n\r\nClass:   Precision:  Recall:       F1:\r\n\r\nNo             0.89     1.00      0.94 \r\nYes            1.00     0.68      0.81 \r\n\r\n\r\nCross-validation (CV) \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nAverage Classification Accuracy:  0.95 ± 0.03 (SD) \r\n\r\nClass:       Average Precision:        Average Recall:            Average F1:\r\n\r\nNo             0.94 ± 0.04 (SD)       0.99 ± 0.01 (SD)       0.96 ± 0.02 (SD) \r\nYes            0.97 ± 0.03 (SD)       0.84 ± 0.12 (SD)       0.90 ± 0.06 (SD) \r\n\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n```\r\n\r\n```R\r\n# Plot model evaluation metrics\r\nresults$plot(split = TRUE, cv = TRUE, path = getwd())\r\n```\r\n\r\n\u003cdetails\u003e\r\n  \r\n  \u003csummary\u003e\u003cstrong\u003ePlots\u003c/strong\u003e\u003c/summary\u003e\r\n  \r\n  ![image](assets/thyroid/regularized_logistic_regression_cv_classification_accuracy.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_cv_f1_No.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_cv_f1_Yes.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_cv_precision_No.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_cv_precision_Yes.png)\r\n  
![image](assets/thyroid/regularized_logistic_regression_cv_recall_No.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_cv_recall_Yes.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_classification_accuracy.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_f1_No.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_f1_Yes.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_precision_No.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_precision_Yes.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_recall_No.png)\r\n  ![image](assets/thyroid/regularized_logistic_regression_train_test_recall_Yes.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n### Producing ROC and PR Curves with AUC scores\r\nROC and PR curves are only available for binary classification tasks. To generate either curve, the models must be\r\nsaved.\r\n\r\n```R\r\n# Can use `target` parameter, which accepts characters and integers instead of `formula`\r\nresults \u003c- class_cv(\r\n  data = thyroid_data,\r\n  target = \"Recurred\", # Using 17, the column index of \"Recurred\" is also valid\r\n  models = \"naivebayes\",\r\n  train_params = list(\r\n    split = 0.8,\r\n    n_folds = 5,\r\n    standardize = TRUE,\r\n    stratified = TRUE,\r\n    random_seed = 123\r\n  ),\r\n  save = list(models = TRUE)\r\n)\r\n```\r\n\r\nOutput consists of a `CurveResult` object containing thresholds used to generate the ROC, target labels, False Positive Rates (FPR), True Positive Rates (TPR)/Recall, Area Under The Curve (AUC), and Youden's Index for all training and validation sets for each model. 
For the PR curve, the outputs replace the FPR with Precision and Youden's Index with the maximum F1 score and its associated optimal threshold.\r\n\r\n```R\r\n# Thresholds are derived from the probabilities when `thresholds = NULL`\r\nroc_output \u003c- results$roc_curve(\r\n  data = thyroid_data,\r\n  return_output = TRUE,\r\n  thresholds = NULL,\r\n  path = getwd()\r\n)\r\n\r\npr_output \u003c- results$pr_curve(\r\n  data = thyroid_data,\r\n  return_output = TRUE,\r\n  thresholds = NULL,\r\n  path = getwd()\r\n)\r\n```\r\n\r\n**Output**\r\n\r\n```\r\nWarning message:\r\nIn .create_dictionary(x$classes, TRUE) :\r\n  creating keys for target variable for `rocCurve`;\r\n  classes are now encoded: No = 0, Yes = 1\r\n  \r\nWarning message:\r\nIn .create_dictionary(x$classes, TRUE) :\r\n  creating keys for target variable for `prCurve`;\r\n  classes are now encoded: No = 0, Yes = 1\r\n```\r\n\r\n![image](assets/curves/naivebayes_train_test_roc_curve.png)\r\n![image](assets/curves/naivebayes_cv_roc_curve.png)\r\n![image](assets/curves/naivebayes_train_test_precision_recall_curve.png)\r\n![image](assets/curves/naivebayes_cv_precision_recall_curve.png)\r\n\r\n\r\nAccess curve results using the `CurveResult` methods:\r\n\r\n```R\r\n# Get AUC for a specific model and partition\r\nroc_output$get_auc(\"naivebayes\", \"split\", \"test\")\r\n\r\n# Get probabilities\r\nroc_output$get_probs(\"naivebayes\", \"split\", \"train\")\r\n\r\n# Get curve metrics (FPR/TPR for ROC, precision/recall for PR)\r\nroc_output$get_metrics(\"naivebayes\", \"split\", \"test\")\r\n\r\n# Get optimal threshold (Youden's Index for ROC, max F1 threshold for PR)\r\nroc_output$get_optimal_threshold(\"naivebayes\", \"split\", \"test\")\r\n\r\n# Compare AUC across all models\r\nroc_output$compare(\"split\", \"test\")\r\n```\r\n\r\n\u003cdetails\u003e\r\n\r\n\u003csummary\u003e\u003cstrong\u003eOutput\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n\u003c/details\u003e\r\n\r\nOptimal threshold values can be used as input for 
`class_cv` to assess the performance when using a specific threshold.\r\n\r\n```R\r\n# Get average Youden's Index across folds\r\nnb_results \u003c- roc_output$get_model(\"naivebayes\")\r\navg_youdens_indx \u003c- mean(sapply(nb_results$cv, function(x) x$youdens_indx))\r\n\r\n# Using 17, the column index of \"Recurred\"\r\nresults \u003c- class_cv(\r\n  data = thyroid_data,\r\n  target = 17,\r\n  models = \"naivebayes\",\r\n  model_params = list(\r\n    threshold = avg_youdens_indx\r\n  ),\r\n  train_params = list(\r\n    n_folds = 5,\r\n    standardize = TRUE,\r\n    stratified = TRUE,\r\n    random_seed = 123\r\n  ),\r\n  save = list(models = TRUE)\r\n)\r\n\r\nresults$print()\r\n```\r\n\r\n\r\n\u003cdetails\u003e\r\n\r\n\u003csummary\u003e\u003cstrong\u003eOutput\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n```\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n\r\n\r\nModel: Naive Bayes \r\n\r\nFormula: c(Recurred ~ Age + Gender + Smoking + Hx.Smoking + Hx.Radiothreapy + ,  Thyroid.Function + Physical.Examination + Adenopathy + Pathology + ,  Focality + Risk + T + N + M + Stage + Response)\r\n\r\nNumber of Features: 16\r\n\r\nClasses: No, Yes\r\n\r\nTraining Parameters: list(split = NULL, n_folds = 5, stratified = TRUE, random_seed = 123, standardize = TRUE, remove_obs = FALSE)\r\n\r\nModel Parameters: list(map_args = NULL, threshold = 0.446228154420309, final_model = FALSE)\r\n\r\nUnlabeled Observations: 0\r\n\r\nIncomplete Labeled Observations: 0\r\n\r\nObservations Missing All Features: 0\r\n\r\nSample Size (Complete Observations): 383\r\n\r\nImputation Parameters: list(method = NULL, args = NULL)\r\n\r\nParallel Configs: list(n_cores = NULL, future.seed = NULL)\r\n\r\n\r\n\r\nCross-validation (CV) \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nAverage Classification Accuracy:  0.92 ± 0.03 (SD) \r\n\r\nClass:       Average Precision:        Average Recall:            Average F1:\r\n\r\nNo             0.95 ± 0.01 
(SD)       0.93 ± 0.03 (SD)       0.94 ± 0.02 (SD) \r\nYes            0.84 ± 0.07 (SD)       0.88 ± 0.03 (SD)       0.86 ± 0.04 (SD) \r\n\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n```\r\n\u003c/details\u003e\r\n\r\n\r\n### Impute Incomplete Labeled Data\r\n\r\nAvailable options include \"impute_bag\" and \"impute_knn\". Both methods are implemented using the recipes package.\r\n\r\n```R\r\nset.seed(0)\r\n\r\n# Introduce some missing data (about 1% of the rows in each column)\r\nfor (i in 1:ncol(thyroid_data)) {\r\n  thyroid_data[sample(1:nrow(thyroid_data), size = round(nrow(thyroid_data) * .01)), i] \u003c- NA\r\n}\r\n\r\nresults \u003c- class_cv(\r\n  formula = Recurred ~ .,\r\n  data = thyroid_data,\r\n  models = \"randomforest\",\r\n  train_params = list(\r\n    split = 0.8,\r\n    n_folds = 5,\r\n    stratified = TRUE,\r\n    random_seed = 123,\r\n    standardize = TRUE\r\n  ),\r\n  impute_params = list(method = \"impute_bag\", args = list(trees = 20, seed_val = 123)),\r\n  model_params = list(final_model = FALSE),\r\n  save = list(models = FALSE, data = FALSE)\r\n)\r\n\r\nresults$print()\r\n```\r\n\r\n\r\n\u003cdetails\u003e\r\n\r\n\u003csummary\u003e\u003cstrong\u003eOutput\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n```\r\nWarning messages:\r\n1: In .clean_data(data, missing_info, !is.null(impute_params$method)) :\r\n  dropping 8 unlabeled observations\r\n2: In .clean_data(data, missing_info, !is.null(impute_params$method)) :\r\n  110 labeled observations are missing data in one or more features and will be imputed\r\n\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n\r\n\r\nModel: Random Forest \r\n\r\nFormula: Recurred ~ .\r\n\r\nNumber of Features: 16\r\n\r\nClasses: No, Yes\r\n\r\nTraining Parameters: list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 123, standardize = TRUE, remove_obs = FALSE)\r\n\r\nModel Parameters: list(map_args = NULL, threshold = NULL, final_model = 
FALSE)\r\n\r\nUnlabeled Observations: 8\r\n\r\nIncomplete Labeled Observations: 110\r\n\r\nObservations Missing All Features: 0\r\n\r\nSample Size (Complete + Imputed Incomplete Labeled Observations): 375\r\n\r\nImputation Parameters: list(method = \"impute_bag\", args = list(trees = 20, seed_val = 123))\r\n\r\nParallel Configs: list(n_cores = NULL, future.seed = NULL)\r\n\r\n\r\n\r\nTraining \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  1.00 \r\n\r\nClass:   Precision:  Recall:       F1:\r\n\r\nNo             1.00     1.00      1.00 \r\nYes            1.00     0.99      0.99 \r\n\r\n\r\nTest \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.96 \r\n\r\nClass:   Precision:  Recall:       F1:\r\n\r\nNo             0.98     0.96      0.97 \r\nYes            0.91     0.95      0.93 \r\n\r\n\r\nCross-validation (CV) \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nAverage Classification Accuracy:  0.97 ± 0.01 (SD) \r\n\r\nClass:       Average Precision:        Average Recall:            Average F1:\r\n\r\nNo             0.97 ± 0.01 (SD)       0.98 ± 0.01 (SD)       0.98 ± 0.01 (SD) \r\nYes            0.95 ± 0.03 (SD)       0.92 ± 0.03 (SD)       0.94 ± 0.01 (SD) \r\n\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n```\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n### Using Parallel Processing\r\n\r\nParallel processing operates at the fold level, which means the system can simultaneously process multiple cross-validation folds (and the train-test split) even when training a single model.\r\n\r\n*Note*: This example uses the [Internet Advertisements data from the UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/51/internet+advertisements).\r\n\r\n```R\r\nset.seed(NULL)\r\n\r\n# Set url for Internet Advertisements data from UCI Machine Learning Repository. 
This data has 3,278 instances and 1,558 features.\r\nurl \u003c- \"https://archive.ics.uci.edu/static/public/51/internet+advertisements.zip\"\r\n\r\n# Set file destination\r\ndest_file \u003c- file.path(getwd(), \"ad.zip\")\r\n\r\n# Download zip file\r\ndownload.file(url, dest_file)\r\n\r\n# Unzip file\r\nunzip(zipfile = dest_file, files = \"ad.data\")\r\n\r\n# Read data\r\nad_data \u003c- read.csv(\"ad.data\")\r\n\r\n# Load in vswift\r\nlibrary(vswift)\r\n\r\n# Create arguments variable to tune parameters for multiple models\r\nmap_args \u003c- list(\r\n  \"knn\" = list(ks = 5),\r\n  \"xgboost\" = list(\r\n    params = list(\r\n      booster = \"gbtree\",\r\n      objective = \"reg:logistic\",\r\n      lambda = 0.0003,\r\n      alpha = 0.0003,\r\n      eta = 0.8,\r\n      max_depth = 6\r\n    ),\r\n    nrounds = 10\r\n  )\r\n)\r\n\r\nprint(\"Without Parallel Processing:\")\r\n\r\n# Obtain start time\r\nstart \u003c- proc.time()\r\n\r\n# Run the models without parallel processing\r\nresults \u003c- class_cv(\r\n  data = ad_data,\r\n  target = \"ad.\",\r\n  models = c(\"knn\", \"svm\", \"decisiontree\", \"xgboost\"),\r\n  train_params = list(\r\n    split = 0.8,\r\n    n_folds = 5,\r\n    random_seed = 123\r\n  ),\r\n  model_params = list(map_args = map_args)\r\n)\r\n\r\n# Compute elapsed time\r\nend \u003c- proc.time() - start\r\n\r\n# Print time\r\nprint(end)\r\n\r\nprint(\"Parallel Processing:\")\r\n\r\n# Adjust maximum object size that can be passed to workers during parallel processing; ~1.2 GB\r\noptions(future.globals.maxSize = 1200 * 1024^2)\r\n\r\n# Obtain start time\r\nstart_par \u003c- proc.time()\r\n\r\n# Run the same models using parallel processing with 6 cores\r\nresults \u003c- class_cv(\r\n  data = ad_data,\r\n  target = \"ad.\",\r\n  models = c(\"knn\", \"svm\", \"decisiontree\", \"xgboost\"),\r\n  train_params = list(\r\n    split = 0.8,\r\n    n_folds = 5,\r\n    random_seed = 123\r\n  ),\r\n  model_params = list(map_args = map_args),\r\n  parallel_configs 
= list(\r\n    n_cores = 6,\r\n    future.seed = 100\r\n  )\r\n)\r\n\r\n# Obtain end time\r\nend_par \u003c- proc.time() - start_par\r\n\r\n# Print time\r\nprint(end_par)\r\n```\r\n\r\n\r\n\u003cdetails\u003e\r\n\r\n\u003csummary\u003e\u003cstrong\u003eOutput\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n```\r\n[1] \"Without Parallel Processing:\"\r\n\r\nWarning message:\r\nIn .create_dictionary(preprocessed_data[, vars$target]) :\r\n  creating keys for target variable due to 'logistic' or 'xgboost' being specified;\r\n  classes are now encoded: ad. = 0, nonad. = 1\r\n\r\n   user  system elapsed \r\n 231.08    3.50  217.13 \r\n\r\n[1] \"Parallel Processing:\"\r\n\r\nWarning message:\r\nIn .create_dictionary(preprocessed_data[, vars$target]) :\r\n  creating keys for target variable due to 'logistic' or 'xgboost' being specified;\r\n  classes are now encoded: ad. = 0, nonad. = 1\r\n\r\n   user  system elapsed \r\n   2.06    5.89  103.59 \r\n```\r\n\u003c/details\u003e\r\n\r\n```R\r\n# Print parameter information and model evaluation metrics; If number of features \u003e 20, the target replaces the formula\r\nresults$print(models = c(\"xgboost\", \"knn\"))\r\n```\r\n\u003cdetails\u003e\r\n\r\n\u003csummary\u003e\u003cstrong\u003eOutput\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n```\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n\r\n\r\nModel: Extreme Gradient Boosting \r\n\r\nTarget: ad.\r\n\r\nNumber of Features: 1558\r\n\r\nClasses: ad., nonad.\r\n\r\nTraining Parameters: list(split = 0.8, n_folds = 5, stratified = FALSE, random_seed = 123, standardize = FALSE, remove_obs = FALSE)\r\n\r\nModel Parameters: list(map_args = list(xgboost = list(params = list(booster = \"gbtree\", objective = \"reg:logistic\", lambda = 3e-04, alpha = 3e-04, eta = 0.8, max_depth = 6), nrounds = 10)), logistic_threshold = 0.5, final_model = FALSE)\r\n\r\nUnlabeled Observations: 0\r\n\r\nIncomplete Labeled Observations: 0\r\n\r\nObservations 
Missing All Features: 0\r\n\r\nSample Size (Complete Data): 3278\r\n\r\nImputation Parameters: list(method = NULL, args = NULL)\r\n\r\nParallel Configs: list(n_cores = 6, future.seed = 100)\r\n\r\n\r\n\r\nTraining \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.99 \r\n\r\nClass:      Precision:  Recall:       F1:\r\n\r\nad.               0.98     0.93      0.96 \r\nnonad.            0.99     1.00      0.99 \r\n\r\n\r\nTest \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.98 \r\n\r\nClass:      Precision:  Recall:       F1:\r\n\r\nad.               0.99     0.85      0.91 \r\nnonad.            0.97     1.00      0.99 \r\n\r\n\r\nCross-validation (CV) \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nAverage Classification Accuracy:  0.98 ± 0.01 (SD) \r\n\r\nClass:          Average Precision:        Average Recall:            Average F1:\r\n\r\nad.               0.95 ± 0.02 (SD)       0.88 ± 0.04 (SD)       0.91 ± 0.02 (SD) \r\nnonad.            0.98 ± 0.01 (SD)       0.99 ± 0.00 (SD)       0.99 ± 0.00 (SD) \r\n\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n\r\n\r\nModel: K-Nearest Neighbors \r\n\r\nTarget: ad.\r\n\r\nNumber of Features: 1558\r\n\r\nClasses: ad., nonad.\r\n\r\nTraining Parameters: list(split = 0.8, n_folds = 5, stratified = FALSE, random_seed = 123, standardize = FALSE, remove_obs = FALSE)\r\n\r\nModel Parameters: list(map_args = list(knn = list(ks = 5)), final_model = FALSE)\r\n\r\nUnlabeled Observations: 0\r\n\r\nIncomplete Labeled Observations: 0\r\n\r\nObservations Missing All Features: 0\r\n\r\nSample Size (Complete Data): 3278\r\n\r\nImputation Parameters: list(method = NULL, args = NULL)\r\n\r\nParallel Configs: list(n_cores = 6, future.seed = 100)\r\n\r\n\r\n\r\nTraining \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.99 \r\n\r\nClass:      Precision:  Recall:       F1:\r\n\r\nad.               
0.90     1.00      0.95 \r\nnonad.            1.00     0.98      0.99 \r\n\r\n\r\nTest \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nClassification Accuracy:  0.91 \r\n\r\nClass:      Precision:  Recall:       F1:\r\n\r\nad.               0.67     0.80      0.73 \r\nnonad.            0.96     0.93      0.95 \r\n\r\n\r\nCross-validation (CV) \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nAverage Classification Accuracy:  0.93 ± 0.01 (SD) \r\n\r\nClass:          Average Precision:        Average Recall:            Average F1:\r\n\r\nad.               0.73 ± 0.06 (SD)       0.82 ± 0.05 (SD)       0.77 ± 0.03 (SD) \r\nnonad.            0.97 ± 0.01 (SD)       0.95 ± 0.01 (SD)       0.96 ± 0.01 (SD) \r\n\r\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \r\n```\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n```R\r\n# Plot results\r\nresults$plot(\r\n  models = \"xgboost\",\r\n  class_names = \"ad.\",\r\n  metrics = c(\"precision\", \"recall\"),\r\n  path = getwd()\r\n)\r\n```\r\n\r\n\u003cdetails\u003e\r\n  \r\n  \u003csummary\u003e\u003cstrong\u003ePlots\u003c/strong\u003e\u003c/summary\u003e\r\n\r\n  ![image](assets/ads/extreme_gradient_boosting_cv_precision_ad..png)\r\n  ![image](assets/ads/extreme_gradient_boosting_cv_recall_ad..png)\r\n  ![image](assets/ads/extreme_gradient_boosting_train_test_precision_ad..png)\r\n  ![image](assets/ads/extreme_gradient_boosting_train_test_recall_ad..png)\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n## Acknowledgements\r\nThe development of this package was inspired by other machine learning packages such as\r\ntopepo's [caret](https://github.com/topepo/caret) package, the\r\n[scikit-learn](https://github.com/scikit-learn/scikit-learn) package, and the\r\n[mlr3](https://github.com/mlr-org/mlr3) 
package.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdonishadsmith%2Fvswift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdonishadsmith%2Fvswift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdonishadsmith%2Fvswift/lists"}