{"id":14068005,"url":"https://github.com/ben519/mltools","last_synced_at":"2025-07-30T02:32:39.383Z","repository":{"id":47951209,"uuid":"67892505","full_name":"ben519/mltools","owner":"ben519","description":"Exploratory and diagnostic machine learning tools for R","archived":false,"fork":false,"pushed_at":"2021-09-20T15:54:14.000Z","size":176,"stargazers_count":73,"open_issues_count":8,"forks_count":26,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-07-28T21:41:32.195Z","etag":null,"topics":["exploratory-data-analysis","machine-learning","r"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ben519.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-10T20:18:39.000Z","updated_at":"2025-04-18T11:56:38.000Z","dependencies_parsed_at":"2022-08-21T06:50:10.336Z","dependency_job_id":null,"html_url":"https://github.com/ben519/mltools","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ben519/mltools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ben519%2Fmltools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ben519%2Fmltools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ben519%2Fmltools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ben519%2Fmltools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ben519","download_url":"https://codeload.github.com/ben519/mltools/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/ben519%2Fmltools/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267798625,"owners_count":24145727,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-30T02:00:09.044Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["exploratory-data-analysis","machine-learning","r"],"created_at":"2024-08-13T07:05:53.541Z","updated_at":"2025-07-30T02:32:39.130Z","avatar_url":"https://github.com/ben519.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"# mltools\n[![Travis-CI Build Status](https://travis-ci.org/ben519/mltools.svg?branch=master)](https://travis-ci.org/ben519/mltools)\n[![](https://cranlogs.r-pkg.org/badges/mltools)](https://CRAN.R-project.org/package=mltools)\n[![](https://cranlogs.r-pkg.org/badges/grand-total/mltools)](https://CRAN.R-project.org/package=mltools)\n\nExploratory and diagnostic machine learning tools for R\n\nAbout\n------\n\nThe goal of this package is multifold:\n\n- Speed up data preparation for feeding machine-learning models\n- Identify structure and patterns in a dataset\n- Evaluate the results of a machine-learning model\n\nInstallation\n------\n\n#### CRAN\n```r\ninstall.packages(\"mltools\")\n```\n\n#### or Github (development 
version)\n```r\ninstall.packages(\"devtools\")\ndevtools::install_github(\"ben519/mltools\")\n```\n\nDemonstration\n------\n\nPredict whether or not someone is an alien.\n\n```r\nlibrary(data.table)\nlibrary(mltools)\n\n# Copy the toy datasets since they are locked from being modified\ntrain \u003c- copy(alientrain)\ntest \u003c- copy(alientest)\n\ntrain\n   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien\n1:     green     300 type1 type1  type4    TRUE\n2:     white      95 type1 type2  type4   FALSE\n3:     brown     105 type2 type6 type11   FALSE\n4:     white     250 type4 type5  type2    TRUE\n5:      blue     115 type2 type7 type11    TRUE\n6:     white      85 type4 type5  type2   FALSE\n7:     green     130 type1 type2  type4    TRUE\n8:     white     115 type1 type1  type4   FALSE\n\ntest\n   SkinColor IQScore  Cat1  Cat2  Cat3\n1:     white      79 type4 type5 type2\n2:     green     100 type4 type5 type2\n3:     brown     125 type3 type9 type7\n4:     white      90 type1 type8 type4\n5:       red     115 type1 type2 type4\n```\n\n### Questions about the data:\n- Are there any pairs of categorical fields which are highly/perfectly correlated?\n- Are there any parent-child related categorical fields?\n- How does the target variable change with IQScore?\n- What's the cardinality and skewness of each feature?\n\n```r\n# Combine train (excluding IsAlien) and test\nalien.all \u003c- rbind(train[, !\"IsAlien\", with=FALSE], test)\n\n#--------------------------------------------------\n## Check for correlated and hierarchical fields\n\ngini_impurities(alien.all, wide=TRUE)  #  weighted conditional gini impurities\n        Var1      Cat1      Cat2      Cat3 SkinColor\n1:      Cat1 0.0000000 0.3589744 0.0000000 0.4743590\n2:      Cat2 0.0000000 0.0000000 0.0000000 0.3461538\n3:      Cat3 0.0000000 0.3589744 0.0000000 0.4743590\n4: SkinColor 0.4102564 0.5384615 0.4102564 0.0000000\n\n# (Cat1, Cat3) = (Cat3, Cat1) = 0 =\u003e Cat1 and Cat3 perfectly correspond to each 
other\n# (Cat1, Cat2) \u003e 0 and (Cat2, Cat1) = 0 =\u003e Cat1-Cat2 exhibit a parent-child relationship.\n# You can guess Cat1 by knowing Cat2, but not vice-versa.\n\n#--------------------------------------------------\n## Check relationship between IQScore and IsAlien by binning IQScore into groups\n\ntrain[, BinIQScore := bin_data(IQScore, bins=seq(0, 300, by=50))]\n   IQScore BinIQScore\n1:     300 [250, 300]\n2:      95  [50, 100)\n3:     105 [100, 150)\n4:     250 [250, 300]\n5:     115 [100, 150)\n6:      85  [50, 100)\n7:     130 [100, 150)\n8:     115 [100, 150)\n\ntrain[, list(Samples=.N, IQScore=mean(IQScore)), keyby=BinIQScore]\n   BinIQScore Samples IQScore\n1:  [50, 100)       2   90.00\n2: [100, 150)       4  116.25\n3: [250, 300]       2  275.00\n\n# Remove column BinIQScore\ntrain[, BinIQScore := NULL]\n\n#--------------------------------------------------\n## Check skewness of fields\n\nskewness(alien.all)\n$SkinColor\n   SkinColor Count       Pcnt\n1:     white     6 0.46153846\n2:     green     3 0.23076923\n3:     brown     2 0.15384615\n4:      blue     1 0.07692308\n5:       red     1 0.07692308\n\n$Cat1\n    Cat1 Count       Pcnt\n1: type1     6 0.46153846\n2: type4     4 0.30769231\n3: type2     2 0.15384615\n4: type3     1 0.07692308\n...\n```\n\n### Preparing for ML model\n- Categorical fields in train and test should be factors with the same levels\n- Split the training dataset to do cross validation\n- Convert datasets to sparse matrices\n\n```r\nset.seed(711)\n\n#--------------------------------------------------\n## Set SkinColor as a factor, such that it has the same levels in train and test\n## Set low frequency skin colors (1 or fewer occurrences) as \"_other_\"\n\nskincolors \u003c- list(train$SkinColor, test$SkinColor)\nskincolors \u003c- set_factor(skincolors, aggregationThreshold=1)\ntrain[, SkinColor := skincolors[[1]] ]  # update train with the new values\ntest[, SkinColor := skincolors[[2]] ]  # update test with the new 
values\n\n# Repeat the process above for other categorical fields (without setting low freq. values as \"_other_\")\nfor(col in c(\"Cat1\", \"Cat2\", \"Cat3\")){\n  vals \u003c- list(train[[col]], test[[col]])\n  vals \u003c- set_factor(vals)\n  set(train, j=col, value=vals[[1]])\n  set(test, j=col, value=vals[[2]])\n}\n\n#--------------------------------------------------\n## Randomly split the training data into 2 equally sized datasets\n\n# Partition train into two folds, stratified by IsAlien\ntrain[, FoldID := folds(IsAlien, nfolds=2, stratified=TRUE, seed=2016)]\n\ncvtrain \u003c- train[FoldID==1, !\"FoldID\"]\n   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien\n1:     green     300 type1 type1  type4    TRUE\n2:     brown     105 type2 type6 type11   FALSE\n3:     green     130 type1 type2  type4    TRUE\n4:     white     115 type1 type1  type4   FALSE\n\ncvtest \u003c- train[FoldID==2, !\"FoldID\"]\n   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien\n1:     white      95 type1 type2  type4   FALSE\n2:     white     250 type4 type5  type2    TRUE\n3:   _other_     115 type2 type7 type11    TRUE\n4:     white      85 type4 type5  type2   FALSE\n\n#--------------------------------------------------\n## Convert cvtrain and cvtest to sparse matrices\n## Note that unordered factors are one-hot-encoded\n\nlibrary(Matrix)\n\ncvtrain.sparse \u003c- sparsify(cvtrain)\n4 x 21 sparse Matrix of class \"dgCMatrix\"\n     SkinColor__other_ SkinColor_brown SkinColor_green SkinColor_white IQScore Cat1_type1 ...\n[1,]                 .               .               1               .     300          1\n[2,]                 .               1               .               .     105          .\n[3,]                 .               .               1               .     130          1\n[4,]                 .               .               .               
1     115          1\n\ncvtest.sparse \u003c- sparsify(cvtest)\n4 x 21 sparse Matrix of class \"dgCMatrix\"\n     SkinColor__other_ SkinColor_brown SkinColor_green SkinColor_white IQScore Cat1_type1 ...\n[1,]                 .               .               .               1      95          1\n[2,]                 .               .               .               1     250          .\n[3,]                 1               .               .               .     115          .\n[4,]                 .               .               .               1      85          .\n```\n\n### Evaluate model\n- What was the model's AUC ROC score?\n- How good were the model's predictions for each sample?\n\n```r\n#--------------------------------------------------\n## Naive model that guesses someone is an alien if their IQScore is \u003e 130\n\ncvtest[, Prediction := ifelse(IQScore \u003e 130, TRUE, FALSE)]\n\n#--------------------------------------------------\n## Evaluate predictions\n\n# Area Under the ROC Curve (AUC ROC)\nauc_roc(preds=cvtest$Prediction, actuals=cvtest$IsAlien)\n0.75\n\n# Individual scores to determine which predictions were good/bad (see help(roc_scores) for details)\ncvtest[, ROCScore := roc_scores(preds=Prediction, actuals=IsAlien)]\ncvtest[order(ROCScore)]\n   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien Prediction  ROCScore\n1:     white      95 type1 type2  type4   FALSE      FALSE 0.0000000\n2:     white     250 type4 type5  type2    TRUE       TRUE 0.0000000\n3:     white      85 type4 type5  type2   FALSE      FALSE 0.0000000\n4:   _other_     115 type2 type7 type11    TRUE      FALSE 0.1666667\n```\n\n## Contact\nIf you'd like to contact me regarding bugs, questions, or general consulting, feel free to drop me a line - bgorman519@gmail.com\n\n## Support\nFound this package helpful? 
Show your support and [buy some merch](https://merchonate.com/collections/ben-gorman-gormanalysis)!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fben519%2Fmltools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fben519%2Fmltools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fben519%2Fmltools/lists"}