{"id":14067665,"url":"https://github.com/Nelson-Gon/mde","last_synced_at":"2025-07-30T02:31:00.464Z","repository":{"id":44703318,"uuid":"217719436","full_name":"Nelson-Gon/mde","owner":"Nelson-Gon","description":"mde: Missing Data Explorer","archived":false,"fork":false,"pushed_at":"2024-05-28T21:15:57.000Z","size":1439,"stargazers_count":5,"open_issues_count":2,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-09T01:58:13.938Z","etag":null,"topics":["data-analysis","data-cleaning","data-exploration","data-science","datacleaner","datacleaning","exploratory-data-analysis","missing","missing-data","missing-value-treatment","missing-values","missingness","omit","r","r-package","r-stats","recode","replace","rstats","statistics"],"latest_commit_sha":null,"homepage":"https://nelson-gon.github.io/mde","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Nelson-Gon.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-26T14:18:34.000Z","updated_at":"2025-06-28T04:41:37.000Z","dependencies_parsed_at":"2024-02-19T19:16:01.170Z","dependency_job_id":"bb286d94-3412-4212-a2ba-457412032b8d","html_url":"https://github.com/Nelson-Gon/mde","commit_stats":{"total_commits":293,"total_committers":3,"mean_commits":97.66666666666667,"dds":"0.023890784982935176","last_synced_commit":"fa92e7eea6b176ae3f3a38d242ec024c23c83b21"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/Nelson-Gon/mde","repository_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nelson-Gon%2Fmde","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nelson-Gon%2Fmde/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nelson-Gon%2Fmde/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nelson-Gon%2Fmde/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Nelson-Gon","download_url":"https://codeload.github.com/Nelson-Gon/mde/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nelson-Gon%2Fmde/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266923038,"owners_count":24006996,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-24T02:00:09.469Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-cleaning","data-exploration","data-science","datacleaner","datacleaning","exploratory-data-analysis","missing","missing-data","missing-value-treatment","missing-values","missingness","omit","r","r-package","r-stats","recode","replace","rstats","statistics"],"created_at":"2024-08-13T07:05:42.943Z","updated_at":"2025-07-30T02:31:00.417Z","avatar_url":"https://github.com/Nelson-Gon.png","language":"R","readme":"mde: Missing Data Explorer\n================\n2022-01-31\n\n\u003c!-- badges: start 
--\u003e\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3890659.svg)](https://doi.org/10.5281/zenodo.3890659)\n[![CRAN_Status_Badge](https://r-pkg.org/badges/version/mde)](https://cran.r-project.org/package=mde)\n[![CRAN_Release_Badge](https://www.r-pkg.org/badges/version-ago/mde)](https://CRAN.R-project.org/package=mde)\n[![Codecov test\ncoverage](https://codecov.io/gh/Nelson-Gon/mde/branch/master/graph/badge.svg)](https://codecov.io/gh/Nelson-Gon/mde?branch=master)\n[![R-CMD-check](https://github.com/Nelson-Gon/mde/actions/workflows/devel-check.yaml/badge.svg)](https://github.com/Nelson-Gon/mde/actions/workflows/devel-check.yaml)\n![test-coverage](https://github.com/Nelson-Gon/mde/workflows/test-coverage/badge.svg)\n[![Project\nStatus](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/)\n[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)\n[![license](https://img.shields.io/badge/license-GPL--3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.en.html)\n[![Downloads](https://cranlogs.r-pkg.org/badges/mde)](https://cran.r-project.org/package=mde)\n[![TotalDownloads](https://cranlogs.r-pkg.org/badges/grand-total/mde?color=green)](https://cran.r-project.org/package=mde)\n[![GitHub last\ncommit](https://img.shields.io/github/last-commit/Nelson-Gon/mde.svg)](https://github.com/Nelson-Gon/mde/commits/master)\n[![GitHub\nissues](https://img.shields.io/github/issues/Nelson-Gon/mde.svg)](https://GitHub.com/Nelson-Gon/mde/issues/)\n[![GitHub\nissues-closed](https://img.shields.io/github/issues-closed/Nelson-Gon/mde.svg)](https://GitHub.com/Nelson-Gon/mde/issues?q=is%3Aissue+is%3Aclosed)\n[![PRs\nWelcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://makeapullrequest.com)\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Nelson-Gon/mde/graphs/commit-activity)\n\u003c!-- badges: end 
--\u003e\n\n\u003cimg src='https://github.com/Nelson-Gon/mde/blob/master/man/figures/mde_icon_2.png?raw=true' align=\"right\" height=\"120\" width=\"120\"/\u003e\n\nThe goal of `mde` is to ease exploration of missingness.\n\n**Installation**\n\n**CRAN release**\n\n``` r\ninstall.packages(\"mde\")\n```\n\n**Stable Development version**\n\n``` r\ndevtools::install_github(\"Nelson-Gon/mde\")\n\n\ndevtools::install_github(\"Nelson-Gon/mde\",  build_vignettes=TRUE)\n```\n\n**Unstable Development version**\n\n``` r\ndevtools::install_github(\"Nelson-Gon/mde@develop\")\n```\n\n**Loading the package**\n\n``` r\nlibrary(mde)\n#\u003e Welcome to mde. This is mde version 0.3.2.\n#\u003e  Please file issues and feedback at https://www.github.com/Nelson-Gon/mde/issues\n#\u003e Turn this message off using 'suppressPackageStartupMessages(library(mde))'\n#\u003e  Happy Exploration :)\n```\n\n## Exploring missingness\n\nTo get a simple missingness report, use `na_summary`:\n\n``` r\nna_summary(airquality)\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 1      Day       0      153        100.00000        0.000000\n#\u003e 2    Month       0      153        100.00000        0.000000\n#\u003e 3    Ozone      37      116         75.81699       24.183007\n#\u003e 4  Solar.R       7      146         95.42484        4.575163\n#\u003e 5     Temp       0      153        100.00000        0.000000\n#\u003e 6     Wind       0      153        100.00000        0.000000\n```\n\nTo sort this summary by a given column:\n\n``` r\nna_summary(airquality,sort_by = \"percent_complete\")\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 3    Ozone      37      116         75.81699       24.183007\n#\u003e 4  Solar.R       7      146         95.42484        4.575163\n#\u003e 1      Day       0      153        100.00000        0.000000\n#\u003e 2    Month       0      153        100.00000        0.000000\n#\u003e 5     Temp       0      153       
 100.00000        0.000000\n#\u003e 6     Wind       0      153        100.00000        0.000000\n```\n\nTo reset (drop) row names, set `reset_rownames` to `TRUE`. This is\nespecially useful when the row names are simply numeric and carry no\nadditional information.\n\n``` r\nna_summary(airquality,sort_by = \"percent_complete\", reset_rownames = TRUE)\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 1    Ozone      37      116         75.81699       24.183007\n#\u003e 2  Solar.R       7      146         95.42484        4.575163\n#\u003e 3      Day       0      153        100.00000        0.000000\n#\u003e 4    Month       0      153        100.00000        0.000000\n#\u003e 5     Temp       0      153        100.00000        0.000000\n#\u003e 6     Wind       0      153        100.00000        0.000000\n```\n\nTo sort by `percent_missing` instead:\n\n``` r\nna_summary(airquality, sort_by = \"percent_missing\")\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 1      Day       0      153        100.00000        0.000000\n#\u003e 2    Month       0      153        100.00000        0.000000\n#\u003e 5     Temp       0      153        100.00000        0.000000\n#\u003e 6     Wind       0      153        100.00000        0.000000\n#\u003e 4  Solar.R       7      146         95.42484        4.575163\n#\u003e 3    Ozone      37      116         75.81699       24.183007\n```\n\nTo sort the above in descending order:\n\n``` r\nna_summary(airquality, sort_by=\"percent_missing\", descending = TRUE)\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 3    Ozone      37      116         75.81699       24.183007\n#\u003e 4  Solar.R       7      146         95.42484        4.575163\n#\u003e 1      Day       0      153        100.00000        0.000000\n#\u003e 2    Month       0      153        100.00000        0.000000\n#\u003e 5     Temp       0      153      
  100.00000        0.000000\n#\u003e 6     Wind       0      153        100.00000        0.000000\n```\n\nTo exclude certain columns from the analysis:\n\n``` r\nna_summary(airquality, exclude_cols = c(\"Day\", \"Wind\"))\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 1    Month       0      153        100.00000        0.000000\n#\u003e 2    Ozone      37      116         75.81699       24.183007\n#\u003e 3  Solar.R       7      146         95.42484        4.575163\n#\u003e 4     Temp       0      153        100.00000        0.000000\n```\n\nTo include or exclude via regex match:\n\n``` r\nna_summary(airquality, regex_kind = \"inclusion\",pattern_type = \"starts_with\", pattern = \"O|S\")\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 1    Ozone      37      116         75.81699       24.183007\n#\u003e 2  Solar.R       7      146         95.42484        4.575163\n```\n\n``` r\nna_summary(airquality, regex_kind = \"exclusion\",pattern_type = \"regex\", pattern = \"^[O|S]\")\n#\u003e   variable missing complete percent_complete percent_missing\n#\u003e 1      Day       0      153              100               0\n#\u003e 2    Month       0      153              100               0\n#\u003e 3     Temp       0      153              100               0\n#\u003e 4     Wind       0      153              100               0\n```\n\nTo get this summary by group:\n\n``` r\ntest2 \u003c- data.frame(ID= c(\"A\",\"A\",\"B\",\"A\",\"B\"), Vals = c(rep(NA,4),\"No\"),ID2 = c(\"E\",\"E\",\"D\",\"E\",\"D\"))\n\nna_summary(test2,grouping_cols = c(\"ID\",\"ID2\"))\n#\u003e # A tibble: 2 x 7\n#\u003e   ID    ID2   variable missing complete percent_complete percent_missing\n#\u003e   \u003cchr\u003e \u003cchr\u003e \u003cchr\u003e      \u003cdbl\u003e    \u003cdbl\u003e            \u003cdbl\u003e           \u003cdbl\u003e\n#\u003e 1 B     D     Vals           1        1               50              50\n#\u003e 2 A     E   
  Vals           3        0                0             100\n```\n\n``` r\nna_summary(test2, grouping_cols=\"ID\")\n#\u003e Warning in na_summary.data.frame(test2, grouping_cols = \"ID\"): All non grouping\n#\u003e values used. Using select non groups is currently not supported\n#\u003e # A tibble: 4 x 6\n#\u003e   ID    variable missing complete percent_complete percent_missing\n#\u003e   \u003cchr\u003e \u003cchr\u003e      \u003cdbl\u003e    \u003cdbl\u003e            \u003cdbl\u003e           \u003cdbl\u003e\n#\u003e 1 A     Vals           3        0                0             100\n#\u003e 2 A     ID2            0        3              100               0\n#\u003e 3 B     Vals           1        1               50              50\n#\u003e 4 B     ID2            0        2              100               0\n```\n\n-   `get_na_counts`\n\nThis provides a convenient way to show the number of missing values\ncolumn-wise. It is relatively fast (tests on about 400,000 rows\ntook a few microseconds).\n\nTo get the number of missing values in each column of `airquality`, we\ncan use the function as follows:\n\n``` r\nget_na_counts(airquality)\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    37       7    0    0     0   0\n```\n\nThe above might be less useful if one would like to get the results by\ngroup. 
In that case, one can provide a grouping vector of names in\n`grouping_cols`.\n\n``` r\ntest \u003c- structure(list(Subject = structure(c(1L, 1L, 2L, 2L), .Label = c(\"A\", \n\"B\"), class = \"factor\"), res = c(NA, 1, 2, 3), ID = structure(c(1L, \n1L, 2L, 2L), .Label = c(\"1\", \"2\"), class = \"factor\")), class = \"data.frame\", row.names = c(NA, \n-4L))\n\nget_na_counts(test, grouping_cols = \"ID\")\n#\u003e # A tibble: 2 x 3\n#\u003e   ID    Subject   res\n#\u003e   \u003cfct\u003e   \u003cint\u003e \u003cint\u003e\n#\u003e 1 1           0     1\n#\u003e 2 2           0     0\n```\n\n-   `percent_missing`\n\nThis is a very simple to use but quick way to take a look at the\npercentage of data that is missing column-wise.\n\n``` r\n\npercent_missing(airquality)\n#\u003e      Ozone  Solar.R Wind Temp Month Day\n#\u003e 1 24.18301 4.575163    0    0     0   0\n```\n\nWe can get the results by group by providing an optional `grouping_cols`\ncharacter vector.\n\n``` r\npercent_missing(test, grouping_cols = \"Subject\")\n#\u003e # A tibble: 2 x 3\n#\u003e   Subject   res    ID\n#\u003e   \u003cfct\u003e   \u003cdbl\u003e \u003cdbl\u003e\n#\u003e 1 A          50     0\n#\u003e 2 B           0     0\n```\n\nTo exclude some columns from the above exploration, one can provide an\noptional character vector in `exclude_cols`.\n\n``` r\npercent_missing(airquality,exclude_cols = c(\"Day\",\"Temp\"))\n#\u003e      Ozone  Solar.R Wind Month\n#\u003e 1 24.18301 4.575163    0     0\n```\n\n-   `sort_by_missingness`\n\nThis provides a very simple but relatively fast way to sort variables by\nmissingness. 
Unless otherwise stated, this does not currently support\narranging grouped percents.\n\nUsage:\n\n``` r\n\nsort_by_missingness(airquality, sort_by = \"counts\")\n#\u003e   variable percent\n#\u003e 1     Wind       0\n#\u003e 2     Temp       0\n#\u003e 3    Month       0\n#\u003e 4      Day       0\n#\u003e 5  Solar.R       7\n#\u003e 6    Ozone      37\n```\n\nTo sort in descending order:\n\n``` r\nsort_by_missingness(airquality, sort_by = \"counts\", descend = TRUE)\n#\u003e   variable percent\n#\u003e 1    Ozone      37\n#\u003e 2  Solar.R       7\n#\u003e 3     Wind       0\n#\u003e 4     Temp       0\n#\u003e 5    Month       0\n#\u003e 6      Day       0\n```\n\nTo use percentages instead:\n\n``` r\nsort_by_missingness(airquality, sort_by = \"percents\")\n#\u003e   variable   percent\n#\u003e 1     Wind  0.000000\n#\u003e 2     Temp  0.000000\n#\u003e 3    Month  0.000000\n#\u003e 4      Day  0.000000\n#\u003e 5  Solar.R  4.575163\n#\u003e 6    Ozone 24.183007\n```\n\n## Recoding as NA\n\n-   `recode_as_na`\n\nAs the name might imply, this converts any value or vector of values to\n`NA` i.e. 
we take a value such as “missing” or “NA” (not a real `NA`\naccording to `R`) and convert it to R’s known handler for missing values\n(`NA`).\n\nTo use the function out of the box (with default arguments), one simply\ndoes something like:\n\n``` r\ndummy_test \u003c- data.frame(ID = c(\"A\",\"B\",\"B\",\"A\"), \n                         values = c(\"n/a\",NA,\"Yes\",\"No\"))\n# Convert n/a and No to NA\nhead(recode_as_na(dummy_test, value = c(\"n/a\",\"No\")))\n#\u003e   ID values\n#\u003e 1  A   \u003cNA\u003e\n#\u003e 2  B   \u003cNA\u003e\n#\u003e 3  B    Yes\n#\u003e 4  A   \u003cNA\u003e\n```\n\nTo do so for specific columns rather than the entire dataset, provide\ncolumn names to `subset_cols`.\n\n``` r\n\nanother_dummy \u003c- data.frame(ID = 1:5, Subject = 7:11, \nChange = c(\"missing\",\"n/a\",2:4 ))\n# Only change values in the column Change\nhead(recode_as_na(another_dummy, subset_cols = \"Change\", value = c(\"n/a\",\"missing\")))\n#\u003e   ID Subject Change\n#\u003e 1  1       7   \u003cNA\u003e\n#\u003e 2  2       8   \u003cNA\u003e\n#\u003e 3  3       9      2\n#\u003e 4  4      10      3\n#\u003e 5  5      11      4\n```\n\nTo recode columns using\n[RegEx](https://en.wikipedia.org/wiki/Regular_expression), one can\nprovide `pattern_type` and a target `pattern`. 
Currently supported\n`pattern_type` values are `starts_with`, `ends_with`, `contains` and `regex`.\nSee the documentation for more details:\n\n``` r\n# only change at columns that start with Solar\nhead(recode_as_na(airquality,value=190,pattern_type=\"starts_with\",pattern=\"Solar\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41      NA  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    NA      NA 14.3   56     5   5\n#\u003e 6    28      NA 14.9   66     5   6\n```\n\n``` r\n# recode at columns that start with O or S (case sensitive)\nhead(recode_as_na(airquality,value=c(67,118),pattern_type=\"starts_with\",pattern=\"S|O\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36      NA  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    NA      NA 14.3   56     5   5\n#\u003e 6    28      NA 14.9   66     5   6\n```\n\n``` r\n# use my own RegEx\nhead(recode_as_na(airquality,value=c(67,118),pattern_type=\"regex\",pattern=\"(?i)^(s|o)\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36      NA  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    NA      NA 14.3   56     5   5\n#\u003e 6    28      NA 14.9   66     5   6\n```\n\n-   `recode_as_na_if`\n\nThis function allows one to deliberately introduce missing values if a\ncolumn meets a certain threshold of missing values. This is similar to\n`amputation` but is much more basic. 
It is only provided here because it\nis hoped it may be useful to someone for whatever reason.\n\n``` r\nhead(recode_as_na_if(airquality,sign=\"gt\", percent_na=20))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    NA     190  7.4   67     5   1\n#\u003e 2    NA     118  8.0   72     5   2\n#\u003e 3    NA     149 12.6   74     5   3\n#\u003e 4    NA     313 11.5   62     5   4\n#\u003e 5    NA      NA 14.3   56     5   5\n#\u003e 6    NA      NA 14.9   66     5   6\n```\n\n-   `recode_as_na_str`\n\nThis allows recoding as `NA` based on a string match.\n\n``` r\npartial_match \u003c- data.frame(A=c(\"Hi\",\"match_me\",\"nope\"), B=c(NA, \"not_me\",\"nah\"))\n\nrecode_as_na_str(partial_match,\"ends_with\",\"ME\", case_sensitive=FALSE)\n#\u003e      A    B\n#\u003e 1   Hi \u003cNA\u003e\n#\u003e 2 \u003cNA\u003e \u003cNA\u003e\n#\u003e 3 nope  nah\n```\n\n-   `recode_as_na_for`\n\nFor all values greater/less/less or equal/greater or equal than some\nvalue, can I convert them to `NA`?!\n\n**Yes You Can!** All we have to do is use `recode_as_na_for`:\n\n``` r\nhead(recode_as_na_for(airquality,criteria=\"gt\",value=25))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    NA      NA  7.4   NA     5   1\n#\u003e 2    NA      NA  8.0   NA     5   2\n#\u003e 3    12      NA 12.6   NA     5   3\n#\u003e 4    18      NA 11.5   NA     5   4\n#\u003e 5    NA      NA 14.3   NA     5   5\n#\u003e 6    NA      NA 14.9   NA     5   6\n```\n\nTo do so at specific columns, pass an optional `subset_cols` character\nvector:\n\n``` r\nhead(recode_as_na_for(airquality, value=40,subset_cols=c(\"Solar.R\",\"Ozone\"), criteria=\"gt\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    NA      NA  7.4   67     5   1\n#\u003e 2    36      NA  8.0   72     5   2\n#\u003e 3    12      NA 12.6   74     5   3\n#\u003e 4    18      NA 11.5   62     5   4\n#\u003e 5    NA      NA 14.3   56     5   5\n#\u003e 6    28      NA 14.9   66     5   6\n```\n\n## Recoding NA 
as\n\n-   `recode_na_as`\n\nSometimes, for whatever reason, one would like to replace `NA`s with\nwhatever value they would like. `recode_na_as` provides a very simple\nway to do just that.\n\n``` r\nhead(recode_na_as(airquality))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5     0       0 14.3   56     5   5\n#\u003e 6    28       0 14.9   66     5   6\n\n# use NaN\n\nhead(recode_na_as(airquality, value=NaN))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5   NaN     NaN 14.3   56     5   5\n#\u003e 6    28     NaN 14.9   66     5   6\n```\n\nAs a “bonus”, you can manipulate the data only at specific columns as\nshown here:\n\n``` r\nhead(recode_na_as(airquality, value=0, subset_cols=\"Ozone\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5     0      NA 14.3   56     5   5\n#\u003e 6    28      NA 14.9   66     5   6\n```\n\n`recode_na_as` also supports pattern-based column selection, similar to\n`recode_as_na`:\n\n``` r\nhead(mde::recode_na_as(airquality, value=0, pattern_type=\"starts_with\",pattern=\"Solar\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    NA       0 14.3   56     5   5\n#\u003e 6    28       0 14.9   66     5   6\n```\n\n-   `column_based_recode`\n\nEver needed to change values in a given column based on the 
proportions\nof `NA`s in other columns (row-wise)? The goal of `column_based_recode`\nis to achieve just that. Let’s see how we could do this with a simple\nexample:\n\n``` r\n\nhead(column_based_recode(airquality, values_from = \"Wind\", values_to=\"Wind\", pattern_type = \"regex\", pattern = \"Solar|Ozone\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    NA      NA  0.0   56     5   5\n#\u003e 6    28      NA 14.9   66     5   6\n```\n\n-   `custom_na_recode`\n\nThis allows recoding `NA` values with common stats functions such as\n`mean`, `max`, `min` and `sd`.\n\nTo use default values:\n\n``` r\nhead(custom_na_recode(airquality))\n#\u003e      Ozone  Solar.R Wind Temp Month Day\n#\u003e 1 41.00000 190.0000  7.4   67     5   1\n#\u003e 2 36.00000 118.0000  8.0   72     5   2\n#\u003e 3 12.00000 149.0000 12.6   74     5   3\n#\u003e 4 18.00000 313.0000 11.5   62     5   4\n#\u003e 5 42.12931 185.9315 14.3   56     5   5\n#\u003e 6 28.00000 185.9315 14.9   66     5   6\n```\n\nTo use select columns:\n\n``` r\n\n\nhead(custom_na_recode(airquality,func=\"mean\",across_columns=c(\"Solar.R\",\"Ozone\")))\n#\u003e      Ozone  Solar.R Wind Temp Month Day\n#\u003e 1 41.00000 190.0000  7.4   67     5   1\n#\u003e 2 36.00000 118.0000  8.0   72     5   2\n#\u003e 3 12.00000 149.0000 12.6   74     5   3\n#\u003e 4 18.00000 313.0000 11.5   62     5   4\n#\u003e 5 42.12931 185.9315 14.3   56     5   5\n#\u003e 6 28.00000 185.9315 14.9   66     5   6\n```\n\nTo use a function from another package to perform replacements, for\nexample a forward fill with `dplyr`’s `lead`:\n\n``` r\n# use lag for a backfill\nhead(custom_na_recode(airquality,func=dplyr::lead ))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   
2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    23      99 14.3   56     5   5\n#\u003e 6    28      19 14.9   66     5   6\n```\n\nTo perform replacement by group:\n\n``` r\nsome_data \u003c- data.frame(ID=c(\"A1\",\"A1\",\"A1\",\"A2\",\"A2\", \"A2\"),A=c(5,NA,0,8,3,4),B=c(10,0,0,NA,5,6),C=c(1,NA,NA,25,7,8))\n\nhead(custom_na_recode(some_data,func = \"mean\", grouping_cols = \"ID\"))\n#\u003e # A tibble: 6 x 4\n#\u003e   ID        A     B     C\n#\u003e   \u003cchr\u003e \u003cdbl\u003e \u003cdbl\u003e \u003cdbl\u003e\n#\u003e 1 A1      5    10       1\n#\u003e 2 A1      2.5   0       1\n#\u003e 3 A1      0     0       1\n#\u003e 4 A2      8     5.5    25\n#\u003e 5 A2      3     5       7\n#\u003e 6 A2      4     6       8\n```\n\nAcross specific columns:\n\n``` r\nhead(custom_na_recode(some_data,func = \"mean\", grouping_cols = \"ID\", across_columns = c(\"C\", \"A\")))\n#\u003e # A tibble: 6 x 4\n#\u003e   ID        A     B     C\n#\u003e   \u003cchr\u003e \u003cdbl\u003e \u003cdbl\u003e \u003cdbl\u003e\n#\u003e 1 A1      5      10     1\n#\u003e 2 A1      2.5     0     1\n#\u003e 3 A1      0       0     1\n#\u003e 4 A2      8      NA    25\n#\u003e 5 A2      3       5     7\n#\u003e 6 A2      4       6     8\n```\n\n-   `recode_na_if`\n\nGiven a `data.frame` object, one can recode `NA`s as another value based\non a grouping variable. 
In the example below, we replace all `NA`s in\nall columns with 0s if the ID is `A2` or `A3`:\n\n``` r\nsome_data \u003c- data.frame(ID=c(\"A1\",\"A2\",\"A3\", \"A4\"), \n                        A=c(5,NA,0,8), B=c(10,0,0,1),\n                        C=c(1,NA,NA,25))\n                        \nhead(recode_na_if(some_data,grouping_col=\"ID\", target_groups=c(\"A2\",\"A3\"),\n           replacement= 0))   \n#\u003e # A tibble: 4 x 4\n#\u003e   ID        A     B     C\n#\u003e   \u003cchr\u003e \u003cdbl\u003e \u003cdbl\u003e \u003cdbl\u003e\n#\u003e 1 A1        5    10     1\n#\u003e 2 A2        0     0     0\n#\u003e 3 A3        0     0     0\n#\u003e 4 A4        8     1    25\n```\n\n## Dropping NAs\n\n-   `drop_na_if`\n\nSuppose you want to drop any column whose percentage of `NA`s is\ngreater than or equal to a certain value. `drop_na_if` does just that.\n\nWe can drop any columns in `airquality` that have greater than or equal\nto (`gteq`) 24% of their values missing:\n\n``` r\nhead(drop_na_if(airquality, sign=\"gteq\",percent_na = 24))\n#\u003e   Solar.R Wind Temp Month Day\n#\u003e 1     190  7.4   67     5   1\n#\u003e 2     118  8.0   72     5   2\n#\u003e 3     149 12.6   74     5   3\n#\u003e 4     313 11.5   62     5   4\n#\u003e 5      NA 14.3   56     5   5\n#\u003e 6      NA 14.9   66     5   6\n```\n\nThe above also supports less than or equal to (`lteq`), equal to (`eq`),\ngreater than (`gt`) and less than (`lt`).\n\nTo keep certain columns even though they meet the target `percent_na`\ncriterion, one can provide an optional `keep_columns` character vector.\n\n``` r\n\nhead(drop_na_if(airquality, percent_na = 24, keep_columns = \"Ozone\"))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    NA      NA 14.3   56     5   5\n#\u003e 6    28      NA 14.9   66     5   
6\n```\n\nCompare the above result to the following:\n\n``` r\nhead(drop_na_if(airquality, percent_na = 24))\n#\u003e   Solar.R Wind Temp Month Day\n#\u003e 1     190  7.4   67     5   1\n#\u003e 2     118  8.0   72     5   2\n#\u003e 3     149 12.6   74     5   3\n#\u003e 4     313 11.5   62     5   4\n#\u003e 5      NA 14.3   56     5   5\n#\u003e 6      NA 14.9   66     5   6\n```\n\nTo drop groups that meet a set missingness criterion, we proceed as\nfollows.\n\n``` r\ngrouped_drop \u003c- structure(list(ID = c(\"A\", \"A\", \"B\", \"A\", \"B\"), \n          Vals = c(4, NA,  NA, NA, NA), Values = c(5, 6, 7, 8, NA)), \n          row.names = c(NA, -5L), class = \"data.frame\")\n# Drop all columns for groups that meet a percent missingness of greater than or\n# equal to 67\ndrop_na_if(grouped_drop,percent_na = 67,sign=\"gteq\",\n                                    grouping_cols = \"ID\")\n#\u003e # A tibble: 3 x 3\n#\u003e   ID     Vals Values\n#\u003e   \u003cchr\u003e \u003cdbl\u003e  \u003cdbl\u003e\n#\u003e 1 A         4      5\n#\u003e 2 A        NA      6\n#\u003e 3 A        NA      8\n```\n\n-   `drop_row_if`\n\nThis is similar to `drop_na_if` but does operations rowwise not\ncolumnwise. 
Compare to the example above:\n\n``` r\n# Drop rows with at least two NAs\nhead(drop_row_if(airquality, sign=\"gteq\", type=\"count\" , value = 2))\n#\u003e Dropped 2 rows.\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 6    28      NA 14.9   66     5   6\n#\u003e 7    23     299  8.6   65     5   7\n```\n\nTo drop based on percentages:\n\n``` r\n# Drops 42 rows\nhead(drop_row_if(airquality, type=\"percent\", value=16, sign=\"gteq\",\n                 as_percent=TRUE))\n#\u003e Dropped 42 rows.\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 7    23     299  8.6   65     5   7\n#\u003e 8    19      99 13.8   59     5   8\n```\n\nFor more details, please see the documentation of `drop_row_if`.\n\n-   `drop_na_at`\n\nThis provides a simple way to drop missing values only at specific\ncolumns. It currently only returns those columns with their missing\nvalues removed. See usage below. Further details are given in the\ndocumentation. 
It is currently case sensitive.\n\n``` r\nhead(drop_na_at(airquality,pattern_type = \"starts_with\",\"O\"))\n#\u003e   Ozone\n#\u003e 1    41\n#\u003e 2    36\n#\u003e 3    12\n#\u003e 4    18\n#\u003e 5    28\n#\u003e 6    23\n```\n\n-   `drop_all_na`\n\nThis drops columns where all values are missing.\n\n``` r\ntest2 \u003c- data.frame(ID= c(\"A\",\"A\",\"B\",\"A\",\"B\"), Vals = c(4,rep(NA, 4))) \ndrop_all_na(test2, grouping_cols=\"ID\")\n#\u003e # A tibble: 3 x 2\n#\u003e   ID     Vals\n#\u003e   \u003cchr\u003e \u003cdbl\u003e\n#\u003e 1 A         4\n#\u003e 2 A        NA\n#\u003e 3 A        NA\n```\n\nAlternatively, we can drop groups where all variables are all NA.\n\n``` r\ntest2 \u003c- data.frame(ID= c(\"A\",\"A\",\"B\",\"A\",\"B\"), Vals = rep(NA, 5)) \n\nhead(drop_all_na(test, grouping_cols = \"ID\"))\n#\u003e # A tibble: 4 x 3\n#\u003e   Subject   res ID   \n#\u003e   \u003cfct\u003e   \u003cdbl\u003e \u003cfct\u003e\n#\u003e 1 A          NA 1    \n#\u003e 2 A           1 1    \n#\u003e 3 B           2 2    \n#\u003e 4 B           3 2\n```\n\n-   `dict_recode`\n\nIf one would like to recode column values using a “dictionary”,\n`dict_recode` provides a simple way to do that. 
For example, if one\nwould like to convert `NA` values in `Solar.R` to 520 and those in\n`Ozone` to 42, one simply calls the following:\n\n``` r\nhead(dict_recode(airquality, use_func=\"recode_na_as\",\n                 patterns = c(\"solar\", \"ozone\"),\n                 pattern_type=\"starts_with\", values = c(520,42)))\n#\u003e   Ozone Solar.R Wind Temp Month Day\n#\u003e 1    41     190  7.4   67     5   1\n#\u003e 2    36     118  8.0   72     5   2\n#\u003e 3    12     149 12.6   74     5   3\n#\u003e 4    18     313 11.5   62     5   4\n#\u003e 5    42     520 14.3   56     5   5\n#\u003e 6    28     520 14.9   66     5   6\n```\n\n------------------------------------------------------------------------\n\nPlease note that the `mde` project is released with a [Contributor Code\nof\nConduct](https://github.com/Nelson-Gon/mde/blob/master/.github/CODE_OF_CONDUCT.md).\nBy contributing to this project, you agree to abide by its terms.\n\nFor further exploration, please `browseVignettes(\"mde\")`.\n\nTo raise an issue, please do so\n[here](https://github.com/Nelson-Gon/mde/issues).\n\nThank you, feedback is always welcome :)\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNelson-Gon%2Fmde","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNelson-Gon%2Fmde","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNelson-Gon%2Fmde/lists"}