{"id":14068370,"url":"https://github.com/the-Hull/datacleanr","last_synced_at":"2025-07-30T04:31:01.377Z","repository":{"id":38371442,"uuid":"256215222","full_name":"the-Hull/datacleanr","owner":"the-Hull","description":"Interactive and Reproducible Data Cleaning","archived":false,"fork":false,"pushed_at":"2025-05-08T17:59:56.000Z","size":25290,"stargazers_count":22,"open_issues_count":3,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-07-14T16:51:18.679Z","etag":null,"topics":["annotation-tool","data-cleaning","outlier-detection","outlier-removal","reproducibility"],"latest_commit_sha":null,"homepage":"https://the-hull.github.io/datacleanr","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/the-Hull.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-16T12:56:58.000Z","updated_at":"2025-05-21T00:19:54.000Z","dependencies_parsed_at":"2022-08-21T06:50:39.116Z","dependency_job_id":null,"html_url":"https://github.com/the-Hull/datacleanr","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/the-Hull/datacleanr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/the-Hull%2Fdatacleanr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/the-Hull%2Fdatacleanr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/the-Hull%2Fdatacleanr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/the-Hull%2Fdatacleanr/manifests","owner_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/owners/the-Hull","download_url":"https://codeload.github.com/the-Hull/datacleanr/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/the-Hull%2Fdatacleanr/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267809431,"owners_count":24147455,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-30T02:00:09.044Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation-tool","data-cleaning","outlier-detection","outlier-removal","reproducibility"],"created_at":"2024-08-13T07:06:07.574Z","updated_at":"2025-07-30T04:30:59.964Z","avatar_url":"https://github.com/the-Hull.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  eval = FALSE,\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n\n# datacleanr  \u003cimg src=\"man/figures/dcr_logo.png\" align=\"right\" width = \"150\"/\u003e\n\n\u003c!-- badges: start --\u003e\n\u003c!-- [![CRAN status](https://www.r-pkg.org/badges/version/datacleanr)](https://CRAN.R-project.org/package=datacleanr) --\u003e\n\u003c!-- [![Travis build status](https://travis-ci.org/the-Hull/datacleanr.svg?branch=master)](https://travis-ci.org/the-Hull/datacleanr) --\u003e\n[![CircleCI](https://circleci.com/gh/Appsilon/ci.example.svg?style=svg)](https://app.circleci.com/pipelines/github/the-Hull/datacleanr)\n[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)\n[![](https://cranlogs.r-pkg.org/badges/datacleanr)](https://cran.r-project.org/package=datacleanr)\n[![](https://cranlogs.r-pkg.org/badges/grand-total/datacleanr?color=brightgreen)](https://cran.r-project.org/package=datacleanr)\n\u003c!-- badges: end --\u003e\n\n\n\n\n`datacleanr` is a flexible and efficient tool for **interactive** data cleaning, and is inherently interoperable, as it seamlessly integrates into **reproducible** data analyses pipelines in `R`.\n\nIt can deal with nested **tabular**, as well as **spatial** and **time series** data.\n\n\n\n\n\n## Installation\n\nThe latest release on CRAN can be installed using:\n\n```{r}\ninstall.packages(\"datacleanr\")\n```\n\n\nYou can install the development version of `datacleanr`  with:\n\n``` r\nremotes::install_github(\"the-hull/datacleanr\")\n```\n\n**If you are using macOS, please make sure you have `XQuartz` installed, especially if you've recently updated your system.**\n**See these instructions here: [https://CRAN.R-project.org/bin/macosx/](https://CRAN.R-project.org/bin/macosx/)**\n\n\n## Design\n\n`datacleanr` is 
developed using the [shiny](https://shiny.rstudio.com/) package, and relies on informative summaries, visual cues and interactive data selection and annotation.\nAll data-altering operations are documented and converted to valid `R` code (a **reproducible recipe**) that can be copied, sent to an active `RStudio` script, or saved to disk.\n\nThere are **four tabs** in the app for these tasks:\n\n- **Set-up \u0026 Overview**: define nesting structure based on (multiple) groups.\n- **Filtering**: use `R` expressions to filter/subset data.\n- **Visual Cleaning and Annotating**: generate bivariate (time series) plots and maps, as well as highlight and annotate individual observations. Cycle through nested groups to expedite exploration and cleaning. Histograms of original vs. 'cleaned' data can be generated.\n- **Extract**: generate the reproducible recipe and define outputs. **`dcr_app` also returns all intermediate and final outputs invisibly to the active `R` session for later use (e.g. when batch processing)**\n\nNote, maps require columns `lon` and `lat` (X and Y) in decimal degrees in the data set to render.\n\n## Additional features\n\n- **Grouping**: the grouping defined in the \"Set-up and Overview\" tab is carried forward through the app. These groups can be used to cycle through nested/granular data, and considerably speed up exploration and cleaning. These groups are also available for filtering (Filtering tab), where filter expressions can be scoped to group level (i.e. no groups, individual, all groups).\n- **Interoperability**: \n  when a logical (`TRUE`/`FALSE`) column named `.dcrflag` is present, corresponding observations are rendered with different symbols in plots and maps. Use this feature to validate or cross-check external quality control or outlier flagging methods.\n- **Batching**:\n  If data sets are too large, or too deeply nested (e.g. individual, plot, site, region, etc.), we recommend a split-combine approach to expedite the processing. 
\n  \n```{r}\n# prepare data into species sub-sets\niris_split \u003c- split(x = iris,\n                    f = iris$Species)\n# run for each species\ndcr_iris \u003c- lapply(iris_split, \n                   function(split){\n                       datacleanr::dcr_app(split)\n                   })\n```\n  \n\n## Getting started\n\nThe documentation for `dcr_app()` (see `?dcr_app`) explains the basic use and all features.\nThroughout the app, there are conveniently-placed help links that provide details on features.\n\n## Demonstration\n\nLaunch `datacleanr`'s interactive app with `dcr_app()`.\nThe following examples demonstrate basic use and highlight features across the four app tabs.\n\n\n### 1. Set-up \u0026 Overview\n\nDefine the grouping structure (used throughout the app for scoping filters and plotting), and generate an informative overview.\n\n```{r}\nlibrary(datacleanr)\n\n# group by species\ndcr_app(iris)\n```\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_setup.gif\" width = \"1000\" align = \"center\"/\u003e\n\n### 2. Filtering\n\n\t\nAdd/Remove filter statement boxes, and apply (valid) expressions - either to the entire data set, or scoped to individual groups.\nFiltering relies on `R` expressions passed to `dplyr::filter()`, so, for example, valid statements for `iris` are:\n\n```{r}\n    Species == 'setosa'\n    Species %in% c('setosa','versicolor')\n    Sepal.Width \u003e quantile(Sepal.Width, 0.05)\n```\n\nAny function returning a logical vector (i.e. `TRUE`/`FALSE`) can be employed here!\n\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_filter.gif\" width = \"1000\" align = \"center\"/\u003e\n\n\n### 3. Visualizing and annotating\n\nInteractive visualizations allow seamless scrolling, panning and zooming to select and annotate individual observations (or sections with the lasso/box select tool). 
\nShow and hide groups using the group selection table (left) or the legend (right). \n\n#### 3.1 General highlighting and annotating\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_select.gif\" width = \"1000\" align = \"center\"/\u003e\n\n#### 3.2 Using `.dcrflag` to interface with external QA/QC\n\n```{r}\nlibrary(datacleanr)\nlibrary(dplyr)\n\niris_mod \u003c- iris %\u003e%\ngroup_by(Species) %\u003e%\n  # .dcrflag provides additional visual cue in visualization tab\n  # based on TRUE/FALSE \nmutate(.dcrflag = Sepal.Width \u003c quantile(Sepal.Width, 0.05))\n\n\ndcr_app(iris_mod)\n\n\n```\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_select_dcrflag.gif\" width = \"1000\" align = \"center\"/\u003e\n\n#### 3.3 Time Series\n\nAny `numeric` or `POSIXct` column (in X or Y dimension) can be used to visualize time series. \nUse the `Toggle Lines` button above the plot to facilitate exploration.\n\n**Example 1**:\n\n\n```{r}\nlibrary(dplyr)\n\ndplyr::glimpse(treering)\ntree_df \u003c- data.frame(year = -6000:1979,\n           val = treering)\n\n# make synthetic data\ntree_data \u003c- list(tree_A = tree_df,\n                  tree_B = tree_df %\u003e% \n                      mutate(val = val + rnorm(nrow(.), 0.5, 0.2)),\n                  tree_C = tree_df %\u003e% \n                      mutate(val = val + rnorm(nrow(.), mean = -0.03, 0.1))) %\u003e% \n    bind_rows(.id = \"tree\")\n\n# group by tree and inspect\ndcr_app(tree_data)\n\n\n```\n\n\n\u003e (Note, selections are arbitrary and for demonstration only)\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_select_timeseries.gif\" width = \"1000\" align = \"center\"/\u003e\n\n\n\n\n**Example 2**:  \n\n\u003e No GIF\n\n```{r}\n\nlibrary(dplyr)\nlibrary(lubridate)\ndata(\"storms\", package = \"dplyr\")\n\nstorms_mod \u003c- storms %\u003e% \n    mutate(timestamp = 
lubridate::ymd_h(paste(year, month, day, hour)))\n\n# Group by name (198 groups)\n# Check \"Emily\"\ndcr_app(storms_mod)\n\n```\n\n\n#### 3.4 Spatial\n\nInteractive maps rely on [Mapbox](https://www.mapbox.com/) for plotting.\nTherefore, you will need to make an account, from which an access token needs to be copied into your `.Renviron` (e.g. `MAPBOX_TOKEN=your_copied_token`).\nA simple way to do this is using the convenient `usethis` package to access the file:\n\n```{r}\nusethis::edit_r_environ()\n```\n\nSelect columns `lon` and `lat` for plotting to get started.\n\n**Example 1**\n\n```{r}\nlibrary(datacleanr)\nlibrary(dplyr)\n\nairport_data \u003c- read.csv('https://plotly-r.com/data-raw/airport_locations.csv') %\u003e%\n    rename(lon = long)\n\n# group by state\ndcr_app(airport_data)\n\n\n```\n\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_select_spatial.gif\" width = \"1000\" align = \"center\"/\u003e\n\n**Example 2**\n\n\u003e No GIF\n\n\n```{r}\n\nlibrary(dplyr)\nlibrary(lubridate)\ndata(\"storms\", package = \"dplyr\")\n\n\nstorms_mod \u003c- storms %\u003e% \n    rename(lon = long)\n\n# Group by name (198 groups)\n# Check \"Bonnie\"\ndcr_app(storms_mod)\n\n```\n\n### 4. Extract (Reproducible Recipe)\n\nAll grouping, filtering and selections/annotations are translated to `R` code, which can be sent to an `RStudio` script, copied to the clipboard, or - when `dcr_app` is launched with a file path - save options are made available.\nFor large selections/annotations we recommend saving the script separately, and sourcing it (i.e. `source(\"your_datacleanr_script.R\")`) during later analyses.\n\n**Caution: When selections / annotations are greater than ~ 1000 points, it is recommended to use `datacleanr` with an `*.RDS` file (see below). 
This is because the resulting Reproducible Recipe (script) can slow down the RStudio IDE if it has more than a few thousand lines. The next version of `datacleanr` will allow choosing between script-only recipes and recipes that use an intermediate file for storing annotations. Both approaches, as currently implemented, are shown below.**\n\n\n**Example 1**\n\nLaunching with an object from `R`:\n\n```{r}\nlibrary(datacleanr)\ndcr_app(iris)\n\n```\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_extract.gif\" width = \"1000\" align = \"center\"/\u003e\n\nThe output from the Extract tab:\n\n```{r}\n# datacleaning with datacleanr (0.0.1)\n# ##------ Wed Oct 07 12:54:03 2020 ------##\n\nlibrary(dplyr)\nlibrary(datacleanr)\n\n#  adding column for unique IDs;\niris$.dcrkey \u003c- seq_len(nrow(iris))\n\n\niris \u003c- dplyr::group_by(iris, Species)\n\n#  stats and scoping level for filtering\nfilter_conditions \u003c- structure(list(filter = \"Sepal.Width \u003e 2.7\", grouping = list(NULL)), row.names = c(NA, \n    -1L), class = c(\"tbl_df\", \"tbl\", \"data.frame\"))\n\n#  applying (scoped) filtering by groups;\niris \u003c- datacleanr::filter_scoped_df(dframe = iris, condition_df = filter_conditions)\n\n#  observations from manual selection (Viz tab);\niris_outlier_selection \u003c- structure(list(.dcrkey = c(15L, 16L, 19L, 34L), .annotation = c(\"\", \"\", \"\", \n    \"\")), class = \"data.frame\", row.names = c(NA, -4L))\n\n#  create data set with annotation column (non-outliers are NA);\niris \u003c- dplyr::left_join(iris, iris_outlier_selection, by = \".dcrkey\")\n\n# remove comment below to drop manually selected obs in data set;\n# iris  \u003c- iris %\u003e% dplyr::filter(is.na(.annotation))\n\n```\n\n**Example 2**\n\nLaunching with an `.RDS` from disk:\n\n```{r}\n\nsaveRDS(iris, file = \"./testiris.Rds\")\n\nlibrary(datacleanr)\ndcr_app(\"./testiris.Rds\")\n\n```\n\n\u003cimg 
src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_extract_file.gif\" width = \"1000\" align = \"center\"/\u003e\n\n\n\n----\n\n## Examples:\n\n### 1. Exploring soil respiration with **COSORE**:\n\n**COSORE** is a community-driven soil respiration database, recently introduced with a manuscript published [here]( https://doi.org/10.1111/gcb.15353) by Bond-Lamberty *et al.*.\nThe database provides soil respiration flux estimates, as well as meta data across multiple data sets. \nLet's explore!\n\n\n```{r}\nremotes::install_github(\"bpbond/cosore\")\nlibrary(dplyr)\n\n# check data base info\ndb_info \u003c- cosore::csr_database()\ntibble::glimpse(db_info)\n\n# grab one data set and explore in detail\ndset \u003c- \"d20190409_ANJILELI\"\nanjilleli \u003c- cosore::csr_dataset(dset)\ntibble::glimpse(anjilleli$description)\n\n\ndatacleanr::dcr_app(anjilleli$data)\n```\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_cosore_single.gif\" width = \"1000\" align = \"center\"/\u003e\n\n**Explore sampling locations**:\n\n```{r}\n# Check location info\ndb_info \u003c- db_info %\u003e%\n    mutate(lon = CSR_LONGITUDE,\n           lat = CSR_LATITUDE)\ndatacleanr::dcr_app(db_info)\n\n```\n\n\u003e No GIF\n\n**Explore nested data sets**:\n\n```{r}\n# grab all data from ZHANG\nzhang \u003c- cosore::csr_table(\"data\", c(\"d20190424_ZHANG_maple\",\n                                        \"d20190424_ZHANG_oak\")) %\u003e%\n  # adjust for grouping\n  mutate(CSR_PORT = as.factor(CSR_PORT))\n\n# group by CSR_DATASET and CSR_PORT\ndatacleanr::dcr_app(zhang)\n\n```\n\n\u003cimg src=\"https://raw.githubusercontent.com/the-Hull/datacleanr/master/man/figures/readme_cosore.gif\" width = \"1000\" align = \"center\"/\u003e\n\n---\n\n\nPlease note that the `datacleanr` project is released with a\n[Contributor Code of 
Conduct](https://raw.githubusercontent.com/the-Hull/datacleanr/master/.github/CODE_OF_CONDUCT.md).\nBy contributing to this project, you agree to abide by its terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthe-Hull%2Fdatacleanr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthe-Hull%2Fdatacleanr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthe-Hull%2Fdatacleanr/lists"}