{"id":13665984,"url":"https://github.com/tonyfischetti/assertr","last_synced_at":"2025-12-12T00:50:06.742Z","repository":{"id":26099898,"uuid":"29544018","full_name":"tonyfischetti/assertr","owner":"tonyfischetti","description":"Assertive programming for R analysis pipelines","archived":false,"fork":false,"pushed_at":"2024-04-11T23:36:15.000Z","size":14621,"stargazers_count":466,"open_issues_count":12,"forks_count":34,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-07-01T10:25:33.502Z","etag":null,"topics":["analysis-pipeline","assertion-library","assertion-methods","assertions","peer-reviewed","predicate-functions","r","r-package","rstats"],"latest_commit_sha":null,"homepage":"https://docs.ropensci.org/assertr","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tonyfischetti.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json"}},"created_at":"2015-01-20T18:11:23.000Z","updated_at":"2024-07-01T10:26:14.088Z","dependencies_parsed_at":"2023-09-21T19:22:17.850Z","dependency_job_id":"bb5f592a-27a9-4073-9a0c-09c356bb021a","html_url":"https://github.com/tonyfischetti/assertr","commit_stats":{"total_commits":214,"total_committers":21,"mean_commits":10.19047619047619,"dds":0.3317757009345794,"last_synced_commit":"e4c7e38602b8f6165253e5d3bafc8f14ed797b52"},"previous_names":["tonyfischetti/assertr","ropensci/assertr"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonyfischetti%2Fassertr","tags_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/tonyfischetti%2Fassertr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonyfischetti%2Fassertr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tonyfischetti%2Fassertr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tonyfischetti","download_url":"https://codeload.github.com/tonyfischetti/assertr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247226215,"owners_count":20904465,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis-pipeline","assertion-library","assertion-methods","assertions","peer-reviewed","predicate-functions","r","r-package","rstats"],"created_at":"2024-08-02T06:00:55.172Z","updated_at":"2025-12-12T00:50:06.702Z","avatar_url":"https://github.com/tonyfischetti.png","language":"R","readme":"assertr\n===\n\n![assertr logo](https://thepolygram.com/assertrlogo.png)\n\n\u003c!-- badges: start --\u003e\n[![R-CMD-check](https://github.com/ropensci/assertr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/assertr/actions/workflows/R-CMD-check.yaml)\n[![Codecov test coverage](https://codecov.io/gh/ropensci/assertr/branch/master/graph/badge.svg)](https://app.codecov.io/gh/ropensci/assertr?branch=master)\n[![CRAN status](https://www.r-pkg.org/badges/version/assertr)](https://CRAN.R-project.org/package=assertr)\n[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/assertr)](https://cran.r-project.org/package=assertr)\n[![rOpenSci 
software peer-review](https://badges.ropensci.org/23_status.svg)](https://github.com/ropensci/software-review/issues/23)\n\u003c!-- badges: end --\u003e\n\n### What is it?\nThe assertr package supplies a suite of functions designed to verify\nassumptions about data early in an analysis pipeline so that\ndata errors are spotted early and can be addressed quickly.\n\nThis package does not need to be used with the magrittr/dplyr piping\nmechanism but the examples in this README use them for clarity.\n\n### Installation\n\nYou can install the latest version on CRAN like this\n```r\n    install.packages(\"assertr\")\n```\n\nor you can install the bleeding-edge development version like this:\n```r\n    install.packages(\"devtools\")\n    devtools::install_github(\"ropensci/assertr\")\n```\n### What does it look like?\nThis package offers five assertion functions, `assert`, `verify`,\n`insist`, `assert_rows`, and `insist_rows`, that are designed to be used\nshortly after data-loading in an analysis pipeline...\n\nLet’s say, for example, that the R’s built-in car dataset, `mtcars`, was not \nbuilt-in but rather procured from an external source that was known for making\nerrors in data entry or coding. Pretend we wanted to find the average\nmiles per gallon for each number of engine cylinders. 
We might want to first,\nconfirm\n- that it has the columns \"mpg\", \"vs\", and \"am\"\n- that the dataset contains more than 10 observations\n- that the column for 'miles per gallon' (mpg) is a positive number\n- that the column for ‘miles per gallon’ (mpg) does not contain a datum\nthat is outside 4 standard deviations from its mean, and\n- that the am and vs columns (automatic/manual and v/straight engine,\nrespectively) contain 0s and 1s only\n- each row contains at most 2 NAs\n- each row is unique *jointly* between the \"mpg\", \"am\", and \"wt\" columns\n- each row's mahalanobis distance is within 10 median absolute deviations of\nall the distances (for outlier detection)\n\n\nThis could be written (in order) using `assertr` like this:\n\n```r\n    library(dplyr)\n    library(assertr)\n\n    mtcars %\u003e%\n      verify(has_all_names(\"mpg\", \"vs\", \"am\", \"wt\")) %\u003e%\n      verify(nrow(.) \u003e 10) %\u003e%\n      verify(mpg \u003e 0) %\u003e%\n      insist(within_n_sds(4), mpg) %\u003e%\n      assert(in_set(0,1), am, vs) %\u003e%\n      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %\u003e%\n      assert_rows(col_concat, is_uniq, mpg, am, wt) %\u003e%\n      insist_rows(maha_dist, within_n_mads(10), everything()) %\u003e%\n      group_by(cyl) %\u003e%\n      summarise(avg.mpg=mean(mpg))\n```\n\nIf any of these assertions were violated, an error would have been raised\nand the pipeline would have been terminated early.\n\nLet's see what the error message look like when you chain\na bunch of failing assertions together.\n\n```r\n    \u003e mtcars %\u003e%\n    +   chain_start %\u003e%\n    +   assert(in_set(1, 2, 3, 4), carb) %\u003e%\n    +   assert_rows(rowMeans, within_bounds(0,5), gear:carb) %\u003e%\n    +   verify(nrow(.)==10) %\u003e%\n    +   verify(mpg \u003c 32) %\u003e%\n    +   chain_end\n    There are 7 errors across 4 verbs:\n    -\n             verb redux_fn           predicate     column index value\n    1      assert   
  \u003cNA\u003e  in_set(1, 2, 3, 4)       carb    30   6.0\n    2      assert     \u003cNA\u003e  in_set(1, 2, 3, 4)       carb    31   8.0\n    3 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    30   5.5\n    4 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    31   6.5\n    5      verify     \u003cNA\u003e       nrow(.) == 10       \u003cNA\u003e     1    NA\n    6      verify     \u003cNA\u003e            mpg \u003c 32       \u003cNA\u003e    18    NA\n    7      verify     \u003cNA\u003e            mpg \u003c 32       \u003cNA\u003e    20    NA\n\n    Error: assertr stopped execution\n```\n\n### What does `assertr` give me?\n\n- `verify` - takes a data frame (its first argument is provided by\nthe `%\u003e%` operator above), and a logical (boolean) expression. Then, `verify`\nevaluates that expression using the scope of the provided data frame. If any\nof the logical values of the expression's result are `FALSE`, `verify` will\nraise an error that terminates any further processing of the pipeline.\n\n- `assert` - takes a data frame, a predicate function, and an arbitrary\nnumber of columns to apply the predicate function to. The predicate function\n(a function that returns a logical/boolean value) is then applied to every\nelement of the columns selected, and will raise an error if it finds any\nviolations. Internally, the `assert` function uses `dplyr`'s\n`select` function to extract the columns to test the predicate function on.\n\n- `insist` - takes a data frame, a predicate-generating function, and an\narbitrary number of columns. For each column, the the predicate-generating\nfunction is applied, returning a predicate. The predicate is then applied to\nevery element of the columns selected, and will raise an error if it finds any\nviolations. 
The reason for using a predicate-generating function to return a\npredicate to use against each value in each of the selected columns is so\nthat, for example, bounds can be dynamically generated based on what the data\nlook like; this is the only way to, say, create bounds that check if each datum is\nwithin x z-scores, since the standard deviation isn't known a priori.\nInternally, the `insist` function uses `dplyr`'s `select` function to extract\nthe columns to test the predicate function on.\n\n- `assert_rows` - takes a data frame, a row reduction function, a predicate\nfunction, and an arbitrary number of columns to apply the predicate function\nto. The row reduction function is applied to the data frame, and returns a value\nfor each row. The predicate function is then applied to every element of the vector\nreturned from the row reduction function, and will raise an error if it finds\nany violations. This functionality is useful, for example, in conjunction with\nthe `num_row_NAs()` function to ensure that there are fewer than a certain number of\nmissing values in each row. Internally, the `assert_rows` function uses\n`dplyr`'s `select` function to extract the columns to test the predicate\nfunction on.\n\n- `insist_rows` - takes a data frame, a row reduction function, a\npredicate-generating\nfunction, and an arbitrary number of columns to apply the predicate function\nto. The row reduction function is applied to the data frame, and returns a value\nfor each row. The predicate-generating function is then applied to the vector\nreturned from the row reduction function and the resultant predicate is\napplied to each element of that vector. It will raise an error if it finds any\nviolations. 
This functionality is useful, for example, in conjunction with\nthe `maha_dist()` function to ensure that there are no flagrant outliers.\nInternally, the `insist_rows` function uses `dplyr`'s `select` function to\nextract the columns to test the predicate function on.\n\n\n`assertr` also offers four (so far) predicate functions designed to be used\nwith the `assert` and `assert_rows` functions:\n\n- `not_na` - that checks if an element is not NA\n- `within_bounds` - that returns a predicate function that checks if a numeric\nvalue falls within the bounds supplied\n- `in_set` - that returns a predicate function that checks if an element is\na member of the set supplied (also allows the inverse, for \"not in set\"), and\n- `is_uniq` - that checks to see if each element appears only once\n\n\nand predicate generators designed to be used with the `insist` and `insist_rows`\nfunctions:\n\n- `within_n_sds` - dynamically creates bounds, based on standard z-scores, to\ncheck vector elements against\n- `within_n_mads` - a better method for dynamically creating bounds to check vector\nelements against, based on 'robust' z-scores (using median absolute deviation)\n\nand the following row reduction functions designed to be used with `assert_rows`\nand `insist_rows`:\n\n- `num_row_NAs` - counts the number of missing values in each row\n- `maha_dist` - computes the Mahalanobis distance of each row (for outlier\ndetection). 
It will coerce categorical variables into numerics if it needs to.\n- `col_concat` - concatenates the columns of each row into a string\n- `duplicated_across_cols` - checks if a row contains a duplicated value\nacross columns\n\nand, finally, some other utilities for use with `verify`:\n\n- `has_all_names` - checks if the data frame or list has all supplied names\n- `has_only_names` - checks that a data frame or list has _only_ the names\nrequested\n- `has_class` - checks if passed data has a particular class\n\n\n### More info\n\nFor more info, check out the `assertr` vignette:\n```r\n    \u003e vignette(\"assertr\")\n```\nOr [read it here](https://CRAN.R-project.org/package=assertr/vignettes/assertr.html)\n\n# [![ropensci\\_footer](https://ropensci.org/public_images/github_footer.png)](https://ropensci.org/)\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftonyfischetti%2Fassertr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftonyfischetti%2Fassertr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftonyfischetti%2Fassertr/lists"}