{"id":13401159,"url":"https://github.com/rstudio/pointblank","last_synced_at":"2025-05-14T02:04:57.158Z","repository":{"id":37318633,"uuid":"82984541","full_name":"rstudio/pointblank","owner":"rstudio","description":"Data quality assessment and metadata reporting for data frames and database tables","archived":false,"fork":false,"pushed_at":"2025-04-08T20:49:02.000Z","size":110502,"stargazers_count":947,"open_issues_count":89,"forks_count":58,"subscribers_count":32,"default_branch":"main","last_synced_at":"2025-04-15T01:53:26.782Z","etag":null,"topics":["data-assertions","data-checker","data-dictionaries","data-frames","data-inference","data-management","data-profiler","data-quality","data-validation","data-verification","database-tables","easy-to-understand","reporting-tool","schema-validation","testing-tools","yaml-configuration"],"latest_commit_sha":null,"homepage":"https://rstudio.github.io/pointblank/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rstudio.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json","zenodo":null}},"created_at":"2017-02-24T00:22:29.000Z","updated_at":"2025-04-14T17:39:06.000Z","dependencies_parsed_at":"2023-12-22T03:55:10.416Z","dependency_job_id":"a548c999-b6eb-4505-95f8-b8e16a289604","html_url":"https://github.com/rstudio/pointblank","commit_stats":{"total_commits":5323,"total_committers":20,"mean_commits":266.15,"dds":0.07026113094119857,"last_synced_commit":"9effa69e640cd586b78f3aa30f498e60cdc80dce"},"previous_names":["rstudio/poi
ntblank","rich-iannone/pointblank"],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fpointblank","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fpointblank/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fpointblank/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rstudio%2Fpointblank/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rstudio","download_url":"https://codeload.github.com/rstudio/pointblank/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254052670,"owners_count":22006716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-assertions","data-checker","data-dictionaries","data-frames","data-inference","data-management","data-profiler","data-quality","data-validation","data-verification","database-tables","easy-to-understand","reporting-tool","schema-validation","testing-tools","yaml-configuration"],"created_at":"2024-07-30T19:00:59.345Z","updated_at":"2025-05-14T02:04:52.142Z","avatar_url":"https://github.com/rstudio.png","language":"R","readme":"\u003cdiv align=\"center\"\u003e\n\u003cbr /\u003e\n\n\u003ca href='https://rstudio.github.io/pointblank/'\u003e\u003cimg src=\"man/figures/logo.svg\" width=\"350px\"/\u003e\u003c/a\u003e\n\n\u003c!-- badges: start --\u003e\n[![CRAN 
status](https://www.r-pkg.org/badges/version/pointblank)](https://CRAN.R-project.org/package=pointblank)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/mit/)\n[![R-CMD-check](https://github.com/rstudio/pointblank/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/rstudio/pointblank/actions/workflows/R-CMD-check.yaml)\n[![Linting](https://github.com/rstudio/pointblank/actions/workflows/lint.yaml/badge.svg)](https://github.com/rstudio/pointblank/actions/workflows/lint.yaml)\n[![Codecov test coverage](https://codecov.io/gh/rstudio/pointblank/graph/badge.svg)](https://app.codecov.io/gh/rstudio/pointblank)\n[![Best Practices](https://bestpractices.coreinfrastructure.org/projects/4310/badge)](https://bestpractices.coreinfrastructure.org/projects/4310)\n[![The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n[![Monthly Downloads](https://cranlogs.r-pkg.org/badges/pointblank)](https://CRAN.R-project.org/package=pointblank)\n[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/pointblank)](https://CRAN.R-project.org/package=pointblank)\n[![Posit Cloud](https://img.shields.io/badge/Posit%20Cloud-pointblank%20Test%20Drive-blue?style=social\u0026logo=rstudio\u0026logoColor=75AADB)](https://rstudio.cloud/project/3411822)\n[![Discord](https://img.shields.io/discord/1345877328982446110?color=%237289da\u0026label=Discord)](https://discord.com/invite/YH7CybCNCQ)\n[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.1%20adopted-ff69b4.svg)](https://www.contributor-covenant.org/version/2/1/code_of_conduct.html)\n\u003c!-- badges: end --\u003e\n\n\u003chr style=\"color:transparent\" /\u003e\n\u003cbr /\u003e\n\u003c/div\u003e\n\nWith the **pointblank** package it’s really easy to methodically validate your\ndata whether in the form of data frames or as 
database tables. On top of the\nvalidation toolset, the package gives you the means to provide and keep\nup-to-date with the information that *defines* your tables.\n\nFor table *validation*, the *agent* object works with a large collection of\nsimple (yet powerful!) validation functions. We can enable much more\nsophisticated validation checks by using custom expressions, segmenting the\ndata, and selectively mutating the target table. The suite of validation\nfunctions ensures that *everything just works* no matter whether your table is\na data frame or a database table.\n\nSometimes, we want to maintain table *information* and update it when the table\ngoes through changes. For that, we can use an *informant* object plus associated\nfunctions to help define the metadata entries and present them as a data dictionary.\nJust like we can with validation, **pointblank** offers easy ways to have the\nmetadata updated so that this important documentation doesn't become stale.\n\n\u003chr\u003e\n\n\u003cimg src=\"man/figures/data_quality_reporting_workflow.svg\"\u003e\n\n## TABLE VALIDATIONS WITH AN AGENT AND DATA QUALITY REPORTING\n\nData validation can be carried out in a *Data Quality Reporting* workflow,\nultimately resulting in the production of a data quality analysis report.\nThis is most useful in a non-interactive mode where data quality for database\ntables and on-disk data files must be periodically checked. The **pointblank**\n*agent* is given a collection of validation functions to define validation\nsteps. We can get extracts of data rows that failed validation, set up custom\nfunctions that are invoked by exceeding set threshold failure rates, etc. Want\nto email the report regularly (or, only if certain conditions are met)? 
Yep,\nyou can do all that.\n\nHere is an example of how to use **pointblank** to validate a local table\nwith an *agent*.\n\n``` r\n# Generate a simple `action_levels` object to\n# set the `warn` state if a validation step\n# has a single 'fail' test unit\nal \u003c- action_levels(warn_at = 1)\n\n# Create a pointblank `agent` object, with the\n# tibble as the target table. Use three validation\n# functions, then, `interrogate()`. The agent will\n# then have some useful intel.\nagent \u003c- \n  dplyr::tibble(\n    a = c(5, 7, 6, 5, NA, 7),\n    b = c(6, 1, 0, 6,  0, 7)\n  ) %\u003e%\n  create_agent(\n    label = \"A very *simple* example.\",\n    actions = al\n  ) %\u003e%\n  col_vals_between(\n    columns = a,\n    left = 1,\n    right = 9,\n    na_pass = TRUE\n  ) %\u003e%\n  col_vals_lt(\n    columns = c, 12,\n    preconditions = ~ . %\u003e% dplyr::mutate(c = a + b)\n  ) %\u003e%\n  col_is_numeric(columns = c(a, b)) %\u003e%\n  interrogate()\n```\n\nThe reporting’s pretty sweet. We can get a **gt**-based report by\nprinting an *agent*.\n\n\u003cimg src=\"man/figures/agent_report.png\"\u003e\n\nThe **pointblank** package is designed to be straightforward yet\npowerful. And fast\\! Local data frames don’t take very long to validate\nextensively and all validation checks on remote tables are done entirely\nin-database. So we can add dozens or even hundreds of validation steps\nwithout any long waits for reporting.\n\nShould you want to perform validation checks on database or *Spark*\ntables, provide a `tbl_dbi` or `tbl_spark` object to `create_agent()`.\nThe **pointblank** package currently supports *PostgreSQL*, 
*MySQL*,\n*MariaDB*, *Microsoft SQL Server*, *Google BigQuery*, *DuckDB*, *SQLite*, and\n*Spark DataFrames* (through the **sparklyr** package).\n\nHere are some validation reports for the considerably larger \n`intendo::intendo_revenue` table.\n\n\u003ca href=\"https://rpubs.com/rich_i/intendo_rev_postgres\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Validation%20Report\u0026amp;message=postgres\u0026amp;color=466288\" alt=\"postgres\" /\u003e\u003c/a\u003e   \n\u003ca href=\"https://rpubs.com/rich_i/intendo_rev_mysql\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Validation%20Report\u0026amp;message=mysql\u0026amp;color=e2af55\" alt=\"mysql\" /\u003e\u003c/a\u003e   \n\u003ca href=\"https://rpubs.com/rich_i/intendo_rev_duckdb\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Validation%20Report\u0026amp;message=duckdb\u0026amp;color=black\" alt=\"duckdb\" /\u003e\u003c/a\u003e\n\n\u003chr\u003e\n\n\u003cimg src=\"man/figures/pipeline_data_validation_workflow.svg\"\u003e\n\n## VALIDATIONS DIRECTLY ON DATA\n\nThe *Pipeline Data Validation* workflow uses the same collection of validation\nfunctions but without need of an *agent*. This is useful for an ETL process\nwhere we want to periodically check data and trigger warnings, raise errors, or\nwrite out logs when exceeding specified failure thresholds. It’s a cinch to\nperform checks on import of the data and at key points during the transformation\nprocess, perhaps stopping data flow if things are unacceptable with regard to\ndata quality.\n\nThe following example uses the same three validation functions as before but,\nthis time, we use them directly on the data. The validation functions act as a\nfilter, passing data through unless execution is stopped by failing validations\nbeyond the set threshold. 
In this workflow, by default, an error will occur if\nthere is a single ‘fail’ test unit in any validation step:\n\n``` r\ndplyr::tibble(\n    a = c(5, 7, 6, 5, NA, 7),\n    b = c(6, 1, 0, 6,  0, 7)\n  ) %\u003e%\n  col_vals_between(\n    columns = a,\n    left = 1,\n    right = 9,\n    na_pass = TRUE\n  ) %\u003e%\n  col_vals_lt(\n    columns = c,\n    value = 12,\n    preconditions = ~ . %\u003e% dplyr::mutate(c = a + b)\n  ) %\u003e%\n  col_is_numeric(columns = c(a, b))\n```\n\n    Error: Exceedance of failed test units where values in `c` should have been \u003c `12`.\n    The `col_vals_lt()` validation failed beyond the absolute threshold level (1).\n    * failure level (2) \u003e= failure threshold (1) \n\nWe can downgrade this error to a warning with the `warn_on_fail()` helper\nfunction (assigning it to `actions`). In this way, the data will always be\nreturned, but warnings will appear.\n\n``` r\n# The `warn_on_fail()` function is a nice\n# shortcut for `action_levels(warn_at = 1)`;\n# it works great in this data checking workflow\n# (and the threshold can still be adjusted)\ndplyr::tibble(\n    a = c(5, 7, 6, 5, NA, 7),\n    b = c(6, 1, 0, 6,  0, 7)\n  ) %\u003e%\n  col_vals_between(\n    columns = a,\n    left = 1,\n    right = 9,\n    na_pass = TRUE,\n    actions = warn_on_fail()\n  ) %\u003e%\n  col_vals_lt(\n    columns = c,\n    value = 12,\n    preconditions = ~ . 
%\u003e% dplyr::mutate(c = a + b),\n    actions = warn_on_fail()\n  ) %\u003e%\n  col_is_numeric(\n    columns = c(a, b),\n    actions = warn_on_fail()\n  )\n```\n\n    #\u003e # A tibble: 6 x 2\n    #\u003e       a     b\n    #\u003e   \u003cdbl\u003e \u003cdbl\u003e\n    #\u003e 1     5     6\n    #\u003e 2     7     1\n    #\u003e 3     6     0\n    #\u003e 4     5     6\n    #\u003e 5    NA     0\n    #\u003e 6     7     7\n\n    Warning message:\n    Exceedance of failed test units where values in `c` should have been \u003c `12`.\n    The `col_vals_lt()` validation failed beyond the absolute threshold level (1).\n    * failure level (2) \u003e= failure threshold (1) \n\nShould you need more fine-grained thresholds and resultant actions, the\n`action_levels()` function can be used to specify multiple failure\nthresholds and side effects for each failure state. However, with\n`warn_on_fail()` and `stop_on_fail()` (applied by default, with\n`stop_at = 1`), you should have good enough options for this validation\nworkflow.\n\n\u003chr\u003e\n\n## VALIDATIONS IN R MARKDOWN DOCUMENTS\n\nUsing **pointblank** in an R Markdown workflow is enabled by default\nonce the **pointblank** library is loaded. The framework allows for\nvalidation testing within specialized validation code chunks where the\n`validate = TRUE` option is set. Using **pointblank** validation\nfunctions on data in these marked code chunks will flag overall failure\nif the stop threshold is exceeded anywhere. All errors are reported in\nthe validation code chunk after rendering the document to HTML, where\ngreen or red status buttons indicate whether all validations succeeded\nor failures occurred. 
Click them to reveal the otherwise hidden\nvalidation statements and any associated error messages.\n\n\u003cp align=\"center\"\u003e\n\n\u003cimg src=\"man/figures/pointblank_rmarkdown.png\" width=\"100%\" style=\"border:2px solid #021a40;\"\u003e\n\n\u003c/p\u003e\n\nThe above R Markdown document is available as a template in the RStudio\nIDE; it’s called `Pointblank Validation`.\n\n\u003chr\u003e\n\n## TABLE INFORMATION\n\nTable information can be synthesized in an *information management* workflow,\ngiving us a snapshot of a data table we care to collect information on.\nThe **pointblank** *informant* is fed a series of `info_*()` functions to\ndefine bits of information about a table. This info text can pertain to\nindividual columns, the table as a whole, and whatever additional information\nmakes sense for your organization. We can even glean little snippets of\ninformation (like column stats or sample values) from the target table with\n`info_snippet()` and the `snip_*()` functions and mix them into the data\ndictionary wherever they're needed.\n\nHere is an example of how to use **pointblank** to incorporate pieces of\ninfo text into an *informant* object.\n\n``` r\n# Create a pointblank `informant` object, with the\n# tibble as the target table. Use a few information\n# functions and end with `incorporate()`. The informant\n# will then show you information about the tibble.\ninformant \u003c- \n  dplyr::tibble(\n    a = c(5, 7, 6, 5, NA, 7),\n    b = c(6, 1, 0, 6,  0, 7)\n  ) %\u003e%\n  create_informant(\n    label = \"A very *simple* example.\",\n    tbl_name = \"example_tbl\"\n  ) %\u003e%\n  info_tabular(\n    description = \"This two-column table is nothing all that\n    interesting, but, it's fine for examples on **GitHub**\n    `README` pages. Column names are `a` and `b`. ((Cool stuff))\"\n  ) %\u003e%\n  info_columns(\n    columns = a,\n    info = \"This column has an `NA` value. 
[[Watch out!]]\u003c\u003ccolor: red;\u003e\u003e\"\n  ) %\u003e%\n  info_columns(\n    columns = a,\n    info = \"Mean value is `{a_mean}`.\"\n  ) %\u003e%\n  info_columns(\n    columns = b,\n    info = \"Like column `a`. The lowest value is `{b_lowest}`.\"\n  ) %\u003e%\n  info_columns(\n    columns = b,\n    info = \"The highest value is `{b_highest}`.\"\n  ) %\u003e%\n  info_snippet(\n    snippet_name = \"a_mean\",\n    fn = ~ . %\u003e% .$a %\u003e% mean(na.rm = TRUE) %\u003e% round(2)\n  ) %\u003e%\n  info_snippet(snippet_name = \"b_lowest\", fn = snip_lowest(\"b\")) %\u003e%\n  info_snippet(snippet_name = \"b_highest\", fn = snip_highest(\"b\")) %\u003e%\n  info_section(\n    section_name = \"further information\", \n    `examples and documentation` = \"Examples for how to use the\n    `info_*()` functions (and many more) are available at the\n    [**pointblank** site](https://rstudio.github.io/pointblank/).\"\n  ) %\u003e%\n  incorporate()\n```\n\nBy printing the *informant* we get the table information report.\n\n\u003cimg src=\"man/figures/informant_report.png\"\u003e\n\nHere is a link to a hosted information report for the `intendo::intendo_revenue` table:\n\n[![Information Report for intendo::intendo_revenue](https://img.shields.io/static/v1?label=Information%20Report\u0026message=intendo::intendo_revenue\u0026color=466288)](https://rpubs.com/rich_i/info_revenue_postgres)\n\n\u003chr\u003e\n\n## TABLE SCANS\n\nWe can use the `scan_data()` function to generate a comprehensive summary of a tabular dataset. This allows us to quickly understand what's in the dataset and it helps us determine if there are any peculiarities within the data. Scanning the `dplyr::storms` dataset with `scan_data(tbl = dplyr::storms)` gives us an interactive HTML report. 
Here are a few of them, published in **RPubs**:\n\n[![Table Scan of dplyr::storms](https://img.shields.io/static/v1?label=Table%20Scan\u0026message=dplyr::storms\u0026color=blue)](https://rpubs.com/rich_i/scan_data_storms)\n\n[![Table Scan of pointblank::game_revenue](https://img.shields.io/static/v1?label=Table%20Scan\u0026message=pointblank::game_revenue\u0026color=blue)](https://rpubs.com/rich_i/scan_data_game_revenue)\n\n\nDatabase tables can be used with `scan_data()` as well. Here are two examples using (1) the `full_region` table of the **Rfam** database (hosted publicly at `mysql-rfam-public.ebi.ac.uk`) and (2) the `assembly` table of the **Ensembl** database (hosted publicly at `ensembldb.ensembl.org`).\n\n[![Rfam:\nfull\\_region](https://img.shields.io/static/v1?label=Table%20Scan\u0026message=Rfam:%20full_region\u0026color=green)](https://rpubs.com/rich_i/rfam_full_region)\n\n[![Ensembl:\nassembly](https://img.shields.io/static/v1?label=Table%20Scan\u0026message=Ensembl:%20assembly\u0026color=green)](https://rpubs.com/rich_i/ensembl_assembly)\n\n\u003chr\u003e\n\n## OVERVIEW OF PACKAGE FUNCTIONS\n\nThere are many functions available in **pointblank** for understanding data\nquality and creating data documentation. Here is an overview of all of them,\ngrouped by family. For much more information on these, visit the\n[documentation website](https://rstudio.github.io/pointblank/) or take\na *Test Drive* in the [Posit Cloud project](https://rstudio.cloud/project/3411822).\n\n\u003cp align=\"center\"\u003e\n\n\u003cimg src=\"man/figures/pointblank_functions.svg\" width=\"100%\"\u003e\n\n\u003c/p\u003e\n\n\u003chr\u003e\n\n## INSTALLATION\n\nWant to try this out? 
The **pointblank** package is available on **CRAN**:\n\n``` r\ninstall.packages(\"pointblank\")\n```\n\nYou can also install the development version of **pointblank** from **GitHub**:\n\n``` r\n# install.packages(\"pak\")\npak::pak(\"rstudio/pointblank\")\n```\n\n## Getting in Touch\n\nIf you encounter a bug, have usage questions, or want to share ideas to\nmake this package better, feel free to file an\n[issue](https://github.com/rstudio/pointblank/issues).\n\nWanna talk about data validation in a more relaxed setting? Join our\n[_Discord server_](https://discord.com/invite/YH7CybCNCQ)! This is a great option for asking about\nthe development of Pointblank, pitching ideas that may become features, and just sharing your ideas!\n\n[![Discord Server](https://img.shields.io/badge/Discord-Chat%20with%20us-blue?style=social\u0026logo=discord\u0026logoColor=purple)](https://discord.com/invite/YH7CybCNCQ)\n\n-----\n\n## Code of Conduct\n\nPlease note that the pointblank project is released with a [contributor code of\nconduct](https://www.contributor-covenant.org/version/2/1/code_of_conduct.html).\u003cbr\u003eBy\nparticipating in this project you agree to abide by its terms.\n\n## 📄 License\n\n**pointblank** is licensed under the MIT license.\nSee the [`LICENSE.md`](LICENSE.md) file for more details.\n\n© Posit Software, PBC.\n\n## 🏛️ Governance\n\nThis project is primarily maintained by\n[Rich Iannone](https://bsky.app/profile/richmeister.bsky.social). Other authors may occasionally\nassist with some of these duties.\n\n\u003chr\u003e\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frstudio%2Fpointblank","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frstudio%2Fpointblank","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frstudio%2Fpointblank/lists"}