{"id":13508026,"url":"https://github.com/dataproofer/Dataproofer","last_synced_at":"2025-03-30T09:33:29.108Z","repository":{"id":6084575,"uuid":"49033842","full_name":"dataproofer/Dataproofer","owner":"dataproofer","description":"A proofreader for your data","archived":false,"fork":false,"pushed_at":"2023-03-01T08:11:29.000Z","size":24635,"stargazers_count":686,"open_issues_count":62,"forks_count":54,"subscribers_count":37,"default_branch":"dev","last_synced_at":"2024-04-13T17:38:33.033Z","etag":null,"topics":["cli","command-line","csv","data-analysis","data-mining","data-science","excel","nodejs","spreadsheet"],"latest_commit_sha":null,"homepage":"http://dataproofer.org/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataproofer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":".github/FUNDING.yml","license":null,"code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null},"funding":{"github":["dataproofer"],"open_collective":"dataproofer"}},"created_at":"2016-01-05T01:26:11.000Z","updated_at":"2024-04-01T08:38:40.000Z","dependencies_parsed_at":"2023-02-18T10:17:34.304Z","dependency_job_id":null,"html_url":"https://github.com/dataproofer/Dataproofer","commit_stats":{"total_commits":395,"total_committers":10,"mean_commits":39.5,"dds":0.4455696202531646,"last_synced_commit":"0435595218bbfd8f4977260f9d24f50375233e41"},"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataproofer%2FDataproofer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataproofer%2FDataproofer/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/dataproofer%2FDataproofer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataproofer%2FDataproofer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataproofer","download_url":"https://codeload.github.com/dataproofer/Dataproofer/tar.gz/refs/heads/dev","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246043924,"owners_count":20714495,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","command-line","csv","data-analysis","data-mining","data-science","excel","nodejs","spreadsheet"],"created_at":"2024-08-01T02:00:46.028Z","updated_at":"2025-03-30T09:33:26.572Z","avatar_url":"https://github.com/dataproofer.png","language":"JavaScript","readme":"# Dataproofer\n\n![\"\"](http://i.imgur.com/n38R14S.png)\n\n## A proofreader for your data\n\nEvery day, more and more data is created. Journalists, analysts, and data visualizers turn that data into stories and insights.\n\nBut before you can make use of any data, you need to know if it’s reliable. Is it weird? Is it clean? Can I use it to write or make a viz?\n\nThis used to be a long manual process, using valuable time and introducing the possibility for human error. 
People can’t always spot every mistake every time, no matter how hard they try.\n\nDataproofer is built to automate this process of checking a dataset for errors or potential mistakes.\n\n## Getting Started (Desktop)\n\nDownload a .zip of the latest release [from the Dataproofer releases page](https://github.com/dataproofer/Dataproofer/releases).\n\nDrag the app into your applications folder.\n\nSelect your dataset, which can be either a CSV on your computer or a Google Sheet that you’ve published to the web.\n\nOnce you select your dataset, you can choose which suites and tests run by turning them on or off.\n\nProof your data, get your results, and feel confident about your dataset.\n\n## Getting Started (Command Line)\n\n```sh\nnpm install -g dataproofer\n```\n\nRead the documentation\n\n```sh\ndataproofer --help\n\u003e  Usage: dataproofer \u003cfile\u003e\n\n  A proofreader for your data\n\n  Options:\n\n    -h, --help          output usage information\n    -V, --version       output the version number\n    -o, --out \u003cfile\u003e    file to output results. 
default stdout\n    -c, --core          run tests from the core suite\n    -i, --info          run tests from the info suite\n    -a, --stats         run tests from the statistical suite\n    -g, --geo           run tests from the geographic suite\n    -t, --tests \u003clist\u003e  comma-separated list to use\n    -j, --json          output JSON of test results\n    -J, --json-pretty   output an indented JSON of test results\n    -S, --summary       output overall test results, excluding pass/fail results\n    -v, --verbose       include descriptions about each column\n    -x, --exclude       exclude tests that passed\n\n  Examples:\n\n    $ dataproofer my_data.csv\n```\n\nRun a test\n\n```sh\ndataproofer data.csv\n```\n\nSave the results\n\n```sh\ndataproofer --json data.csv --out data.json\n```\n\nTo learn how to run specific test suites or tests, or to output longer or shorter summaries, use the `--help` flag.\n\nFound a bug? [Let us know](https://github.com/dataproofer/Dataproofer/issues/new).\n\n## Table of Contents\n\n- [Getting Started (Desktop)](https://github.com/dataproofer/Dataproofer#getting-started-desktop)\n- [Getting Started (Command Line)](https://github.com/dataproofer/Dataproofer#getting-started-command-line)\n- [Test Suites](https://github.com/dataproofer/Dataproofer#test-suites)\n  - [Info](https://github.com/dataproofer/Dataproofer#information--diagnostics)\n  - [Core](https://github.com/dataproofer/Dataproofer#core-suite)\n  - [Geo](https://github.com/dataproofer/Dataproofer#geo-suite)\n  - [Stats](https://github.com/dataproofer/Dataproofer#stats-suite)\n- [Development](https://github.com/dataproofer/Dataproofer#development)\n  - [How You Can Help](https://github.com/dataproofer/Dataproofer#how-you-can-help)\n  - [Modifying a test suite](https://github.com/dataproofer/Dataproofer#modifying-a-test-suite)\n  - [Create a new test](https://github.com/dataproofer/Dataproofer#creating-a-new-test)\n    - 
[name](https://github.com/dataproofer/Dataproofer#name)\n    - [description](https://github.com/dataproofer/Dataproofer#description)\n    - [methodology](https://github.com/dataproofer/Dataproofer#methodology)\n    - [helper scripts](https://github.com/dataproofer/Dataproofer#helper-scripts)\n  - [Troubleshooting](https://github.com/dataproofer/Dataproofer#troubleshooting-a-test-that-wont-run)\n  - [Test iteration](https://github.com/dataproofer/Dataproofer#iterating-on-tests)\n  - [Packaging the Desktop App](https://github.com/dataproofer/Dataproofer#packaging-an-executable)\n  - [Releasing new versions](https://github.com/dataproofer/Dataproofer#release)\n- [Sources](https://github.com/dataproofer/Dataproofer#sources)\n- [Thank You](https://github.com/dataproofer/Dataproofer#thank-you)\n\n![\"\"](http://i.imgur.com/3YekdjW.png)\n\n## Test Suites\n\n### [Information \u0026 Diagnostics](https://github.com/dataproofer/Dataproofer/tree/dev/packages/info-suite)\n\nA set of tests that infer descriptive information based on the contents of a table's cells.\n\n- Check for numeric values in columns\n- Check for strings in columns\n\n### [Core Suite](https://github.com/dataproofer/Dataproofer/tree/dev/packages/core-suite)\n\nA set of tests related to common problems and data checks — namely, making sure data has not been truncated by looking for specific cut-off indicators.\n\n- Check for duplicate rows\n- Check for empty columns (no values)\n- Check for special, non-typical Latin characters/letters in strings\n- Check for **big integer** cut-offs as defined by MySQL and PostgreSQL, common database programs\n- Check for **integer** cut-offs as defined by MySQL and PostgreSQL, common database programs\n- Check for **small integer** cut-offs as defined by MySQL and PostgreSQL, common database programs\n- Check for whether there are exactly 65k rows — an indication there may be missing rows lost when the\n  data was exported from a database\n- Check for strings that are 
exactly 255 characters — an indication there may be missing data lost when the data was exported from MySQL\n\n### [Geo Suite](https://github.com/dataproofer/Dataproofer/tree/dev/packages/geo-suite)\n\nA set of tests related to common geographic data problems.\n\n- Check for invalid latitude and longitude values (values outside the range of -180º to 180º)\n- Check for void latitude and longitude values (values at 0º,0º)\n\n### [Stats Suite](https://github.com/dataproofer/Dataproofer/tree/dev/packages/stats-suite)\n\nA set of tests related to common statistical methods used to detect outlying data.\n\n- Check for outliers within a column relative to the column's median\n- Check for outliers within a column relative to the column's mean\n\n![\"\"](http://i.imgur.com/3YekdjW.png)\n\n## Development\n\n```sh\ngit clone https://github.com/dataproofer/Dataproofer.git\ncd Dataproofer\nyarn\n```\n\n### How You Can Help\n\n#### Write a test\n\nSee our [test to-do list](https://github.com/dataproofer/Dataproofer/issues?q=is%3Aissue+is%3Aopen+label%3Atest) and leave a comment\n\n#### Add a feature\n\nSee our [features list](https://github.com/dataproofer/Dataproofer/issues?utf8=✓\u0026q=is%3Aissue+is%3Aopen+-label%3Atest+) and leave a comment\n\n#### Short on time?\n\nSee our [smaller issues](https://github.com/dataproofer/Dataproofer/issues?q=is%3Aopen+is%3Aissue+label%3Asmall) and leave a comment\n\n#### Got more time?\n\nSee our [medium-sized issues](https://github.com/dataproofer/Dataproofer/issues?utf8=%E2%9C%93\u0026q=is%3Aopen+is%3Aissue+label%3Amedium) and leave a comment\n\n#### Plenty of time?\n\nSee our [larger issues](https://github.com/dataproofer/Dataproofer/issues?utf8=%E2%9C%93\u0026q=is%3Aopen+is%3Aissue+label%3Alarge) and leave a comment\n\n![](http://i.imgur.com/3YekdjW.png)\n\n### Creating a new test\n\n- Make a copy of the [basic test template](https://github.com/dataproofer/suite-template/blob/master/src/myTest.js)\n- Read the comments and follow along with 
links\n- Let us know if you run into trouble: dataproofer [at] dataproofer.org\n- `require` that test in a suite's [index.js](https://github.com/dataproofer/suite-template/blob/master/index.js)\n- Add that test to the `exports` in [index.js](https://github.com/dataproofer/suite-template/blob/master/index.js)\n\nTests are made up of a few parts. Here's a brief overview. For a more in-depth look, dive into the [documentation](https://github.com/dataproofer/Dataproofer/tree/dev/packages/dataproofertest-js).\n\n#### .name()\n\nThis is the name of your test. It shows up in the test-selection screen as well as on the results page.\n\n#### .description()\n\nThis is a text-only description of what the test does, and what it is meant to check. Imagine you are explaining it to a remarkably intelligent 5-year-old.\n\n#### .methodology()\n\nThis is where the code your test executes lives. Pass it a function that takes in **rows** and **columnHeads**.\n\n**rows** is an array of objects from the data. Each object uses the column headers as keys and the row’s cell values as values.\n\nSo if your data looks like this:\n\n| President         | Year |\n| ----------------- | ---- |\n| George Washington | 1789 |\n| John Adams        | 1797 |\n| Thomas Jefferson  | 1801 |\n\nThen the first object in your array of rows will look like this:\n\n```json\n{ \"President\": \"George Washington\", \"Year\": \"1789\" }\n```\n\nand so on.\n\nGenerally, to run a test, you are going to want to loop over each row and do some operations on it — counting cells and using conditionals to detect unwanted values.\n\n#### Helper Scripts\n\nHelper scripts help you test and display the results of Dataproofer tests. 
These are a small set of functions we've found ourselves reusing.\n\n- isEmpty: detect if a cell is empty\n- isNumeric: detect if a cell contains a number\n- stripNumeric: remove number formatting like \"$\" or \"%\"\n- percent: return a number with a \"%\" sign\n\nFor more information, please see the full `util` [documentation](https://github.com/dataproofer/dataproofertest-js/blob/master/DOCUMENTATION.md#util)\n\n![\"\"](http://i.imgur.com/3YekdjW.png)\n\n### Troubleshooting a test that won't run\n\nTests are run inside a try/catch block in `src/processing.js`. You may wish to temporarily remove the try/catch while iterating on a test.\nOtherwise, for now we recommend heavy doses of `console.log` and the Chrome debugger.\n\n### Iterating on tests\n\nDataproofer saves a copy of the most recently loaded file in the Application Data directory provided to it by the OS.\nYou can quickly load the file and run the tests by typing `loadLastFile()` in the console. This saves you several clicks for loading the file and clicking the run button while you are iterating on a test.\nIf you want to temporarily avoid any clicks, you can add the function call to the `ipc.on(\"last-file-selected\",` event handler in `electron/js/controller.js`.\n\n### Release a new version\n\nWe can push releases to GitHub manually for now:\n\n```sh\ngit tag -a 'v0.1.1' -m \"first release\"\ngit push \u0026\u0026 git push --tags\n```\n\nThe binary (Dataproofer.app) should be zipped up first (right-click and choose \"Compress Dataproofer\") and can then be uploaded to the [releases page](https://github.com/dataproofer/Dataproofer/releases) for the tag you pushed.\n\n![\"\"](http://i.imgur.com/3YekdjW.png)\n\n## Sources\n\n- [A Guide to Bulletproofing Your Data](https://github.com/propublica/guides/blob/master/data-bulletproofing.md), by [ProPublica](https://www.propublica.org/)\n- [Checklist to bulletproof your data work](http://www.tommeagher.com/blog/2012/06/checklist.html), by [Tom 
Meagher](http://www.tommeagher.com/blog/2012/06/checklist.html) (Data Editor, [The Marshall Project](https://www.themarshallproject.org))\n- [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), by [Chris Groskopf](https://github.com/onyxfish) (Things Reporter, [Quartz](http://qz.com))\n- [OpenNewsLabs Data Smells](https://github.com/OpenNewsLabs/datasmells), by [Aurelia Moser](https://github.com/auremoser) ([Mozilla Science Lab](https://www.mozillascience.org/))\n- SRCCON panel notes\n  - [Handguns and tequila: Avoiding data disasters](https://old.etherpad-mozilla.org/MmSOTIOIDg)\n  - [How NOT to Skew with Statistics](https://old.etherpad-mozilla.org/bOwBSAeLe5)\n\n## Thank You\n\n![vocativ-logo](https://cloud.githubusercontent.com/assets/1578563/14050100/e23d531e-f276-11e5-920a-b882eca5933a.png)\u003cbr\u003e\n![knight-logo](https://cloud.githubusercontent.com/assets/1578563/14050167/4b12f330-f277-11e5-9773-1f69b9c2484f.png)\n\nA huge thank you to [Vocativ](http://vocativ.com) and the [Knight Foundation](http://knightfoundation.org/). 
This project was funded in part by the Knight Foundation's [Prototype Fund](http://knightfoundation.org/funding-initiatives/knight-prototype-fund/).\n\n### Special Thanks\n\n- Alex Koppelman (interviewee), Editorial Director @ Vocativ\n- Allee Manning (interviewee), Data Reporter @ Vocativ\n- Allegra Denton (design consulting), Designer @ Vocativ\n- Brian Byrne (interviewee), Data Reporter @ Vocativ\n- Daniel Littlewood (video producer), Special Projects Producer @ Vocativ\n- EJ Fox (project lead), Dataviz Editor @ Vocativ\n- Gerald Rich (lead developer), Interactive Producer @ Vocativ\n- Ian Johnson (lead developer), Dataproofer\n- Jason Das (UX and design), Dataproofer\n- Joe Presser (video producer), Dataproofer\n- Julia Kastner (concept \u0026 name consulting), Project Manager @ Vocativ\n- Kelli Vanover (design consulting), Product Manager @ Vocativ\n- Nicu Calcea (developer), Data Projects Editor @ GlobalData Media / New Statesman\n- Markham Nolan (interviewee), Visuals Editor @ Vocativ\n- Rob Di Ieso (design consulting), Art Director @ Vocativ\n\n... and the countless journalists who've encouraged us along the way. Thank you!\n","funding_links":["https://github.com/sponsors/dataproofer","https://opencollective.com/dataproofer"],"categories":["JavaScript","Data Wrangling","data-science","data validation"],"sub_categories":["Placeholder Content"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataproofer%2FDataproofer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataproofer%2FDataproofer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataproofer%2FDataproofer/lists"}