{"id":13568761,"url":"https://github.com/tokern/piicatcher","last_synced_at":"2025-04-12T22:35:50.938Z","repository":{"id":39581197,"uuid":"176927554","full_name":"tokern/piicatcher","owner":"tokern","description":"Scan databases and data warehouses for PII data. Tag tables and columns in data catalogs like Amundsen and Datahub","archived":false,"fork":false,"pushed_at":"2024-01-05T17:37:23.000Z","size":1448,"stargazers_count":306,"open_issues_count":25,"forks_count":100,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-04T03:05:16.690Z","etag":null,"topics":["aws-athena","aws-glue","aws-redshift","catalog","data","data-catalog","database","phi","pii","python","snowflake"],"latest_commit_sha":null,"homepage":"https://tokern.io/piicatcher/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tokern.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-03-21T11:03:02.000Z","updated_at":"2025-03-21T02:48:44.000Z","dependencies_parsed_at":"2023-02-10T20:30:42.192Z","dependency_job_id":"6507e9a2-eb78-4eeb-9869-92f1e302b714","html_url":"https://github.com/tokern/piicatcher","commit_stats":{"total_commits":256,"total_committers":12,"mean_commits":"21.333333333333332","dds":0.62890625,"last_synced_commit":"e1eab89843886e4b3c7cd01183fca6240d164745"},"previous_names":[],"tags_count":62,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokern%2Fpiicatcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokern%2Fpiicatcher/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokern%2Fpiicatcher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokern%2Fpiicatcher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tokern","download_url":"https://codeload.github.com/tokern/piicatcher/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248642573,"owners_count":21138351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-athena","aws-glue","aws-redshift","catalog","data","data-catalog","database","phi","pii","python","snowflake"],"created_at":"2024-08-01T14:00:31.441Z","updated_at":"2025-04-12T22:35:50.911Z","avatar_url":"https://github.com/tokern.png","language":"Python","funding_links":[],"categories":["Tools","Awesome Privacy Engineering [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)"],"sub_categories":["Tabular / structured","Data Deletion, Data Mapping, and Data Subject Access Requests"],"readme":"[![piicatcher](https://github.com/tokern/piicatcher/actions/workflows/ci.yml/badge.svg)](https://github.com/tokern/piicatcher/actions/workflows/ci.yml)\n[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)\n[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)\n[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)\n[![image](https://img.shields.io/docker/v/tokern/piicatcher)](https://hub.docker.com/r/tokern/piicatcher)\n\n# 
PII Catcher for Databases and Data Warehouses\n\n## Overview\n\nPIICatcher is a scanner for PII and PHI. It finds PII data in your databases and file systems\nand tracks critical data. PIICatcher uses two techniques to detect PII:\n\n* Match regular expressions against column names.\n* Match regular expressions and apply NLP libraries to sample data in columns.\n\nRead more in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/) on both these strategies.\n\nPIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as column data.\nFor example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [spaCy](https://spacy.io) to detect\nPII in column data.\n\nPIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans make it easy\nto schedule scans. It also provides options to include or exclude schemas and tables to manage compute resources.\n\nThere are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) which tag columns\nand tables as PII and with the type of PII.\n\n![PIIcatcher Screencast](https://tokern.io/static/piicatcher-2023-96c7c0d73e20427528633b4f0a0e25f4.gif)\n\n\n## Resources\n\n* [AWS Glue \u0026 Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production.\n* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/)\n\n## Quick Start\n\nPIICatcher is available as a docker image or command-line application.\n\n### Installation\n\nDocker:\n\n    alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'\n\n\nPyPI:\n\n    # Install development libraries for compiling dependencies.\n    # On Amazon Linux\n    sudo yum install mysql-devel gcc 
gcc-devel python-devel\n\n    python3 -m venv .env\n    source .env/bin/activate\n    pip install piicatcher\n\n    # Install Spacy plugin\n    pip install piicatcher_spacy\n\n\n### Command Line Usage\n    # add a sqlite source\n    piicatcher catalog add-sqlite --name sqldb --path '/db/sqldb/test.db'\n\n    # run piicatcher on a sqlite db and print report to console\n    piicatcher detect --source-name sqldb\n    ╭─────────────┬─────────────┬─────────────┬─────────────╮\n    │   schema    │    table    │   column    │   has_pii   │\n    ├─────────────┼─────────────┼─────────────┼─────────────┤\n    │        main │    full_pii │           a │           1 │\n    │        main │    full_pii │           b │           1 │\n    │        main │      no_pii │           a │           0 │\n    │        main │      no_pii │           b │           0 │\n    │        main │ partial_pii │           a │           1 │\n    │        main │ partial_pii │           b │           0 │\n    ╰─────────────┴─────────────┴─────────────┴─────────────╯\n\n\n### API Usage\nCode Snippet: \n```python3\nfrom dbcat.api import open_catalog, add_postgresql_source\nfrom piicatcher.api import scan_database\n\n# PIICatcher uses a catalog to store its state. 
\n# The easiest option is to use an in-memory sqlite database.\n# For production usage, see https://tokern.io/docs/data-catalog\ncatalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')\n\nwith catalog.managed_session:\n    # Add a postgresql source\n    source = add_postgresql_source(catalog=catalog, name=\"pg_db\", uri=\"127.0.0.1\", username=\"piiuser\",\n                                    password=\"p11secret\", database=\"piidb\")\n    output = scan_database(catalog=catalog, source=source)\n\nprint(output)\n\n# Example Output\n[\n    ['public', 'sample', 'gender', 'PiiTypes.GENDER'],\n    ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],\n    ['public', 'sample', 'lname', 'PiiTypes.PERSON'],\n    ['public', 'sample', 'fname', 'PiiTypes.PERSON'],\n    ['public', 'sample', 'address', 'PiiTypes.ADDRESS'],\n    ['public', 'sample', 'city', 'PiiTypes.ADDRESS'],\n    ['public', 'sample', 'state', 'PiiTypes.ADDRESS'], \n    ['public', 'sample', 'email', 'PiiTypes.EMAIL']\n]\n```\n\n## Plugins\n\nPIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:\n* Metadata\n* Data\n\nPlugins can be created for either of these two techniques. 
Plugins are then registered through an API or through\n[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/).\n\nTo create a new detector, create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py)\nor [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py).\n\nIn the new class, define a function `detect` that returns a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py).\nIf you are detecting a new PII type, you can define a new class that inherits from `PIIType`.\n\nFor detailed documentation, check the [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins).\n\n\n## Supported Databases\n\nPIICatcher supports the following databases:\n1. **Sqlite3** v3.24.0 or greater\n2. **MySQL** 5.6 or greater\n3. **PostgreSQL** 9.4 or greater\n4. **AWS Redshift**\n5. **AWS Athena**\n6. **Snowflake**\n7. **BigQuery**\n\n## Documentation\n\nFor advanced usage, refer to the [PIICatcher Documentation](https://tokern.io/docs/piicatcher).\n\n## Survey\n\nPlease take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher. \nThe responses will help to prioritize improvements to the project.\n\n## Stats Collection\nWe use cookies to analyse our traffic and feature usage.\nWe may share information about your use of our product for our social media and marketing purposes.\nThese cookies do not collect your sensitive or confidential information.\nTo opt out of these cookies, run:\n```bash\npiicatcher --disable-stats\n```\nTo enable them again:\n```bash\npiicatcher --enable-stats\n```\n\n## Contributing\n\nFor contribution guidelines, see the [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development). 
\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokern%2Fpiicatcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftokern%2Fpiicatcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokern%2Fpiicatcher/lists"}