{"id":27962338,"url":"https://github.com/thoughtworks/pii-anonymizer","last_synced_at":"2025-05-07T19:20:48.148Z","repository":{"id":64777622,"uuid":"430647284","full_name":"thoughtworks/pii-anonymizer","owner":"thoughtworks","description":"data anonymization project","archived":false,"fork":false,"pushed_at":"2023-01-04T04:53:21.000Z","size":362,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":15,"default_branch":"main","last_synced_at":"2024-04-16T07:16:42.705Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thoughtworks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-11-22T09:45:06.000Z","updated_at":"2023-11-30T21:06:50.000Z","dependencies_parsed_at":"2023-02-01T21:46:32.968Z","dependency_job_id":null,"html_url":"https://github.com/thoughtworks/pii-anonymizer","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thoughtworks%2Fpii-anonymizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thoughtworks%2Fpii-anonymizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thoughtworks%2Fpii-anonymizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thoughtworks%2Fpii-anonymizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thoughtworks","download_url":"https://codeload.github.com/thoughtworks/pii-anonymizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":252941339,"owners_count":21828858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-07T19:20:46.710Z","updated_at":"2025-05-07T19:20:48.133Z","avatar_url":"https://github.com/thoughtworks.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Protection Framework\nData Protection Framework is a Python library/command line application for identification, anonymization and de-anonymization of Personally Identifiable Information (PII).\n\nThe framework aims to work on a two-fold principle for detecting PII:\n1. Using regular expressions to match known patterns\n2. Using NLP for Named Entity Recognition (NER)\n\n## Common Usage\n1. `pip install pii-anonymizer`\n2. Specify configs in `pii-anonymizer.json`\n3. Choose whether to run in standalone or spark mode with `python -m pii_anonymizer.standalone` or `python -m pii_anonymizer.spark`\n\n## Features and Current Status\n\n### Completed\n * Following global detectors have been completed:\n   * [x] EMAIL_ADDRESS : An email address identifies the mailbox that emails are sent to or from. The maximum length of the domain name is 255 characters, and the maximum length of the local-part is 64 characters.\n   * [x] CREDIT_CARD_NUMBER : A credit card number is 12 to 19 digits long. 
They are used for payment transactions globally.\n\n * Following detectors specific to Singapore have been completed:\n   * [x] PHONE_NUMBER : A telephone number.\n   * [x] FIN/NRIC : A unique set of nine alpha-numeric characters on the Singapore National Registration Identity Card.\n   * [x] THAI_ID : 13 numeric digits of Thai Citizen ID\n\n * Following anonymizers have been added:\n    * [x] Replacement ('replace'): Replaces a detected sensitive value with a specified surrogate value. Leave the value empty to simply delete the detected sensitive value.\n    * [x] Hash ('hash'): Hashes the detected sensitive value with SHA-256.\n    * [x] Encryption ('encrypt'): Encrypts the original sensitive data value using Fernet (AES-based).\n\nCurrently supported file formats: `csv, parquet`\n\n## Encryption\nTo use encryption as the anonymize mode, a compatible encryption key needs to be created and assigned to the `PII_SECRET` environment variable. A compatible key can be generated with\n\n`python -m pii_anonymizer.key`\n\nThis will generate output similar to\n```\nKeep this encrypt key safe\n81AOjk7NV66O62QpnFsvCXH8BDB26KM9TIH7pBfZ6PQ=\n```\nTo set this key as an environment variable, run\n\n`export PII_SECRET=81AOjk7NV66O62QpnFsvCXH8BDB26KM9TIH7pBfZ6PQ=`\n### TO-DO\nFollowing features are part of the backlog, with more features coming soon\n * Detectors:\n    * [ ] NAME\n    * [ ] ADDRESS\n * Anonymizers:\n    * [ ] Masking: Replaces a number of characters of a sensitive value with a specified surrogate character, such as a hash (#) or asterisk (*).\n    * [ ] Bucketing: \"Generalizes\" a sensitive value by replacing it with a range of values. (For example, replacing a specific age with an age range,\n    or temperatures with ranges corresponding to \"Hot,\" \"Medium,\" and \"Cold.\")\n\n\nYou can have a detailed look at upcoming features and the backlog in this [GitHub Board](https://github.com/thoughtworks-datakind/anonymizer/projects/1?fullscreen=true)\n\n## Development setup\n1. 
Install [Poetry](https://python-poetry.org/docs/#installing-with-the-official-installer)\n2. Set up hooks and install packages with `make install`\n\n### Config JSON\nLimitation: when reading multiple files, all files that match the `file_path` must have the same headers. Additionally, when the file format is not given, the anonymizer assumes the format of the first matched filename. Thus, when the `file_path` ends with `/*` and the folder contains mixed file formats, the operation will fail.\n\nAn example of the config JSON is located at `\u003cPROJECT_ROOT\u003e/pii-anonymizer.json`\n```\n{\n  \"acquire\": {\n    \"file_path\": \u003cFILE PATH TO YOUR INPUT CSV\u003e, -\u003e ./input_data/file.csv or ./input_data/*.csv to read all files that match\n    \"delimiter\": \u003cYOUR CSV DELIMITER\u003e\n  },\n  \"analyze\": {\n    \"exclude\": ['Exception']\n  },\n  \"anonymize\": {\n    \"mode\": \u003creplace|hash|encrypt\u003e,\n    \"value\": \"string to replace\",\n    \"output_file_path\" : \u003cPATH TO YOUR CSV OUTPUT FOLDER\u003e,\n    \"output_file_format\": \u003ccsv|parquet\u003e,\n    \"output_file_name\": \"anonymized\" -\u003e optionally, specify the output filename.\n  },\n  \"report\" : {\n    \"location\" : \u003cPATH TO YOUR REPORT OUTPUT FOLDER\u003e,\n    \"level\" : \u003cLOG LEVEL\u003e\n  }\n}\n```\n\n### Running Tests\nYou can run the tests by running `make test` or by triggering the shell script located at `\u003cPROJECT_ROOT\u003e/bin/run_tests.sh`\n\n### Trying out on local\n\n##### Anonymizing a delimited csv file\n1. 
Set up a JSON config file similar to the one seen at the project root.\nIn the 'acquire' section of the JSON, populate the input file path and the delimiter.\nIn the 'report' section, provide the output path where you want the PII detection report to be generated.\nA 'high' level report just calls out which columns have PII attributes.\nA 'medium' level report calls out the percentage of PII in each column and the associated PII type (email, credit card, etc.).\n2. Run the main class - `python -m pii_anonymizer.standalone --config \u003coptional path of the config file; leave blank to default to pii-anonymizer.json\u003e`\nYou should see the report being appended to the file named 'report_\\\u003cdate\\\u003e.log' in the output path specified in the\nconfig file.\n\n### Packaging\nRun `poetry build` and the `.whl` file will be created in the `dist` folder.\n\n### Licensing\nDistributed under the MIT license. See ``LICENSE`` for more information.\n\n### Contributing\n\nYou want to help out? _Awesome_!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthoughtworks%2Fpii-anonymizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthoughtworks%2Fpii-anonymizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthoughtworks%2Fpii-anonymizer/lists"}