{"id":13569103,"url":"https://github.com/elastic/anonymize-it","last_synced_at":"2025-05-09T00:07:49.008Z","repository":{"id":33967448,"uuid":"133690053","full_name":"elastic/anonymize-it","owner":"elastic","description":"a general utility for anonymizing data","archived":false,"fork":false,"pushed_at":"2024-08-08T12:45:57.000Z","size":2094,"stargazers_count":122,"open_issues_count":5,"forks_count":23,"subscribers_count":320,"default_branch":"main","last_synced_at":"2025-05-09T00:07:42.789Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elastic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-16T15:51:49.000Z","updated_at":"2025-03-12T07:01:48.000Z","dependencies_parsed_at":"2023-01-15T03:40:32.418Z","dependency_job_id":"c6e9a06e-4b93-4a4b-b9b8-5c817466ecc4","html_url":"https://github.com/elastic/anonymize-it","commit_stats":{"total_commits":116,"total_committers":10,"mean_commits":11.6,"dds":0.6896551724137931,"last_synced_commit":"303803a40be4ef3241d205cf072ec896bf20e253"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Fanonymize-it","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Fanonymize-it/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Fanonymize-it/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ela
stic%2Fanonymize-it/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elastic","download_url":"https://codeload.github.com/elastic/anonymize-it/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253166520,"owners_count":21864482,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:00:36.017Z","updated_at":"2025-05-09T00:07:48.959Z","avatar_url":"https://github.com/elastic.png","language":"Python","readme":"# anonymize-it\nA general utility for anonymizing data\n\n`anonymize-it` can be run as a script that accepts a config file specifying the source type, anonymization mappings, destination, and anonymizer pipeline. Individual pipeline components can also be imported into any Python program that needs to anonymize data. \n\nCurrently, `anonymize-it` supports two methods for anonymization: \n1) Faker-based: Relies on providers from [`Faker`](http://faker.readthedocs.io) to perform masking of fields. This method is suitable for one-off anonymization use cases, where correlation between data obtained from different sources (indices/clusters) is not necessary.\n\nE.g.:\n\n```\n\u003e\u003e\u003e from faker import Faker\n\u003e\u003e\u003e f = Faker()\n\u003e\u003e\u003e f.file_path()\n'/break/Congress.json'\n```\n2) Hash-based: Uses a unique user/customer ID as a salt to anonymize fields. This method is suitable when anonymization of data needs to be performed regularly and/or if correlation of data from different sources is crucial. 
\n\nE.g.: A user wants to anonymize network events and process events stored in two separate indices but wants to correlate all activity for a particular host even after anonymization.\n\n# Disclaimer\n\n`anonymize-it` is intended to serve as a tool to replace real data values with sensible artificial ones such that the semantics of the data are retained. It is not intended to be used for anonymization requirements of GDPR policies, but rather to aid pseudonymization efforts. There may also be some collisions in high cardinality datasets when using the Faker implementation.\n\n# Instructions for use\n\n## Installation\n\nThis must be run in a virtual environment with the correct dependencies installed. These are enumerated in `requirements.txt`.\n\n### Install `virtualenv` globally:\n\n```\n[sudo] pip install virtualenv\n```\n\nCreate a virtualenv and install the dependencies of `anonymize-it`\n```\nvirtualenv -p python3 venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\nand run:\n\n```\npython anonymize.py configs/config.json\n```\n\n## Quick Start\n\n`anonymize.py` is reproduced below to walk through a simple anonymization pipeline.\n\nFirst load and parse the config file.\n \n```python\nconfig_file = sys.argv[1]\nconfig = read_config(config_file) # opens json file and stores as python dict\nconfig = utils.parse_config(config) # utility function for parsing configuration and setting variables\n```\n\nThen, create the reader as defined in the configuration. `reader_mapping` is used as a dispatcher that maps human-readable reader types (e.g. elasticsearch) to reader classes (e.g. 
`ESReader()`).\n```python\nreader = reader_mapping[config.source['type']]\nreader = reader(config.source['params'], config.masked_fields, config.suppressed_fields)\n```\n\nNext, create the writer in the same way.\n```python\nwriter = writer_mapping[config.dest['type']]\nwriter = writer(config.dest['params'])\n```\n\nFinally, create an anonymizer by passing the reader and writer instances and run `anonymize()`.\n```python\nanon = Anonymizer(reader=reader, writer=writer)\nanon.anonymize()\n```\n\n### Creating your own anonymizer pipeline\n\nAn anonymizer requires a `reader` and a `writer`. Currently, only an elasticsearch reader `readers.ESReader()` and a filesystem writer `writers.FSWriter()` are provided.\n\n#### `readers`\n\nCreating an instance of a reader requires the following:\n\n* a `source` object, which contains parameters about the source. Please note that each reader class requires a different set of parameters. Please consult docstrings for specific parameters. \n* `masked_fields` which is a dictionary that contains field names that should be masked, along with the faker provider to be used for masking, if using the Faker-based anonymization. e.g.: `{\"user.name\": \"user_name\", \"user.email\": \"email\"}`\nIf using the hash-based implementation, `masked_fields` is simply a list of field names to be masked. 
e.g.: `[\"user.name\", \"user.email\"]`\n* `suppressed_fields` which is a list of fields that should NOT be included in anonymization.\n\n`masked_fields` is required on the reader since the reader is responsible for enumerating the distinct values for each field to be used as a lookup for masking values in the faker-based anonymization.\n\n`suppressed_fields` is required on the reader since we will explicitly exclude these from a search query.\n\nReaders must implement the following methods:\n* `get_data()`, which is responsible for returning data from the source and passing it to the anonymizer.\n* `create_mappings()` (only if using Faker-based anonymization), which is responsible for generating a dictionary to be used by the anonymizer object. The dictionary is structured as follows:\n    ```python\n    {\n      \"field.1\": {\n          \"val1.1\": None,\n          \"val1.2\": None,\n          ...,\n          \"val1.n\": None\n        },\n      \"field.2\": {\n          \"val2.1\": None,\n          \"val2.2\": None,\n          ...,\n          \"val2.m\": None\n        }\n    }\n    ``` \nwhere `field.1` and `field.2` are the fields to be anonymized and `val1.1`, `val1.2`, etc. are the distinct values for each field.\n\n#### `writers`\n\nCreating an instance of a writer requires the following:\n\n*  A `dest` object, which contains parameters about the destination. Please note that each writer class requires a different set of parameters. Please consult docstrings for specific parameters.\n\nWriters must implement the following methods:\n\n* `write_data()`, which sends anonymized data to the destination.\n\n## Run as Script\n\n\n#### `anonymizers`\n\n```\npython anonymize.py configs/config.json\n```\n\n`config.json` defines the work to be done. Please see the template file at `configs/config.json` for guidance:\n\n*  `source` defines the location of the original data to be anonymized along with the type of reader that should be invoked.\n   *  `source.type`: a reader type. 
one of:\n      * \"elasticsearch\"\n      * \"csv\" (TBD)\n      * \"json\" (TBD)\n   * `source.params`: parameters allowing access to the data, specific to the reader type.\n      * \"elasticsearch\":\n         * `host`\n         * `index`\n         * `use_ssl`\n         * `auth` (`native` optional)\n* `dest` defines the location where the data should be written back to\n    * `dest.type`: a writer type. one of:\n        * \"filesystem\"\n        * \"csv\" (TBD)\n        * \"elasticsearch\" (TBD)\n    * `dest.params`: parameters allowing writing of the data, specific to the writer type.\n       * \"json\":\n          * `directory` : directory to write output json files\n* `anonymization`: type of anonymization, i.e. `faker` or `hash`\n* `include`: the fields to mask along with the method for anonymization in case of faker-based anonymization. This is a dict with entries like `{\"field.name\":\"faker.provider.mask\"}`. Please see faker documentation for providers [here](http://faker.readthedocs.io/en/master/providers.html).\nFor hash-based anonymization, this can be a list of fields to be masked like `[\"field.name\"]`.\n* `exclude`: specific fields to exclude\n* `sensitive`: included fields (apart from the masked fields) that should not be completely replaced by a faker/hash substitute, but should be searched for sensitive information\n* `include_rest`: `{true|false}`. If true, all fields except excluded fields will be written. If false, only fields specified in `masks` will be written.\n\n### Important notes for Faker-based anonymization\n1) Set the `provider_map` class attribute for the `Anonymizer` class, which is a dict with entries like `{\"field.name\":self.faker.provider.mask}`. 
Refer to `anonymizers.py` for a test configuration of `provider_map`.\n2) If the fields being anonymized have high cardinality, set the `high_cardinality_fields` class attribute for the `Anonymizer` class, which is a dict with entries like `{\"field.name\": [self.faker.provider.mask(10) for _ in range(10)]}`.\n\n### Important notes for hash-based anonymization\n1) The user should have the `monitor` privilege for the Elastic environment in which the anonymization is run.\n2) If you are a Cloud user and want to perform hash-based anonymization, you'll need to create an API key in the Elasticsearch Service Console and provide it as input when prompted. To create an API key, follow the instructions [here](https://www.elastic.co/guide/en/cloud/current/ec-api-authentication.html).\n\nIn addition to the above settings, for more fine-grained control over the anonymization, you can also set the following class attributes for `Anonymizer`:\n1) `user_regexes`, which is a dict with entries like `{\"regex.name\": \"regex\"}`. These regexes are used to redact PII (apart from secrets, which are already taken care of) from the `sensitive` fields.\n2) `keywords`, which is a list like `[\"keyword1\", \"keyword2\"]`. Documents containing any of the keywords in any of the `sensitive` fields are dropped.\n\n# Adding Masks\n\nFor the faker-based anonymization, the anonymizer class only knows how to use providers that are enumerated in the `provider_map` class attribute. If you would like to add support for new faker providers, please add entries to this dict.\n\n# Adding Readers\n\nReaders can be added to `readers.py`: simply extend the base reader class and implement all abstract methods. Add a new entry to `reader_mapping`.\n\n# Adding Writers\n\nWriters can be added to `writers.py`: simply extend the base writer class and implement all abstract methods. 
Add a new entry to `writer_mapping`.\n\n# General Notes\nhttps://stackoverflow.com/questions/17486578/how-can-you-bundle-all-your-python-code-into-a-single-zip-file\n\n# Running Tests\n\nTo run the unit tests:\n1. Create a virtual environment and install the dependencies listed in `requirements.txt`\n2. Execute `py.test` from the top-level repository directory\n","funding_links":[],"categories":["Anonymisation","Awesome Privacy Engineering [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)"],"sub_categories":["De-Identification and Anonymization"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felastic%2Fanonymize-it","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felastic%2Fanonymize-it","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felastic%2Fanonymize-it/lists"}