https://github.com/yevh/anonymizer
Anonymize sensitive data in your datasets.
https://github.com/yevh/anonymizer
anonymize anonymized anonymizer crypto cryptography data-anonymization data-anonymized data-science data-security dataset datasets datasets-csv datasets-preparation python python3 security sensitive sensitive-data
Last synced: 7 months ago
JSON representation
Anonymize sensitive data in your datasets.
- Host: GitHub
- URL: https://github.com/yevh/anonymizer
- Owner: yevh
- Created: 2023-08-16T12:46:25.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-08-17T05:06:10.000Z (over 2 years ago)
- Last Synced: 2024-12-17T01:48:01.062Z (about 1 year ago)
- Topics: anonymize, anonymized, anonymizer, crypto, cryptography, data-anonymization, data-anonymized, data-science, data-security, dataset, datasets, datasets-csv, datasets-preparation, python, python3, security, sensitive, sensitive-data
- Language: Python
- Homepage:
- Size: 1.16 MB
- Stars: 11
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Data Anonymizer
Anonymize sensitive data in your datasets. It generates pseudonyms for specified columns in a CSV file using a salted SHA-256 hashing function. Integrity checks with HMAC. The script allows for reverting the data back to its original form using encrypted mapping files that are generated during the anonymization process.
## How It Works
- Load or Generate Secret Key (if not exist)
- Key should be base64-encoded 32 bytes (256 bits) long
- Process Input Data File
- Data Pseudonymization or Reversion
- Encrypted Mapping Files
- During the ```anonymize``` operation, for each specified column, the script creates an encrypted file that maps the pseudonyms back to the original data.
- These mapping files are encrypted using the Fernet symmetric encryption scheme, and an HMAC is appended to ensure data integrity.
- Data Integrity
- When reverting data, the script first checks the integrity of the encrypted mapping files by comparing a stored HMAC with a computed HMAC.
- Output
## Install dependencies
```bash
pip3 install -r requirements.txt
```
## How to Use
```bash
python3 anonymizer.py file_path operation --cols column_names --key_path secret_key_path
```
- **file_path:** Path to the data file (CSV format)
- **operation:** ```anonymize``` or ```revert```
- **--cols:** Specific columns to anonymize or revert (all columns by default)
- **--key_path:** Path to the secret key file (required)
## Usage Example
1. Generate a data example
```bash
python3 data.py
```

3. Anonymize
```bash
python3 anonymizer.py data.csv anonymize --key_path secret_key.key
```

3. Revert
```bash
python3 anonymizer.py data.csv revert --key_path secret_key.key
```

## Security
- **Secret Key Storage** - ensure the secret key file is stored securely. If compromised, an attacker could decrypt the pseudonym mappings and de-anonymize the data.
- **Encrypted Mapping Files** - ensure that these files are stored in a secure location with restricted access. Access to these files and the secret key allows data de-anonymization.