
# CSV File pseudonomizer

This tool replaces personal data in large CSV files with pseudonymous dummy data.

## Structure

- `generator` - contains the code for the dummy data generator
- `model` - data classes
- `processors` - main processing logic
- `pseudonomizer` - methods to pseudonymize the data
- `utils` - utility methods

## Installation

### Prerequisites
#### CentOS
- GCC
- Python 3.9
- Pip

#### Windows
- Visual C++ Build Tools (https://visualstudio.microsoft.com/de/visual-cpp-build-tools/)
- Python 3.9
- Pip

### Application checkout and preparation
```bash
git clone https://github.com/Ragin-LundF/csv_pseudonomizer.git
cd csv_pseudonomizer
pip install -r requirements.txt
```
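
Optionally (standard Python practice, not a project requirement), install into a virtual environment to keep the dependencies isolated:

```bash
python3.9 -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```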

## Configuration

### General configuration
To configure the tool, please have a look at [config.py](config.py).
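
The settings referenced throughout this README live there: the locale for name generation, the number of cores to use, the default file names, and the chunk template `split_file_template_trailing`. As a hypothetical sketch (apart from `split_file_template_trailing`, every key name below is an illustrative assumption, not the actual config):

```python
# Hypothetical sketch of config.py contents. Apart from
# split_file_template_trailing, which this README references, all key
# names are illustrative guesses; the real config.py is authoritative.
locale = "de_DE"               # locale used by --gen_firstnames / --gen_lastnames
cores = 8                      # number of worker processes
dummy_file_name = "dummy.csv"  # file written by the -d option
split_file_template_trailing = "_chunk_%s.csv"  # trailing pattern of split chunks
```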

### Name lists
To replace names in the CSV, the tool requires lists of the first and last names that should be replaced:
- [firstnames.txt](pseudonomizer/rules/firstnames.txt)
- [lastnames.txt](pseudonomizer/rules/lastnames.txt)
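
Both lists are plain-text files; the assumed format (please verify against the shipped files) is one name per line, for example:

```text
Anna
Bernd
Claudia
```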

### Company Regexes
To keep company names unchanged, you can add regular expressions that detect companies.
Companies are kept because they are not subject to the GDPR.
The regexes can be found under:
- [company_regexes.py](pseudonomizer/rules/company_regexes.py)
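
As a hedged illustration (the actual patterns shipped in the repository may differ), such a rules file can hold a list of compiled patterns matching common legal-form suffixes, plus a helper to test a value against them:

```python
# Illustrative sketch of what company_regexes.py could contain -- the real
# patterns in the repository may differ. Values matching any pattern are
# treated as company names and kept as-is.
import re

COMPANY_REGEXES = [
    re.compile(r"\b\w[\w& .-]*\s(GmbH|AG|KG|SE)\b"),      # German legal forms
    re.compile(r"\b\w[\w& .-]*\s(Inc\.?|LLC|Ltd\.?)\b"),  # English legal forms
]

def is_company(value: str) -> bool:
    """Return True if the value looks like a company name."""
    return any(pattern.search(value) for pattern in COMPANY_REGEXES)
```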

## Usage

| Parameter | Description | Example |
| --- | --- | --- |
| none | Print out help | n/a |
| `-d` | Create a dummy file for testing (see `config.py` for file names) | `-d` |
| `--dummy` | | `--dummy` |
| `-i` | Specify the input file for processing | `-i dummy.csv` |
| `--inputfile=` | | `--inputfile=dummy.csv` |
| `-s` | Split the CSV file into new files. The value defines the number of lines per output file. | `-s 5000` |
| `--split=` | | `--split=5000` |
| `-a` | Append split files into one output file. It detects the files that were split by the `-s` parameter. Both the input and the output file must be set; see the split-file processing examples below for more information. | `-a` |
| `-o` | Specify the output file for processing | `-o dummy_processed.csv` |
| `--outputfile=` | | `--outputfile=dummy_processed.csv` |
| `--gen_firstnames` | Generates a new list of first names in the `pseudonomizer/rules/firstnames.txt` file, depending on the locale in `config.py`. | `--gen_firstnames` |
| `--gen_lastnames` | Generates a new list of last names in the `pseudonomizer/rules/lastnames.txt` file, depending on the locale in `config.py`. | `--gen_lastnames` |

### Create a dummy file for testing

```bash
python main.py -d
```

### Split a file into chunks

#### Splitting
Short:
```bash
python main.py -i <inputfile> -s <lines>
```

Long:
```bash
python main.py --inputfile=<inputfile> --split=<lines>
```

Example:
```bash
python main.py -i dummy.csv -s 5000
```
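
Conceptually, splitting streams the input and starts a new chunk file every N data lines, repeating the header in each chunk. A minimal sketch of that idea (not the tool's actual implementation; the `_chunk_` naming follows the examples further below):

```python
# Minimal splitting sketch -- not the tool's actual code. Streaming the
# input line by line avoids loading the whole file into RAM.
def split_csv(input_file: str, lines_per_chunk: int) -> None:
    base = input_file.rsplit(".csv", 1)[0]
    with open(input_file, encoding="utf-8") as src:
        header = src.readline()  # repeat the header in every chunk
        out, chunk_no = None, 0
        for line_no, line in enumerate(src):
            if line_no % lines_per_chunk == 0:
                if out:
                    out.close()
                chunk_no += 1
                out = open(f"{base}_chunk_{chunk_no:04d}.csv", "w", encoding="utf-8")
                out.write(header)
            out.write(line)
        if out:
            out.close()

split_csv("dummy.csv", 5000)
```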

### Process a file

Short:
```bash
python main.py -i <inputfile> -o <outputfile>
```

Long:
```bash
python main.py --inputfile=<inputfile> --outputfile=<outputfile>
```

Example:
```bash
python main.py -i dummy.csv -o dummy_processed.csv
```
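
Under the hood, processing means reading the CSV row by row and replacing every value that appears in the name lists with a stable pseudonym. The following is a heavily simplified sketch of that idea, not the actual code in `processors`/`pseudonomizer` (the hashing scheme is an illustrative choice):

```python
# Simplified pseudonymization sketch -- the real logic in `processors` and
# `pseudonomizer` is more elaborate. The same input always yields the same
# pseudonym, so relationships between rows stay intact.
import csv
import hashlib

def load_names(path: str) -> set:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

NAMES = load_names("pseudonomizer/rules/firstnames.txt") | \
        load_names("pseudonomizer/rules/lastnames.txt")

def pseudonym(value: str) -> str:
    return "p_" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:10]

with open("dummy.csv", newline="", encoding="utf-8") as src, \
        open("dummy_processed.csv", "w", newline="", encoding="utf-8") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    for row in reader:
        writer.writerow([pseudonym(v) if v in NAMES else v for v in row])
```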

### Process a set of split files

Short:
```bash
python main.py -i <inputfile> -o <outputfile> -a
```

Long:
```bash
python main.py --inputfile=<inputfile> --outputfile=<outputfile> -a
```

Example:
```bash
python main.py -i dummy.csv -o dummy_processed.csv -a
```
This detects all `dummy_chunk_*.csv` files and processes them into one big output file.
The `-i` parameter defines the base file name here.
`-a` takes the config value `split_file_template_trailing` and replaces its `%s` with an asterisk to find all related files.
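
In other words, chunk detection boils down to a glob pattern built from that template. A small sketch of the mechanism (the template value `"_chunk_%s.csv"` is an illustrative guess, not the confirmed config value):

```python
# Sketch of how -a can locate chunk files. The template value below is an
# illustrative guess; the real one comes from config.py.
import glob

split_file_template_trailing = "_chunk_%s.csv"
base = "dummy"  # derived from the -i parameter, extension stripped

pattern = base + (split_file_template_trailing % "*")  # -> "dummy_chunk_*.csv"
chunk_files = sorted(glob.glob(pattern))
print(chunk_files)  # e.g. ['dummy_chunk_0001.csv', 'dummy_chunk_0002.csv']
```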

If you have multiple files which were not split with this tool, please name them as:
`<name>_chunk_<number>.csv`.

Example:
- `dummy_chunk_0001.csv`
- `dummy_chunk_0002.csv`
- `dummy_chunk_0003.csv`

### Generate new name lists

Firstnames:
```bash
python main.py --gen_firstnames
```

Lastnames:
```bash
python main.py --gen_lastnames
```
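
One plausible way such a generator can work (the actual `generator` module may take a different approach) is to draw names from the Faker library for the configured locale:

```python
# Hedged sketch of a locale-aware name-list generator -- the repository's
# `generator` module may work entirely differently.
from faker import Faker

def generate_firstnames(locale: str, count: int, path: str) -> None:
    fake = Faker(locale)
    names = {fake.first_name() for _ in range(count)}  # set deduplicates
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(names)))

generate_firstnames("de_DE", 1000, "pseudonomizer/rules/firstnames.txt")
```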

## Performance

### Create Dummy Data
Dummy data creation is a feature for generating test samples.
It uses the number of cores defined in the configuration and first builds all data in RAM.
Creating very large datasets therefore requires machines with a large amount of RAM.
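
As a rough illustration of why RAM becomes the bottleneck (a sketch of the pattern described above, not the tool's actual generator code), a worker pool that materializes all records in memory before writing looks like this:

```python
# Illustrative sketch: generate records on several cores and collect
# everything in RAM before writing -- this mirrors the memory behaviour
# described above, not the tool's actual generator.
from multiprocessing import Pool

def make_record(i: int) -> str:
    return f"{i},Firstname{i},Lastname{i},Company{i} GmbH"

if __name__ == "__main__":
    with Pool(processes=8) as pool:                        # "defined cores"
        records = pool.map(make_record, range(1_000_000))  # all rows in RAM
    with open("dummy.csv", "w", encoding="utf-8") as f:
        f.write("\n".join(records) + "\n")
```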

To create 50,000,000 records, a minimum of 50 GB of RAM is recommended.
A regular notebook with 32 GB of RAM can create 1,000,000 records in ~7 minutes (tested on an AMD Ryzen 4700X).

Processing the data is a bit faster.
In the default configuration (as-is), the tool can process, on an AMD Ryzen 4700X with 32 GB of RAM:
- 1,000,000 records in ~2.5 minutes
- 10,000,000 records in ~27 minutes

Performance scales roughly linearly: to process 100,000,000 records, you should expect ~270 minutes.

If new rules are added, try to avoid regular expressions, because they are extremely slow.
Each additional rule also decreases performance.