https://github.com/openraven/mockingbird
A toolset to test data classification engines that generates mock data in various file formats, sizes and data profiles.
https://github.com/openraven/mockingbird
data-classification synthetic-data-generation
Last synced: 21 days ago
JSON representation
A toolset to test data classification engines that generates mock data in various file formats, sizes and data profiles.
- Host: GitHub
- URL: https://github.com/openraven/mockingbird
- Owner: openraven
- License: apache-2.0
- Created: 2021-01-22T23:49:59.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2024-01-12T20:36:14.000Z (about 2 years ago)
- Last Synced: 2025-09-25T04:56:18.399Z (5 months ago)
- Topics: data-classification, synthetic-data-generation
- Language: Python
- Homepage:
- Size: 564 KB
- Stars: 44
- Watchers: 6
- Forks: 6
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
README
# Mockingbird: Generate mock documents for data classification
## About
Mockingbird is a Python library for generating mock documents in various formats. It accepts user-defined data, and embeds it into documents generated in many different formats. Developers can use Mockingbird to quickly generate datasets, with particular use for validating the efficacy of a data classification software.
## Installation
The easiest way to install Mockingbird is by using `pip`:
`pip install mockingbird`
For local development, clone the repository and run `pip install .`
## Getting Started
Mockingbird can run as a functional Python library or as a CLI.
### CLI Usage
Once installed with pip, unix-like systems can use the command `mockingbird_cli --h` to access Mockingbird's
command line interface. Some sample CLI calls are:
```
mockingbird_cli --type dry -o ./output/dry_test/
mockingbird_cli --type csv -i ./samples/csv_sample.csv -o ./output/csv/
mockingbird_cli --type csv_curl -i -o ./output/csv_curl/
mockingbird_cli --type mockaroo -i ./samples/sample_schema.json --mockaroo_api -o ./output/mockaroo
```
### As a Python Library
#### Starting from Code
Mockingbird functions as a fully functional Python library. A basic example generating documents using
mock-data is demonstrated below. In this example, key-value pairs are inserted as strings mapping to a list of strings.
```
from mockingbird import Mockingbird
# Spawn a new Mockingbird session
fab = Mockingbird()
# Set which file extensions to output
fab.set_file_extensions(["html", "docx", "yaml", "xlsx", "odt"])
# Input the data we want to test / inject into the documents
fab.add_sensitive_data(keyword="ssn", entries=["000-000-0000", "999-999-9999"])
fab.add_sensitive_data(keyword="dob", entries=["01/01/1991", "02/02/1992"])
# Generate and save the fabricated documents
fab.save(save_path="./output_basic/")
fab.dump_meta_data(output_file="./output_basic/meta_data.json")
```
#### Starting from CSV
Mockingbird can be started using a CSV file, treating the column headers as keywords, and the remaining rows as entries.
The CSV's are expected to be structured as the following,
```
FILE: mockingbird_data.csv
ssn, dob
000-000-000, 01/01/1991
999-999-999, 02/02/1992
```
```
from mockingbird.mb_wrappers import MockingbirdFromCSV
# This effectively loads files from the csv and generates a session using each column
fab = MockingbirdFromCSV("csv_sample.csv")
fab.set_all_extensions()
fab.save(save_path="./output_csv/")
fab.dump_meta_data(output_file="./output_csv/meta_data.json")
```
Optionally, multiple keywords can be defined in the CSV header file, which Mockingbird will split up into separate
keywords. For example, rather than just testing the keyword ```ssn```, we can test ```ssn``` and ```social security number```.
Multiple keywords can be defined in the CSV file by using `;` as a delimiter.
For example,
```
FILE: mockingbird_data.csv
ssn;social security number,dob;date of birth;birth
000-000-000, 01/01/1991
999-999-999, 02/02/1992
```
This will generate documents for each keyword in each column header.
#### Starting Using Mockaroo
Using a Mockaroo API key, we can request mocked data using json requests from Mockaroo's servers. Currently, the request has to be saved to
a json file on disk, and loaded during runtime. More documentation can be found at [Mockaroo's Website](https://www.mockaroo.com/api/docs), but below
is a json-example.
```
FILE: mockaroo_request.json
[
{
"name": "ssn;social security;social",
"type": "SSN"
},
{
"name": "cc;credit card",
"type": "Credit Card #"
},
{
"name": "phone;phone-number;number",
"type": "Phone"
},
{
"name": "name;fullname;full name",
"type": "Full Name"
}
]
```
In code, Mockingbird can use this request as a json-payload,
```
import json
from mockingbird.mb_wrappers import MockingbirdFromMockaroo
with open("mockaroo_request.json") as json_file:
schema_request = json.load(json_file)
fab = MockingbirdFromMockaroo(api_key="MOCKAROO_API_KEY", schema_request=schema_request)
fab.set_all_extensions()
fab.save(save_path="./output_mockaroo/")
fab.dump_meta_data(output_file="./output_mockaroo/meta_data.json")
```
## License
Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for the full license text.