https://github.com/fvaleye/metadata-guardian
Provide an easy way with Python to protect your data sources by searching its metadata. 🛡️
https://github.com/fvaleye/metadata-guardian
data dataengineering dataset datastructures metadata metadata-driven metadata-extraction metadata-information metadata-management metadata-parser pii-detection
Last synced: 28 days ago
JSON representation
Provide an easy way with Python to protect your data sources by searching its metadata. 🛡️
- Host: GitHub
- URL: https://github.com/fvaleye/metadata-guardian
- Owner: fvaleye
- License: apache-2.0
- Created: 2021-10-12T09:55:19.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2026-01-15T10:02:37.000Z (about 2 months ago)
- Last Synced: 2026-01-15T15:44:41.470Z (about 2 months ago)
- Topics: data, dataengineering, dataset, datastructures, metadata, metadata-driven, metadata-extraction, metadata-information, metadata-management, metadata-parser, pii-detection
- Language: Python
- Homepage: https://fvaleye.github.io/metadata-guardian/python
- Size: 16.7 MB
- Stars: 18
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.adoc
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
Awesome Lists containing this project
README
image::logo.png[Metadata Guardian logo]
image:https://github.com/fvaleye/metadata-guardian/actions/workflows/python_build.yml/badge.svg[![python-build, link=https://github.com/fvaleye/metadata-guardian/actions/workflows/python_build.yml]
image:https://github.com/fvaleye/metadata-guardian/actions/workflows/rust_build.yml/badge.svg[![rust-build, link=https://github.com/fvaleye/metadata-guardian/actions/workflows/rust_build.yml]
image:https://img.shields.io/badge/docs-python-blue.svg?style=flat-square[Docs,link=https://fvaleye.github.io/metadata-guardian/python]
image:https://img.shields.io/pypi/v/metadata_guardian.svg?style=flat-square)[Pypi, link=https://pypi.org/project/metadata-guardian/]
== 📌 Overview
Metadata Guardian is a Python package that provides an easy way to protect your data sources by searching its metadata.
By searching with data rules, it will detect what you are looking to protect.
Using Rust, it makes blazing fast multi-regex matching.
Read more in this https://medium.com/@florian.valeye/metadata-guardian-protect-your-data-by-searching-its-metadata-fe479c24f1b1[article].
== 📦 Where to get it
```sh
# Install all the data sources
pip install 'metadata_guardian[all]'
```
```sh
# Install one or more data sources from the list
pip install 'metadata_guardian[snowflake,avro,aws,gcp,deltalake,kafka_schema_registry,mysql]'
```
== 📜 Data Rules
The available data rules are here: *https://github.com/fvaleye/metadata-guardian/blob/main/python/metadata_guardian/rules/pii_rules.yaml[PII]* and *https://github.com/fvaleye/metadata-guardian/blob/main/python/metadata_guardian/rules/inclusion_rules.yaml[INCLUSION]*.
But you could also your custom data rules to suit your needs.
== 📊 Data Sources
=== Local
- Parquet
- ORC
- AVRO
- AVRO Schema
- Arrow
=== External
- AWS: Athena and Glue
- Deltalake
- GCP: BigQuery
- Snowflake
- MySQL
- Kafka Schema Registry
== 🔎 Usage
With available Data Rules:
```python
from metadata_guardian import (
AvailableCategory,
ColumnScanner,
DataRules,
)
from metadata_guardian.source import MySQLSource
source = MySQLSource(
user="root",
password="12345678",
host="localhost",
)
data_rules = DataRules.from_available_category(category=AvailableCategory.PII)
column_scanner = ColumnScanner(data_rules=data_rules)
with source:
report = column_scanner.scan_external(
source,
database_name="sequelmovie",
include_comment=True,
)
report.to_console()
```
With custom Data Rules:
```python
from metadata_guardian import (
AvailableCategory,
ColumnScanner,
DataRule,
DataRules,
)
from metadata_guardian.source import MySQLSource
source = MySQLSource(
user="root",
password="12345678",
host="localhost",
)
category = "example"
data_rule = DataRule(
rule_name="example_rule_name",
regex_pattern="\b(test|example)\b",
documentation="example_test",
)
data_rules = [data_rule]
data_rules = DataRules.from_new_category(category=category, data_rules=data_rules)
column_scanner = ColumnScanner(
data_rules=data_rules, progression_bar_disabled=False
)
with source:
report = column_scanner.scan_external(
source,
database_name="sequelmovie",
include_comment=True,
)
report.to_console()
```
== 🛡️ Licence
https://raw.githubusercontent.com/fvaleye/metadata-guardian/main/LICENSE.txt[Apache License 2.0]
== 📚 Documentation
The documentation is hosted here: https://fvaleye.github.io/metadata-guardian/python/