https://github.com/fvaleye/metadata-guardian

Provide an easy way with Python to protect your data sources by searching its metadata. 🛡️

data dataengineering dataset datastructures metadata metadata-driven metadata-extraction metadata-information metadata-management metadata-parser pii-detection


Host: GitHub
URL: https://github.com/fvaleye/metadata-guardian
Owner: fvaleye
License: apache-2.0
Created: 2021-10-12T09:55:19.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2026-01-15T10:02:37.000Z (5 months ago)
Last Synced: 2026-01-15T15:44:41.470Z (5 months ago)
Topics: data, dataengineering, dataset, datastructures, metadata, metadata-driven, metadata-extraction, metadata-information, metadata-management, metadata-parser, pii-detection
Language: Python
Homepage: https://fvaleye.github.io/metadata-guardian/python
Size: 16.7 MB
Stars: 18
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.adoc
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS


image::logo.png[Metadata Guardian logo]

image:https://github.com/fvaleye/metadata-guardian/actions/workflows/python_build.yml/badge.svg[python-build, link=https://github.com/fvaleye/metadata-guardian/actions/workflows/python_build.yml]

image:https://github.com/fvaleye/metadata-guardian/actions/workflows/rust_build.yml/badge.svg[rust-build, link=https://github.com/fvaleye/metadata-guardian/actions/workflows/rust_build.yml]

image:https://img.shields.io/badge/docs-python-blue.svg?style=flat-square[Docs,link=https://fvaleye.github.io/metadata-guardian/python]

image:https://img.shields.io/pypi/v/metadata_guardian.svg?style=flat-square[Pypi, link=https://pypi.org/project/metadata-guardian/]

== 📌 Overview

Metadata Guardian is a Python package that provides an easy way to protect your data sources by searching their metadata.

It scans the metadata against a set of data rules and reports every match, so you can find what you need to protect.

The multi-regex matching is implemented in Rust for speed.

Read more in this https://medium.com/@florian.valeye/metadata-guardian-protect-your-data-by-searching-its-metadata-fe479c24f1b1[article].

== 📦 Where to get it

```sh
# Install all the data sources
pip install 'metadata_guardian[all]'
```

```sh
# Install one or more data sources from the list
pip install 'metadata_guardian[snowflake,avro,aws,gcp,deltalake,kafka_schema_registry,mysql]'
```

== 📜 Data Rules

The available data rules are here: *https://github.com/fvaleye/metadata-guardian/blob/main/python/metadata_guardian/rules/pii_rules.yaml[PII]* and *https://github.com/fvaleye/metadata-guardian/blob/main/python/metadata_guardian/rules/inclusion_rules.yaml[INCLUSION]*.

You can also define your own custom data rules to suit your needs.
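To make the idea of a data rule concrete, here is a minimal sketch of matching named regex rules against column names. It uses only Python's standard `re` module, not the package's Rust engine, and the rule names, patterns, and the `scan_columns` helper are hypothetical illustrations rather than the contents of the rule files linked above.

```python
import re

# Hypothetical rules for illustration only; they are not copied from pii_rules.yaml.
RULES = {
    "email": re.compile(r"\b(e[-_ ]?mail)\b", re.IGNORECASE),
    "phone": re.compile(r"\bphone\b", re.IGNORECASE),
}


def scan_columns(columns):
    """Return {column_name: [rule names]} for columns matching at least one rule."""
    report = {}
    for column in columns:
        # Underscores count as word characters for \b, so split the name into words first.
        text = column.replace("_", " ")
        matched = [name for name, pattern in RULES.items() if pattern.search(text)]
        if matched:
            report[column] = matched
    return report


result = scan_columns(["user_email", "phone_number", "created_at"])
print(result)
```

In this sketch each rule is just a name attached to a pattern, which is why adding a custom rule amounts to supplying one more name and regex.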

== 📊 Data Sources

=== Local

- Parquet

- ORC

- AVRO

- AVRO Schema

- Arrow

=== External

- AWS: Athena and Glue

- Deltalake

- GCP: BigQuery

- Snowflake

- MySQL

- Kafka Schema Registry

== 🔎 Usage

With available Data Rules:

```python
from metadata_guardian import (
    AvailableCategory,
    ColumnScanner,
    DataRules,
)
from metadata_guardian.source import MySQLSource

# Connection settings for the MySQL instance to scan.
source = MySQLSource(
    user="root",
    password="12345678",
    host="localhost",
)

# Load the built-in PII rules and build a scanner from them.
data_rules = DataRules.from_available_category(category=AvailableCategory.PII)
column_scanner = ColumnScanner(data_rules=data_rules)

# Scan the column metadata (including comments) and print the report.
with source:
    report = column_scanner.scan_external(
        source,
        database_name="sequelmovie",
        include_comment=True,
    )
    report.to_console()
```

With custom Data Rules:

```python
from metadata_guardian import (
    ColumnScanner,
    DataRule,
    DataRules,
)
from metadata_guardian.source import MySQLSource

# Connection settings for the MySQL instance to scan.
source = MySQLSource(
    user="root",
    password="12345678",
    host="localhost",
)

# Define a custom rule. The raw string keeps "\b" as a regex word boundary
# instead of letting Python turn it into a backspace character.
category = "example"
data_rule = DataRule(
    rule_name="example_rule_name",
    regex_pattern=r"\b(test|example)\b",
    documentation="example_test",
)

# Group the custom rules under a new category and build a scanner from them.
data_rules = DataRules.from_new_category(category=category, data_rules=[data_rule])
column_scanner = ColumnScanner(
    data_rules=data_rules, progression_bar_disabled=False
)

# Scan the column metadata (including comments) and print the report.
with source:
    report = column_scanner.scan_external(
        source,
        database_name="sequelmovie",
        include_comment=True,
    )
    report.to_console()
```
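When a `regex_pattern` relies on `\b`, how the Python string literal is written matters. The sketch below uses only the standard `re` module to show the difference between a plain and a raw string; the sample text is made up, and it says nothing about how the package's Rust engine compiles patterns.

```python
import re

# Without the r prefix, Python turns "\b" into a backspace character (0x08).
plain_pattern = "\b(test|example)\b"
# With the r prefix, the backslash reaches the regex engine as a word boundary.
raw_pattern = r"\b(test|example)\b"

# Hypothetical column comment used only for this demonstration.
column_comment = "an example column"

plain_match = re.search(plain_pattern, column_comment)
raw_match = re.search(raw_pattern, column_comment)

print(plain_match)         # None: the text contains no backspace characters
print(raw_match.group(0))  # example
```

Writing custom patterns as raw strings avoids this class of silent mismatch.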

== 🛡️ Licence

https://raw.githubusercontent.com/fvaleye/metadata-guardian/main/LICENSE.txt[Apache License 2.0]

== 📚 Documentation

The documentation is hosted here: https://fvaleye.github.io/metadata-guardian/python/
