https://github.com/tata-fs4/data_governance
Data governance framework with cataloging, access control, lineage tracking and data quality pipelines for medallion lakehouses
best-practices compliance data-catalog data-governance data-lineage data-quality dbt lgpd master-data-management medallion-architecture privacy python security
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/tata-fs4/data_governance
- Owner: tata-fs4
- Created: 2025-10-29T02:59:45.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-12-09T21:54:27.000Z (6 months ago)
- Last Synced: 2026-06-10T02:38:11.923Z (3 days ago)
- Topics: best-practices, compliance, data-catalog, data-governance, data-lineage, data-quality, dbt, lgpd, master-data-management, medallion-architecture, privacy, python, security
- Language: Python
- Homepage:
- Size: 25.4 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Governance Project
This project demonstrates a data governance pipeline focused on compliance, featuring data cataloging, access control, lineage tracking, and automated quality validations.
The goal is to align data delivery with **LGPD** and **ISO 27001** guidelines.
## Structure
```
data_governance/
├── config/
│ └── policies.yaml # Regulatory policies, access rules, quality rules
├── data/
│ ├── raw/ # Mocked source data
│ └── processed/ # Governed outputs produced by the pipeline
├── logs/ # Audit records
├── src/
│ └── data_governance/ # Framework implementation
│ ├── access_control.py
│ ├── catalog.py
│ ├── lineage.py
│ ├── pipeline.py
│ ├── policies.py
│ └── quality.py
└── README.md
```
## Features
- **Data Catalog**
Registers assets with metadata, sensitivity classification, tags, and regulatory compliance attributes.
- **Access Control**
Enforces role-based policies aligned with LGPD and ISO 27001 requirements.
- **Lineage Tracking**
Captures end-to-end transformations, mapping inputs and outputs for auditability.
- **Automated Data Quality**
Validates consent recency, email format, monetary values, and other business rules.
- **Auditing**
Generates JSON logs containing evidence of execution, loaded policies, and identified quality issues.
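As an illustration of the quality and auditing features above, the following is a minimal sketch and not the repository's `quality.py`. The field names (`email`, `amount`), the simplified email pattern, and the layout of the audit entry are all assumptions made for this example.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical rule: a deliberately simple email pattern, not a full RFC 5322 check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_record(record):
    """Return a list of human-readable issues for one record (empty if clean)."""
    issues = []
    email = record.get("email") or ""
    if not EMAIL_RE.match(email):
        issues.append("invalid email format")
    amount = record.get("amount")
    # bool is a subclass of int, so it is rejected explicitly.
    if not isinstance(amount, (int, float)) or isinstance(amount, bool) or amount < 0:
        issues.append("amount must be a non-negative number")
    return issues


def build_audit_entry(records):
    """Collect per-record issues into a JSON-serializable audit entry."""
    findings = []
    for index, record in enumerate(records):
        issues = validate_record(record)
        if issues:
            findings.append({"row": index, "issues": issues})
    return {
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "records_checked": len(records),
        "quality_issues": findings,
    }


# Made-up sample rows, used only to exercise the sketch.
sample = [
    {"email": "ana@example.com", "amount": 120.5},
    {"email": "not-an-email", "amount": -3},
]
print(json.dumps(build_audit_entry(sample), indent=2))
```

Returning issues as plain strings keeps the audit entry trivially serializable. A real implementation would likely load its rules from `config/policies.yaml` rather than hard-coding them.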
## Running the Pipeline
1. (Optional) Create a virtual environment and install dependencies:
```bash
pip install pandas pyyaml
```
2. Run the pipeline:
```bash
python -m data_governance.src.data_governance.pipeline
```
### Programmatic execution
```python
from data_governance.src.data_governance.pipeline import run_pipeline
audit_log = run_pipeline("data_governance")
print(audit_log)
```
## Compliance
### LGPD
- The pipeline filters out customers without valid consent and tracks the latest update timestamp.
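The consent rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual logic: the column names (`consent`, `consent_updated_at`), the 365-day window, and the reading of "latest update timestamp" as the most recent consent update among kept rows are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy value; a real deployment would read it from config/policies.yaml.
MAX_CONSENT_AGE = timedelta(days=365)


def filter_consented(customers, now):
    """Keep customers whose consent is granted and recent enough.

    Returns the kept rows and the most recent consent update among them
    (None when nothing is kept).
    """
    kept = []
    for customer in customers:
        updated_at = datetime.fromisoformat(customer["consent_updated_at"])
        if customer["consent"] and now - updated_at <= MAX_CONSENT_AGE:
            kept.append(customer)
    latest = max(
        (datetime.fromisoformat(c["consent_updated_at"]) for c in kept),
        default=None,
    )
    return kept, latest


# Made-up rows: one valid, one without consent, one with stale consent.
customers = [
    {"id": 1, "consent": True, "consent_updated_at": "2025-06-01T00:00:00+00:00"},
    {"id": 2, "consent": False, "consent_updated_at": "2025-06-15T00:00:00+00:00"},
    {"id": 3, "consent": True, "consent_updated_at": "2023-01-01T00:00:00+00:00"},
]
now = datetime(2025, 10, 1, tzinfo=timezone.utc)
kept, latest = filter_consented(customers, now)
print([c["id"] for c in kept], latest)
```

Passing `now` as a parameter instead of calling `datetime.now()` inside the function keeps the check reproducible, which helps when the result has to be explained in an audit.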
### ISO 27001
- Documented access policies and lineage records provide evidence for security and compliance audits.
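To make the idea of access policies and lineage records as audit evidence concrete, here is a minimal sketch. It does not reflect the repository's `access_control.py` or `lineage.py`. The role names, sensitivity labels, `LineageRecord` fields, and the `customers.csv` file names are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Hypothetical policy: the sensitivity levels each role is allowed to read.
ROLE_POLICIES = {
    "analyst": {"public", "internal"},
    "dpo": {"public", "internal", "personal"},
}


def can_read(role, sensitivity):
    """Deny by default: unknown roles get no access."""
    return sensitivity in ROLE_POLICIES.get(role, set())


@dataclass
class LineageRecord:
    """One transformation step, with its inputs and outputs, for audit trails."""
    step: str
    inputs: list
    outputs: list
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = LineageRecord(
    step="filter_consent",
    inputs=["data/raw/customers.csv"],        # hypothetical file name
    outputs=["data/processed/customers.csv"],  # hypothetical file name
)
print(can_read("analyst", "personal"), can_read("dpo", "personal"))
print(asdict(record))
```

Because `asdict(record)` yields a plain dictionary, records of this shape can be appended directly to a JSON audit log.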
## Next Steps
- Integrate with a corporate catalog (e.g., Apache Atlas).
- Automate ingestion of policies from a GRC platform.
- Extend quality rules with statistical profiling and anomaly detection.