https://github.com/nathadriele/datamart-tables-data-type-validation
This project is a Data Engineering solution implemented to validate the data types of columns in PostgreSQL tables in a DataMart. It aims to validate whether the data stored in tables conforms to the expected data types, improving data integrity and reliability.
https://github.com/nathadriele/datamart-tables-data-type-validation
data-engineering datamart mageai monitoring-scripts python validation
Last synced: 2 months ago
JSON representation
This project is a Data Engineering solution implemented to validate the data types of columns in PostgreSQL tables in a DataMart. It aims to validate whether the data stored in tables conforms to the expected data types, improving data integrity and reliability.
- Host: GitHub
- URL: https://github.com/nathadriele/datamart-tables-data-type-validation
- Owner: nathadriele
- Created: 2024-07-27T17:08:13.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-07-28T19:27:17.000Z (10 months ago)
- Last Synced: 2025-01-29T19:11:51.048Z (4 months ago)
- Topics: data-engineering, datamart, mageai, monitoring-scripts, python, validation
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Data Mart Tables Data Type Validation
### Overview
This project is a Data Engineering solution implemented in Mage.ai locally for validating the data types of columns in PostgreSQL tables within a Data Mart. It ensures that data stored in the tables conforms to the expected data types, enhancing data integrity and reliability. This validation is crucial for maintaining high-quality data, which is essential for accurate data analysis and reporting.## Getting Started
### Prerequisites
Ensure you have the following software installed:- Python 3.7 or higher
- PostgreSQL
- A tool to manage virtual environments (e.g., virtualenv or conda)### Setup
1. Clone the repository:```py
git clone https://github.com/nathadriele/datamart-tables-data-type-validation.git
```2. Create and activate a virtual environment:
```py
virtualenv venv
source venv/bin/activate
```3. Install the required dependencies:
```py
pip install -r requirements.txt
```4. Set up environment variables:
Configure your environment variables for PostgreSQL connection details. You can use a .env file or set them directly in your environment.
### Running the Validation Script
Execute the data validation script using:```py
datamart_data_type_validation.py
```### Functions
- `connect_to_postgres()`
- Establishes and returns a connection to the PostgreSQL database.
- Uses secrets to fetch connection details.
- Importance: Ensures secure and reliable connection to the database.- `validate_data_types(cursor, table_name, columns, limit=1000)`
- Validates column data types in the specified table.
- Returns rows with data type mismatches.
- Importance: Ensures data integrity by verifying that the data types in the database match the expected types.- `validate_type(value, expected_type)`
- Validates if the value matches the expected data type.
- Importance: Provides a mechanism to check each value against its expected data type, enhancing data accuracy.- `send_event_to_notifier(event_name, status)`
- Sends a notification about the validation event to an external service.
- Importance: Alerts the system about validation results, allowing for proactive data management.- `check_table_updates(postgres_conn)`
- Checks for updates in specified tables and validates data types.
- Sends validation results as notifications.
- Importance: Automates the validation process, ensuring continuous data quality monitoring.### Configuration
The tables and columns to be validated are defined in the check_table_updates function. You can modify the tables_columns dictionary to include your specific tables and columns.### Logging
The script uses the Python logging module for logging information, errors, and debug messages. This helps in tracking the script's execution and identifying issues.### Testing
The script includes basic testing functions to ensure the data loader function works as expected.### Metadata Configuration
The metadata.yaml file contains the configuration for the Mage.ai block used in this project. It defines the block's properties, execution type, and other settings that control how the data validation process is executed.### Trigger Configuration
The triggers.yaml file defines the schedule and settings for triggering the data validation pipeline. It specifies the frequency (@daily), start time, and other settings necessary for automating the execution of the validation process.### Contribution to Data Engineering
This project plays a vital role in Data Engineering by ensuring that the data stored in PostgreSQL tables conforms to the expected data types. Data Engineers rely on accurate and reliable data to build data pipelines, perform ETL processes, and generate meaningful insights.
By validating data types:
- Data Integrity: Ensures that the data is consistent and conforms to the defined schema.
- Error Reduction: Identifies and mitigates data type mismatches early, reducing errors in downstream processes.
- Automated Monitoring: Automates data validation, allowing Data Engineers to focus on more complex tasks.
- Proactive Alerts: Sends notifications about validation results, enabling timely interventions to maintain data quality.