https://github.com/weskal/vexus_pipeline

Automated pipeline for generating, ingesting, and validating realistic data, designed to simulate real-world workflows with scheduling, data quality checks, and version control.
airflow data pipeline python sqlserver workflow

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/weskal/vexus_pipeline
Owner: Weskal
Created: 2025-05-28T01:38:17.000Z (about 1 year ago)
Default Branch: dev
Last Pushed: 2025-05-30T00:36:10.000Z (about 1 year ago)
Last Synced: 2025-06-06T05:26:30.084Z (about 1 year ago)
Topics: airflow, data, pipeline, python, sqlserver, workflow
Language: Python
Homepage:
Size: 6.84 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md

README

# Vexus Data Pipeline
Automated Data Pipeline with Airflow, Python, and SQL Server

## Description

This project implements an automated pipeline for generating, ingesting, and storing simulated sales order data for a fictitious clothing store called Vexus. It uses Python to create the data and send it to the database, Apache Airflow for orchestration, and SQL Server for storage and querying.

## Features

- Daily generation of simulated data based on real products from the SQL Server database.
- Storage of data in CSV files organized by date.
- Automatic ingestion of CSV files into the SQL Server database.
- Data validation to avoid duplication and ensure integrity.
- Pipeline orchestrated and scheduled via Apache Airflow.
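
The generation, date-organized storage, and duplicate-avoidance features above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the repository's actual code: the `PRODUCTS` list, the column names, the `order_id` scheme, and the `<base_dir>/<YYYY-MM-DD>/orders.csv` layout are all assumptions, and an in-memory set of IDs stands in for a lookup against SQL Server.

```python
import csv
import random
from datetime import date
from pathlib import Path

# Hypothetical product catalog; the real project reads products from SQL Server.
PRODUCTS = [
    {"product_id": 1, "name": "T-shirt", "price": 49.90},
    {"product_id": 2, "name": "Jeans", "price": 129.90},
]

# Assumed column names for the simulated sales orders.
FIELDS = ["order_id", "order_date", "product_id", "quantity", "total"]


def generate_orders(products, day, count, seed=None):
    """Build `count` simulated order rows for one day."""
    rng = random.Random(seed)
    orders = []
    for i in range(1, count + 1):
        product = rng.choice(products)
        quantity = rng.randint(1, 5)
        orders.append({
            # Assumed ID scheme: date plus a running number, so a rerun
            # for the same day produces the same IDs.
            "order_id": f"{day:%Y%m%d}-{i:04d}",
            "order_date": day.isoformat(),
            "product_id": product["product_id"],
            "quantity": quantity,
            "total": round(product["price"] * quantity, 2),
        })
    return orders


def write_daily_csv(orders, base_dir, day):
    """Write the orders to <base_dir>/<YYYY-MM-DD>/orders.csv (assumed layout)."""
    target = Path(base_dir) / day.isoformat() / "orders.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(orders)
    return target


def filter_new_rows(csv_path, existing_ids):
    """Return only the CSV rows whose order_id is not already stored."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return [row for row in csv.DictReader(handle)
                if row["order_id"] not in existing_ids]
```

In this sketch, a daily run would call `generate_orders`, then `write_daily_csv` with `raw_data` as the base directory, and the ingestion step would insert only the rows returned by `filter_new_rows`, so re-running the same day does not duplicate orders.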

## Technologies Used

- Python
- Apache Airflow
- SQL Server
- pandas
- pyodbc

## Project Structure

- scripts
  - `__init__.py`
  - `db_utils.py`
  - `fake_data_generator.py`
  - `ingestion.py`
- .env
- airflow
- raw_data
- .gitignore
- readme.md
- requirements.txt

## How to Run

### 1. Create and configure the `.env` file with your database credentials:

```env
DB_SERVER=your_server
DB_DATABASE=your_database
DB_USER=your_user
DB_PASSWORD=your_password
```
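
As an illustration of how these four settings might be turned into a SQL Server connection, here is a small sketch. It is not taken from `db_utils.py`: the helper names `load_env` and `build_connection_string` are hypothetical, the ODBC driver name is an assumption that depends on what is installed on your machine, and the project itself may load the file differently (for example with a dotenv library).

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def build_connection_string(env, driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string from the four DB_* settings."""
    required = ("DB_SERVER", "DB_DATABASE", "DB_USER", "DB_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return (
        f"DRIVER={{{driver}}};"  # assumed driver name; adjust to your install
        f"SERVER={env['DB_SERVER']};"
        f"DATABASE={env['DB_DATABASE']};"
        f"UID={env['DB_USER']};"
        f"PWD={env['DB_PASSWORD']}"
    )
```

The resulting string can be passed to `pyodbc.connect(...)`. Failing early on a missing setting makes a misconfigured `.env` easier to spot than a driver-level login error.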

### 2. Inside your project folder, create and activate a Python virtual environment:
```bash
python -m venv venv
# Activate it with the command for your platform:
source venv/bin/activate   # Linux/macOS
.\venv\Scripts\activate    # Windows
```

### 3. Install dependencies:
```bash
pip install -r requirements.txt
```

### 4. Inside the virtual environment, run Airflow:
```bash
airflow standalone
```

## 🚀 Next Steps
🔁 Integration with Jenkins for automated deployment (CI/CD)

✅ Ensure idempotency in executions

🔍 Implement data quality and consistency checks

📈 Monitoring and alerting for pipeline failures

--------------------

## 📢 Contribution
Feel free to open issues or submit PRs with improvements, ideas, or fixes.

## 🧑‍💻 Author
Developed by __Gabriel Paliato__

## 📲 Connect with me on LinkedIn
I am always open to exchanging experiences, sharing knowledge, and exploring new opportunities in Data Engineering, Infrastructure, and DevOps.

Let's connect!
https://www.linkedin.com/in/gabriel-paliato-49467b211/
