Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fistgang/processemployeesetl
An Apache Airflow ETL pipeline to process employee data by downloading, validating, transforming, and loading it into a PostgreSQL database.
airflow airflow-docker-containers dag docker-compose etl-pipeline python
- Host: GitHub
- URL: https://github.com/fistgang/processemployeesetl
- Owner: FistGang
- License: MIT
- Created: 2024-06-11T06:42:25.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2024-06-11T13:06:23.000Z (6 months ago)
- Last Synced: 2024-06-11T14:34:04.811Z (6 months ago)
- Topics: airflow, airflow-docker-containers, dag, docker-compose, etl-pipeline, python
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Process Employees ETL Pipeline
An Apache Airflow ETL pipeline to process employee data by downloading, validating, transforming, and loading it into a PostgreSQL database.
## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [ETL Pipeline Overview](#etl-pipeline-overview)
- [Contributing](#contributing)
- [License](#license)

## Prerequisites
- **Python 3.9+**
- **Apache Airflow 2.8+**
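A quick check that the Python requirement is met (assuming `python3` is on your `PATH`):

```sh
python3 --version   # expect 3.9 or newer
```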
## Installation

1. **Clone the repository**:
```sh
git clone https://github.com/FistGang/ProcessEmployeesETL.git
cd ProcessEmployeesETL
```

2. **Set up a virtual environment** (optional):
```sh
python3 -m venv venv
source venv/bin/activate
```

3. **Install dependencies**:
```sh
pip install -r requirements.txt
```

4. **Configure Airflow**:
- Initialize the Airflow database:

```sh
airflow db init
```

- Create a PostgreSQL connection in Airflow with ID `pg_conn` (via the web UI, or with the CLI command sketched after this list):
- Connection Id: `pg_conn`
- Connection Type: postgres
- Host: postgres
- Schema: airflow
- Login: airflow
- Password: airflow
- Port: 5432
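As an alternative to the web UI, the same connection can be created with the Airflow CLI; the values below mirror the list above, so adjust them to your environment:

```sh
airflow connections add 'pg_conn' \
    --conn-type 'postgres' \
    --conn-host 'postgres' \
    --conn-schema 'airflow' \
    --conn-login 'airflow' \
    --conn-password 'airflow' \
    --conn-port '5432'
```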
5. **Start Airflow**:

```sh
airflow webserver --port 8080
airflow scheduler
```

6. **Deploy the DAG**:
- Place `process_employees.py` in your Airflow DAGs folder.
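If you are unsure where that folder is, Airflow can print the configured path:

```sh
airflow config get-value core dags_folder
```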
## Using Containers

```bash
# Download the docker-compose.yaml file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

# Make expected directories and set an expected environment variable
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env# Initialize the database (optional)
docker-compose up airflow-init

# Start up all services
docker-compose up
```

Then create a PostgreSQL connection in Airflow with ID `pg_conn`, using the same settings as above.
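One way to do that without opening the UI (a sketch, assuming the stock `docker-compose.yaml` service names and default credentials) is to run the same `airflow connections add` command from the Installation section inside a running container:

```bash
# 'airflow-webserver' is the service name used by the stock docker-compose.yaml
docker-compose exec airflow-webserver airflow connections add 'pg_conn' \
    --conn-type 'postgres' \
    --conn-host 'postgres' \
    --conn-schema 'airflow' \
    --conn-login 'airflow' \
    --conn-password 'airflow' \
    --conn-port '5432'
```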
## Usage
1. **Trigger the DAG**:
- Open the Airflow web UI at `http://localhost:8080`.
- Activate and manually trigger the `process_employees_etl_pipeline` DAG.
2. **Monitor**:
- Use the Airflow UI to monitor DAG execution and check logs for errors.
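Both steps can also be driven from the command line, using the DAG id above:

```sh
airflow dags unpause process_employees_etl_pipeline     # activate
airflow dags trigger process_employees_etl_pipeline     # trigger a run
airflow dags list-runs -d process_employees_etl_pipeline  # monitor run states
```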
## ETL Pipeline Overview

1. **Create Tables**:
- `employees`: Main table.
- `employees_temp`: Temporary table for data processing.
2. **Get Data**:
- Download and validate the employee CSV file.
3. **Transform Data**:
- Add a processed timestamp, format columns, and categorize leave.
4. **Load Data**:
- Load transformed data into `employees_temp`.
5. **Merge Data**:
- Merge data from `employees_temp` into `employees`.
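To make the shape of the pipeline concrete, here is a minimal sketch of how these five steps could map onto an Airflow DAG. The table columns, CSV field names, and source URL are placeholders, not the repository's actual code:

```python
import datetime

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.postgres.operators.postgres import PostgresOperator


@dag(
    dag_id="process_employees_etl_pipeline",
    schedule=None,
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
)
def process_employees():
    # 1. Create Tables: employees (main) and employees_temp (staging).
    create_tables = PostgresOperator(
        task_id="create_tables",
        postgres_conn_id="pg_conn",
        sql="""
            CREATE TABLE IF NOT EXISTS employees (
                serial_number INTEGER PRIMARY KEY,
                name TEXT,
                processed_at TIMESTAMP
            );
            DROP TABLE IF EXISTS employees_temp;
            CREATE TABLE employees_temp (LIKE employees);
        """,
    )

    # 2-4. Get, transform, and load data into the staging table.
    @task
    def get_transform_load():
        import csv
        import io

        import requests

        # Placeholder URL: the real source lives in the repo's DAG.
        resp = requests.get("https://example.com/employees.csv", timeout=30)
        resp.raise_for_status()
        rows = list(csv.DictReader(io.StringIO(resp.text)))  # validate: parseable CSV
        now = datetime.datetime.utcnow()  # processed timestamp
        hook = PostgresHook(postgres_conn_id="pg_conn")
        hook.insert_rows(
            table="employees_temp",
            rows=[(int(r["serial_number"]), r["name"].strip(), now) for r in rows],
            target_fields=["serial_number", "name", "processed_at"],
        )

    # 5. Merge Data: upsert staging rows into the main table.
    merge_data = PostgresOperator(
        task_id="merge_data",
        postgres_conn_id="pg_conn",
        sql="""
            INSERT INTO employees
            SELECT * FROM employees_temp
            ON CONFLICT (serial_number) DO UPDATE
            SET name = EXCLUDED.name, processed_at = EXCLUDED.processed_at;
        """,
    )

    create_tables >> get_transform_load() >> merge_data


process_employees()
```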
## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---
For questions or issues, please open an issue on GitHub.