https://github.com/bongomin/etl-pipeline-application
an ETL pipeline application that uses Python and a PostgreSQL database.
- Host: GitHub
- URL: https://github.com/bongomin/etl-pipeline-application
- Owner: bongomin
- Created: 2023-08-15T02:51:32.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-08-16T05:59:17.000Z (over 1 year ago)
- Last Synced: 2024-11-10T16:12:35.163Z (3 months ago)
- Language: Python
- Size: 11.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# etl-pipeline-application
etl-pipeline-application implements an ETL (Extract, Transform, Load) pipeline for processing health-related data from an external API `https://ghoapi.azureedge.net/api/Dimension` and storing it in a PostgreSQL database.
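To make the three stages concrete, here is a minimal sketch of how the transform and validate steps could sit between extraction and loading. It is not the repository's code: the record shape (`Code` and `Title` keys), the output field names, and the helpers `transform_record`, `is_valid`, and `run_pipeline` are all assumptions made for illustration, and the extract and load steps are passed in as plain callables so the example runs without a network or a database.

```python
from typing import Callable, Iterable

# Assumption: every output record must carry these two fields.
REQUIRED_FIELDS = ("code", "title")


def transform_record(raw: dict) -> dict:
    """Map an (assumed) raw record onto lowercase keys with trimmed string values."""
    return {
        "code": str(raw.get("Code") or "").strip(),
        "title": str(raw.get("Title") or "").strip(),
    }


def is_valid(record: dict) -> bool:
    """A record is valid when every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def run_pipeline(
    extract: Callable[[], Iterable[dict]],
    load: Callable[[list], None],
) -> int:
    """Extract raw records, transform them, drop invalid ones, and load the rest.

    Returns the number of records handed to the load step.
    """
    valid = [rec for rec in map(transform_record, extract()) if is_valid(rec)]
    load(valid)
    return len(valid)
```

Keeping `transform_record` and `is_valid` free of I/O means they can be unit-tested on their own, while the real extract and load callables would wrap the API client and the PostgreSQL insert.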
# etl-pipeline-application Overview
In this Python script, I've designed an ETL (Extract, Transform, Load) process to retrieve, transform, and store health-related data from a RESTful API in a PostgreSQL database. To maintain security, sensitive settings such as the API URL and database credentials are loaded from a `.env` file using the dotenv library. The script uses the requests library to extract paginated data from the API, and the `transform_data` function then converts it into a standardized format. The `validate_data` function checks that the essential fields are present in the transformed data. For the loading phase, the script connects to the database using psycopg2, creates a table if it doesn't exist, and inserts the validated data into it. The `main` function orchestrates these steps, including resuming extraction from a saved state. Delays between API requests avoid overwhelming the API, and error handling addresses failures during the ETL process, which makes the pipeline more robust and the stored data more reliable.
## Table of Contents
- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
## Introduction
The ETL pipeline script extracts data from an external API, transforms it into a suitable format, validates the data, and loads it into a PostgreSQL database table. The pipeline is designed to handle large datasets and supports resuming the extraction process from the last saved state.
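The resume behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual logic: the JSON state file, its `next_page` key, and the helpers `load_state`, `save_state`, and `extract_pages` are hypothetical, and the page fetcher is injected as a callable because the way the API paginates is not stated here.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterator


def load_state(path: Path) -> int:
    """Return the next page to fetch, or 0 when no state has been saved yet."""
    if path.exists():
        return int(json.loads(path.read_text())["next_page"])
    return 0


def save_state(path: Path, next_page: int) -> None:
    """Persist the index of the next page that still needs to be fetched."""
    path.write_text(json.dumps({"next_page": next_page}))


def extract_pages(
    fetch_page: Callable[[int], list],
    state_path: Path,
    delay_seconds: float = 1.0,
) -> Iterator[list]:
    """Yield pages of records, starting from the saved state.

    The state only advances once the caller asks for the following page,
    so a page that was yielded but not fully processed is fetched again
    on the next run.
    """
    page = load_state(state_path)
    while True:
        records = fetch_page(page)
        if not records:  # an empty page is treated as the end of the data
            break
        yield records
        page += 1
        save_state(state_path, page)
        time.sleep(delay_seconds)  # pause between requests to go easy on the API
```

In a real run, `fetch_page` would presumably wrap a `requests.get` call against `API_URL`. Because an interrupted page is fetched again after a restart, the load step should tolerate seeing the same record twice.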
## Prerequisites
- Python 3.x
- PostgreSQL database
- API access URL
- [dotenv](https://pypi.org/project/python-dotenv/) package (`pip install python-dotenv`)
## Installation
1. Clone the repository:
   - `git clone git@github.com:bongomin/etl-pipeline-application.git`
   - `cd etl-pipeline-application`
2. Install the required packages by running: `pip install -r requirements.txt`
## Configuration
1. Create a `.env` file in the same directory as the script.
2. Add the following environment variables to the `.env` file:
```ini
API_URL=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_HOST=
```
## Usage
- Run the script using the following command:
  - `python gho_etl.py`
- Run the tests using the following command:
  - `python test_etl.py`
## Developer
- Bongomin Daniel
## Screenshots
- Running the program
- Data saved in the database