# etl-pipeline-application

etl-pipeline-application implements an ETL (Extract, Transform, Load) pipeline for processing health-related data from the external API (`https://ghoapi.azureedge.net/api/Dimension`) and storing it in a PostgreSQL database.

## Overview

This Python application implements an ETL (Extract, Transform, Load) process that retrieves, transforms, and stores health-related data from a RESTful API in a PostgreSQL database. To keep secrets out of the code, sensitive values such as the API URL and database credentials are loaded from a `.env` file using the `python-dotenv` library. The script uses the `requests` library to extract paginated data from the API, transforms it into a standardized format with the `transform_data` function, and checks that essential fields are present with the `validate_data` function. For the loading phase, the script connects to the database with `psycopg2`, creates the target table if it doesn't exist, and inserts the validated data. The `main` function orchestrates these steps, including resuming extraction from a saved state. Delays between API requests avoid overwhelming the API, and error handling throughout the ETL process improves robustness and data reliability.
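
As a rough illustration of the flow described above, here is a minimal sketch of the extract, transform, and validate steps. The function names, the record fields, and the assumption that the API paginates via the OData `@odata.nextLink` convention are illustrative only and may differ from the actual script:

```python
import time

import requests

API_URL = "https://ghoapi.azureedge.net/api/Dimension"

def extract_pages(url, delay_seconds=1.0):
    """Yield one page of raw records at a time, pausing between requests."""
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        payload = response.json()
        yield payload.get("value", [])
        # OData-style APIs advertise the next page via "@odata.nextLink";
        # iteration ends when no such link is present.
        url = payload.get("@odata.nextLink")
        if url:
            time.sleep(delay_seconds)  # delay to avoid overwhelming the API

def transform_record(raw):
    """Map one raw API record to the standardized shape used for loading."""
    return {"code": raw.get("Code"), "title": raw.get("Title")}

def validate_record(record):
    """Check that the essential fields are present and non-empty."""
    return bool(record.get("code")) and bool(record.get("title"))
```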

## Table of Contents

- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)

## Introduction

The ETL pipeline script extracts data from an external API, transforms it into a suitable format, validates the data, and loads it into a PostgreSQL database table. The pipeline is designed to handle large datasets and supports resuming the extraction process from the last saved state.
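
A common way to support this kind of resumable extraction is to persist the position of the last successfully processed page between runs. The sketch below assumes a hypothetical JSON state file named `etl_state.json`; the actual script may track its state differently:

```python
import json
import os

STATE_FILE = "etl_state.json"  # hypothetical name, for illustration only

def load_state():
    """Return the previously saved extraction state, or a fresh default."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"next_url": None}

def save_state(next_url):
    """Persist the next page URL so an interrupted run can resume from it."""
    with open(STATE_FILE, "w") as f:
        json.dump({"next_url": next_url}, f)
```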

## Prerequisites

- Python 3.x
- PostgreSQL database
- API access URL
- [dotenv](https://pypi.org/project/python-dotenv/) package (`pip install python-dotenv`)

## Installation

1. Clone the repository:
   - `git clone git@github.com:bongomin/etl-pipeline-application.git`
   - `cd etl-pipeline-application`
2. Install the required packages: `pip install -r requirements.txt`

## Configuration

1. Create a `.env` file in the same directory as the script.
2. Add the following environment variables to the `.env` file:

```ini
API_URL=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_HOST=
```

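These values are read at runtime with `python-dotenv`. Here is a minimal sketch of how the script can load them, open the `psycopg2` connection, and prepare the target table; the table name and column schema are placeholders, not necessarily the script's actual schema:

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST"),
)

with conn, conn.cursor() as cur:
    # Create the table on first run; this schema is illustrative.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS dimensions (
            code  TEXT PRIMARY KEY,
            title TEXT NOT NULL
        )
        """
    )
    # Insert one validated record, skipping duplicates on re-runs.
    cur.execute(
        "INSERT INTO dimensions (code, title) VALUES (%s, %s) "
        "ON CONFLICT (code) DO NOTHING",
        ("COUNTRY", "Country"),
    )
```
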
## Usage

- Run the ETL script: `python gho_etl.py`
- Run the tests: `python test_etl.py`

## Developer
- Bongomin Daniel

## Screenshots

- Running the program *(screenshot omitted)*
- Data saved in the database *(screenshot omitted)*