https://github.com/patrickdavies100/pipeline38
An application to automate the creation and execution of SQL queries.
data pandas-dataframe pipeline postgresql psycopg2 sqlalchemy
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/patrickdavies100/pipeline38
- Owner: PatrickDavies100
- License: gpl-3.0
- Created: 2024-10-23T10:44:14.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-31T20:21:21.000Z (over 1 year ago)
- Last Synced: 2025-02-09T08:34:52.697Z (over 1 year ago)
- Topics: data, pandas-dataframe, pipeline, postgresql, psycopg2, sqlalchemy
- Language: Python
- Homepage:
- Size: 39.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Pipeline 38
This is a follow-up to Pipeline 37. The aim is to create a serialisation format for a data pipeline. There is one key change from Pipeline 37:
Data will be manipulated using PostgreSQL commands rather than in a pandas DataFrame.
**Technologies used:**
PostgreSQL 17.0,
Python 3.13.0,
Pandas,
SQLAlchemy 2.0.36,
psycopg2 2.9.10,
pgAdmin 4,
PyCharm
**Objectives:**
1. Create tools for automated data processing, including cleaning and transformation.
2. Generate a working serialisation format of a pipeline.
3. Improve performance on large datasets by using PostgreSQL queries.
**Goal:**
Improve my workflow for large datasets so that I can produce useful analyses for Tableau.
**Architecture:**
The project has a few simple elements. A connection to a PostgreSQL database is configured through LocalSettings (this file is not on GitHub). The user enters commands; the arguments are passed to the relevant function in SQLFunctions, which constructs the query and passes it back to 'Connection' to be executed. Commands can both change the data being examined and create new tables. Every time a command executes successfully, a row is added to a DataFrame called Query DF, which records the completed instructions.
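The command flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `quote_ident`, `build_drop_nulls`, `build_copy_table`, `COMMANDS`, and `run_command` are hypothetical, the `execute` callable stands in for 'Connection', and a plain list of dicts stands in for the Query DF.

```python
from typing import Callable


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_drop_nulls(table: str, column: str) -> str:
    """Hypothetical SQLFunctions-style builder: delete rows with a NULL column."""
    return f"DELETE FROM {quote_ident(table)} WHERE {quote_ident(column)} IS NULL;"


def build_copy_table(new_table: str, source: str) -> str:
    """Hypothetical builder: create a new table from an existing one."""
    return (
        f"CREATE TABLE {quote_ident(new_table)} "
        f"AS SELECT * FROM {quote_ident(source)};"
    )


# Maps a user command name to the function that builds its query (assumed names).
COMMANDS: dict[str, Callable[..., str]] = {
    "drop_nulls": build_drop_nulls,
    "copy_table": build_copy_table,
}


def run_command(
    name: str,
    args: list[str],
    execute: Callable[[str], None],
    query_log: list[dict],
) -> str:
    """Build the query, execute it, and log it only if execution succeeded."""
    query = COMMANDS[name](*args)
    execute(query)  # an exception here skips the logging step below
    query_log.append({"command": name, "args": list(args), "query": query})
    return query


# Usage: a list's append method stands in for a real database connection.
executed: list[str] = []
query_log: list[dict] = []
run_command("drop_nulls", ["sales", "price"], executed.append, query_log)
```

Logging after execution, rather than before, keeps failed commands out of the record, which matches the "successfully executed" condition described above.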
The Query DF is a record of the data processing. It can be saved, loaded, or exported so that the user can automate the same steps for another file.
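One way to picture saving, loading, and replaying that record is the sketch below. It assumes the record is stored as JSON with one entry per step; the file layout, the `version` field, and the function names `save_pipeline`, `load_pipeline`, and `replay` are illustrative assumptions, not the project's actual format.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def save_pipeline(steps: list[dict], path: Path) -> None:
    """Write the recorded steps to a JSON file (assumed layout)."""
    payload = {"version": 1, "steps": steps}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_pipeline(path: Path) -> list[dict]:
    """Read the recorded steps back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))["steps"]


def replay(steps: list[dict], run: Callable[[str, list], None]) -> int:
    """Re-run each recorded step in order; returns the number of steps run."""
    for step in steps:
        run(step["command"], step["args"])
    return len(steps)


# Usage: round-trip two hypothetical steps through a temporary file.
steps = [
    {"command": "drop_nulls", "args": ["sales", "price"]},
    {"command": "copy_table", "args": ["sales_clean", "sales"]},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.json"
    save_pipeline(steps, path)
    loaded = load_pipeline(path)

# A stand-in runner that only collects the steps instead of touching a database.
replayed: list[tuple] = []
count = replay(loaded, lambda command, args: replayed.append((command, args)))
```

Storing the command name and arguments, rather than only the final SQL text, would let the same steps be rebuilt against a different table or file.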
A second DataFrame (Derived DF) stores the results of user commands, i.e. derived values that are not added to the original dataset. This lets the user build a table of derived data and perform operations on it directly.
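As a rough illustration of the derived-data idea, the sketch below keeps derived values in a separate collection without modifying the source table. The function `record_derived`, the record fields, and the `fetch_value` callable (standing in for a query that returns a single value) are all assumptions; a list of dicts is used in place of the Derived DF.

```python
from typing import Any, Callable


def record_derived(
    label: str,
    query: str,
    fetch_value: Callable[[str], Any],
    derived: list[dict],
) -> Any:
    """Run a read-only query and store its result alongside a label."""
    value = fetch_value(query)
    derived.append({"label": label, "query": query, "value": value})
    return value


# Usage: a fake fetcher returns a fixed number instead of querying PostgreSQL.
derived_rows: list[dict] = []
mean_query = 'SELECT AVG("price") FROM "sales";'
record_derived("mean_price", mean_query, lambda q: 12.5, derived_rows)
```

A list of records like this can be passed to `pandas.DataFrame(derived_rows)` to obtain a table with one row per derived value.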