https://github.com/tilakapash/retail_data_pipeline
Datacamp Project consisting of Building a Retail Data Pipeline
- Host: GitHub
- URL: https://github.com/tilakapash/retail_data_pipeline
- Owner: tilakapash
- Created: 2025-02-13T13:59:53.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-13T14:06:53.000Z (4 months ago)
- Last Synced: 2025-02-13T15:24:55.256Z (4 months ago)
- Language: Python
- Size: 6.84 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# retail_data_pipeline
## Datacamp Project consisting of Building a Retail Data Pipeline
### Project Description
Mastering data pipelines is an essential skill for today's Data Engineers. A pipeline extracts, transforms, and loads data, a fundamental task that keeps information flowing smoothly.
In this project, you will work with retail data from the multinational retail corporation Walmart. You will retrieve data from different sources, such as a SQL database and a Parquet file; prepare the data using several transformation techniques; and finally load it in an easy-to-access format.
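As a sketch of the extraction step, the snippet below uses an in-memory SQLite table to stand in for the project's PostgreSQL `grocery_sales` source, and a plain DataFrame to stand in for the Parquet data (in the real project you would load it with `pd.read_parquet(path)`). The column names here are illustrative assumptions, not the actual schema:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the PostgreSQL source.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "index": [0, 1],
    "Store_ID": [1, 2],
    "Date": ["2012-02-10", "2012-02-17"],
    "Weekly_Sales": [24924.50, 46039.49],
}).to_sql("grocery_sales", conn, index=False)

def extract(conn, extra_df):
    """Read the grocery_sales table and merge it with the extra data."""
    store_data = pd.read_sql("SELECT * FROM grocery_sales", conn)
    # In the real project the extra data lives in a Parquet file and
    # would be loaded with pd.read_parquet(path) instead.
    return store_data.merge(extra_df, on="index")

# Stand-in for the complementary Parquet data.
extra_df = pd.DataFrame({
    "index": [0, 1],
    "CPI": [211.10, 211.20],
    "Unemployment": [8.1, 8.1],
})
merged_df = extract(conn, extra_df)
```

Merging on a shared `"index"` column is one simple way to line the two sources up row by row; the real project's join key may differ.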
### Project Instructions
Build a data pipeline using custom functions to extract, transform, aggregate, and load e-commerce data. The SQL query for grocery_sales and the extract() function have already been implemented for you.
To start the project, run the first two cells, then proceed with the following steps:
1. Implement a function named transform() that takes merged_df as its single argument. It should fill missing numerical values (using any method of your choice), add a "Month" column, keep only the rows where weekly sales exceed $10,000, and drop the unnecessary columns. It should return a DataFrame, stored in the clean_data variable.
2. Implement the function avg_weekly_sales_per_month() with one argument (the cleaned data) to calculate the average monthly sales. Select only the "Month" and "Weekly_Sales" columns, as they are the only ones needed for this analysis. Then build a chained operation: group by the "Month" column with groupby(), calculate the average with agg(), call reset_index() to restore a flat index, and finally round() the results to two decimal places.
3. Create a function called load() that takes the cleaned and aggregated DataFrames along with their target paths, and saves them as clean_data.csv and agg_data.csv respectively, without the index.
4. Lastly, define a validation() function that checks whether the two CSV files written by load() exist in the current working directory.
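Steps 1 and 2 above might be sketched as follows. This is a minimal illustration, not the project's solution: it assumes a "Date" column exists, uses forward-fill for imputation, and treats "Date" as the only unnecessary column to drop.

```python
import pandas as pd

def transform(merged_df):
    """Clean the merged data (a sketch; column names are assumptions)."""
    df = merged_df.copy()
    # Fill missing numerical values; forward-fill is one simple choice.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].ffill()
    # Add a "Month" column derived from the assumed "Date" column.
    df["Month"] = pd.to_datetime(df["Date"]).dt.month
    # Keep only rows with weekly sales over $10,000.
    df = df[df["Weekly_Sales"] > 10000]
    # Drop columns not needed downstream ("Date" stands in here for
    # whatever columns the project deems unnecessary).
    return df.drop(columns=["Date"])

def avg_weekly_sales_per_month(clean_data):
    """Average weekly sales per month, rounded to two decimals."""
    return (clean_data[["Month", "Weekly_Sales"]]
            .groupby("Month")
            .agg("mean")
            .reset_index()
            .round(2))

# Tiny demo frame standing in for the real merged data.
merged_df = pd.DataFrame({
    "Date": ["2012-01-06", "2012-01-13", "2012-02-03"],
    "Weekly_Sales": [24924.50, None, 9000.00],
    "CPI": [211.10, 211.20, 211.30],
})
clean_data = transform(merged_df)
agg_data = avg_weekly_sales_per_month(clean_data)
```

Note how the missing Weekly_Sales value is forward-filled before the $10,000 filter runs, so the imputed row is kept while the $9,000 row is dropped.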
### How to Approach the Project
#### 1. Extracting the data
Extract the data from PostgreSQL and the Parquet file.
#### 2. Transforming the data
Perform imputation, filtering, and cleaning.
#### 3. Preliminary analysis of the sales data
After cleaning the data, conduct a preliminary analysis.
#### 4. Loading and validating the data
The final step is to store your transformed data and validate that it was stored correctly.
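The loading and validation steps might be sketched like this. The file names follow the project instructions; the demo DataFrames are stand-ins for the real cleaned and aggregated data:

```python
import os
import pandas as pd

def load(clean_data, clean_path, agg_data, agg_path):
    """Save both DataFrames as CSV files without the index."""
    clean_data.to_csv(clean_path, index=False)
    agg_data.to_csv(agg_path, index=False)

def validation(file_path):
    """Check that a file produced by load() exists on disk."""
    return os.path.exists(file_path)

# Demo frames standing in for the real pipeline output.
clean_data = pd.DataFrame({"Month": [1, 2], "Weekly_Sales": [24924.50, 46039.49]})
agg_data = pd.DataFrame({"Month": [1, 2], "Weekly_Sales": [24924.50, 46039.49]})
load(clean_data, "clean_data.csv", agg_data, "agg_data.csv")
```

Writing with `index=False` matters here: otherwise pandas prepends an unnamed index column that reappears on every subsequent read.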