# Table of Contents 🍎

1. [Extract: Raw Data Exploration and Metadata](01_explore_wedge.ipynb)
1. [Transform & Load: Cleaning and Uploading to GBQ](02_to_the_cloud.ipynb)
1. [Owner Queries](03_GBQ_owner_query.ipynb)
1. [Creating SQLite DB and Tables from GBQ Queries](04_building_summary_tables.ipynb)

# Summary
In this data engineering project, I analyzed point-of-sale (POS) data from the Wedge Co-Op in Minneapolis, spanning January 2010 to January 2017. The dataset captures transaction-level details from a member-owned cooperative, with 75% of transactions generated by member-owners, enabling comprehensive shopping pattern analysis.

The purpose of this file is to summarize the Wedge Project and to check the accuracy of the ETL process. The first section of the project extracted the data from zip files and created metadata for exploration. The second section consisted of writing project-specific data cleaning functions, then transforming and loading the data into GBQ. In the third section, text files were created from owner-specific GBQ queries. In the final section, data was downloaded from GBQ using queries and loaded into a SQLite database.

### Task 1

* Files for this task:

`01_explore_wedge.ipynb`:
This notebook handles the exploration, summarization, and initial cleaning of the raw Wedge dataset, using Polars to process the CSV archives efficiently.
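As a rough illustration of the kind of Polars-based exploration this involves (the file name, delimiter, and options below are assumptions, not the project's actual paths or schema):

```python
import polars as pl

# Hypothetical archive name; the real Wedge files vary in naming,
# headers, and delimiter (some are semicolon-delimited).
csv_path = "data/transArchive_201001_201003.csv"

df = pl.read_csv(csv_path, separator=",", infer_schema_length=10_000)

# Metadata for the exploration step: shape, dtypes, null counts, summary stats.
print(df.shape)
print(df.schema)
print(df.null_count())
print(df.describe())
```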


`02_to_the_cloud.ipynb`:
This notebook automates loading, cleaning, and uploading the data to BigQuery, ensuring each file is in the correct format and structure for analysis.
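A minimal sketch of that clean-and-upload step, assuming the `google-cloud-bigquery` client; the project, dataset, table, and column names are placeholders:

```python
import polars as pl
from google.cloud import bigquery

# Placeholder identifiers; substitute the real GBQ project and dataset.
TABLE_ID = "my-gbq-project.wedge_transactions.transArchive_201001_201003"

def clean(df: pl.DataFrame) -> pl.DataFrame:
    """Example cleaning step: coerce numeric columns, turning bad strings into nulls."""
    return df.with_columns(
        pl.col("total").cast(pl.Float64, strict=False),
        pl.col("card_no").cast(pl.Float64, strict=False),
    )

df = clean(pl.read_csv("data/transArchive_201001_201003.csv"))

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")

# load_table_from_dataframe expects pandas, so convert from Polars first.
job = client.load_table_from_dataframe(df.to_pandas(), TABLE_ID, job_config=job_config)
job.result()  # Block until the load job finishes.
print(f"Loaded {client.get_table(TABLE_ID).num_rows} rows into {TABLE_ID}")
```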

### Task 2

* File for this task:

`03_GBQ_owner_query.ipynb`:
This notebook automates querying owner-level transaction data from BigQuery and saving the results as text files for further analysis, with attention to data retrieval, processing, and error handling when working with a large dataset.
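For example, pulling one owner's transactions with a parameterized query and writing them to a text file might look roughly like this (the table name is a placeholder; card 18736 is just the owner number that appears in the comparison table below, and card_no is assumed to be stored as a float, as in the cleaning sketch above):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project, dataset, and table names.
SQL = """
    SELECT *
    FROM `my-gbq-project.wedge_transactions.transactions`
    WHERE card_no = @card_no
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("card_no", "FLOAT64", 18736)]
)

try:
    rows = client.query(SQL, job_config=job_config).to_dataframe()
    rows.to_csv("owner_18736.txt", sep="\t", index=False)
    print(f"Wrote {len(rows)} rows for owner 18736")
except Exception as exc:
    # Error management: log the failure rather than losing the whole run.
    print(f"Query failed: {exc}")
```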


### Task 3

* File for this task:

`04_building_summary_tables.ipynb`:
This notebook downloads summary query results from BigQuery and loads them into a local SQLite database, building the project's summary tables from the GBQ queries.
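A simplified sketch of that pattern, with placeholder GBQ identifiers and one representative summary table (sales by date by hour); the column layout (`datetime`, `total`) is an assumption:

```python
import sqlite3
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project/dataset; aggregate sales by date and hour.
SQL = """
    SELECT DATE(datetime) AS date,
           EXTRACT(HOUR FROM datetime) AS hour,
           SUM(total) AS sales,
           COUNT(*) AS items
    FROM `my-gbq-project.wedge_transactions.transactions`
    GROUP BY date, hour
"""

summary = client.query(SQL).to_dataframe()

# Load the result into a local SQLite database as its own summary table.
conn = sqlite3.connect("wedge_summary.db")
summary.to_sql("sales_by_date_by_hour", conn, if_exists="replace", index=False)
conn.close()
```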

## Query Comparison Results

Assignment: Fill in the following table with the results from the
queries contained in `gbq_assessment_query.sql`. You only
need to fill in relative difference on the rows where it applies.
When calculating relative difference, use the formula
`(your_results - john_results) / john_results`.
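Written out as a small helper, that formula is:

```python
def rel_diff(your_results: float, john_results: float) -> float:
    """Relative difference as defined above."""
    return (your_results - john_results) / john_results
```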

| Query | Your Results | John's Results | Difference | Rel. Diff |
|---|---|---|---|---|
| Total Rows | 85,760,124 | 85,760,139 | -15 | 15 |
| January 2012 Rows | 1,070,907 | 1,070,907 | 0 | 0 |
| October 2012 Rows | 1,029,592 | 1,029,592 | 0 | 0 |
| Month with Fewest | February (2) | Yes | Yes/No | NA |
| Num Rows in Month with Fewest | 6,556,769 | 6,556,770 | -1 | 1 |
| Month with Most | May | Yes | Yes/No | NA |
| Num Rows in Month with Most | 7,578,371 | 7,578,372 | -1 | 1 |
| Null_TS | 485,472 | 7,123,792 | -6,338,320 | -6,338,320 |
| Null_DT | 0 | 0 | 0 | 0 |
| Null_Local | 234,839 | 234,843 | -6 | 6 |
| Null_CN | 0 | 0 | 0 | 0 |
| Num 5 on High Volume Cards | 14,987 | Yes | Yes/No | NA |
| Num Rows for Number 5 | 460,625 | 460,630 | -5 | 5 |
| Num Rows for 18736 | 12,153 | 12,153 | 0 | 0 |
| Product with Most Rows | Banana Organic | Yes | Yes/No | NA |
| Num Rows for that Product | 908,637 | 908,639 | -2 | 2 |
| Product with Fourth-Most Rows | Avocado Hass Organic | Yes | Yes/No | NA |
| Num Rows for that Product | 456,771 | 456,771 | 0 | 0 |
| Num Single Record Products | 2,741 | 2,769 | -28 | 28 |
| Year with Highest Portion of Owner Rows | 2014 | Yes | Yes/No | NA |
| Fraction of Rows from Owners in that Year | 75.91% | 75.91% | 0% | 0% |
| Year with Lowest Portion of Owner Rows | 2011 | Yes | Yes/No | NA |
| Fraction of Rows from Owners in that Year | 73.72% | 73.72% | 0% | 0% |

Note: I have such a large difference in Null_TS because my cleaning converted those strings to `" "` instead of NULL. I have 65,065,888 rows of `" "` in the trans_subtype column; in that context there is a difference of 1,239,440, with John still having the greater Null_TS count.

## Reflections

Overall, the Wedge Project was exciting because it involved working with __my__ first cloud database. The experience gave me confidence in my ability to carry out the ETL process.

The process was messy. Some files were already clean, while others had no column names, had strings in columns with float datatypes, or were delimited with semicolons instead of commas.

I wanted to do each task in its own loop. For example, I tried to clean and upload all the data to GBQ in a single loop (I hope it is fully automatic by the time this is due, as I wanted). But sometimes a chunk of code would run for 15 minutes only to crash, and instead of tweaking the loop to restart where it left off, it was easier to make a 'manual' section where I would select a file by hand, clean it, and upload it. Slowing down saved some money, but I lost time.

After completing the tasks, I had a lot of messy code to clean up, and I found errors in my cleaning, which created more mess. I am confident that any remaining errors are trivial, but I'm still cleaning up and commenting the messy code.