Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/leftcoastnerdgirl/big_data
This project uses PySpark and SQL to analyze Big Data.
- Host: GitHub
- URL: https://github.com/leftcoastnerdgirl/big_data
- Owner: LeftCoastNerdGirl
- Created: 2024-03-29T19:19:04.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-08-25T20:48:19.000Z (3 months ago)
- Last Synced: 2024-08-25T22:29:03.928Z (3 months ago)
- Topics: jupyter-notebook, pandas-python, pyspark, spark-sql, sparksession, sql, structured-query-language
- Language: Jupyter Notebook
- Homepage: https://extension.berkeley.edu/search/publicCourseSearchDetails.do?method=load&courseId=35106003
- Size: 44.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Using PySpark to analyze large data sets
# Data prep
- Imported the tools needed for the analysis.
- Created a Spark session to enable the work.
- Read the AWS data file and formatted it as a DataFrame.
# Using a temp view
- Created a temp view to improve processing time of the large data set.
- Created 4 SQL queries to answer the following questions:
- What is the average price for a four-bedroom house sold for each year? Round off your answer to two decimal places.
- What is the average price of a home for each year the home was built, that has three bedrooms and three bathrooms? Round off your answer to two decimal places.
- What is the average price of a home for each year the home was built, that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.
- What is the average price of a home per "view" rating, for ratings with an average home price greater than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.

![image](https://github.com/user-attachments/assets/d3226573-d135-4d66-accf-235a2992a87e)
![image](https://github.com/user-attachments/assets/142fec24-1ac2-4640-b5a5-baaf02667679)
# Compare options to decrease run time
Ran the same query three different ways to compare run times.
- Uncached, with a run time calculation added to the fourth query above.
  - Time: 1.2482097148895264 seconds
- Cached the temporary table and verified the cache.
  - Time: 0.529306173324585 seconds
- Partitioned the data.
  - Time: 0.9826006889343262 seconds

![image](https://github.com/user-attachments/assets/54084d54-5db6-4f6b-933e-e9678986d266)
# Conclusion
We can see that caching the temporary table significantly decreased the run time: the cached query ran in less than half the time of the uncached one.
The expectation was that partitioning the data would speed the query up further, but that was not the case here.
Possible reasons:
- Small data set (33,000 rows).
- The data was partitioned on 'date built', but the query was grouping by view rating and filtering on price.

Note that each time the three test queries are run, the run times vary slightly. This analysis is based on the output at the time the notebook was saved and downloaded from Colab; if the queries are run again, the times noted above will differ.