Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/leftcoastnerdgirl/big_data
This project uses PySpark and SQL to analyze Big Data.
- Host: GitHub
- URL: https://github.com/leftcoastnerdgirl/big_data
- Owner: LeftCoastNerdGirl
- Created: 2024-03-29T19:19:04.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-08-25T20:48:19.000Z (3 months ago)
- Last Synced: 2024-08-25T22:29:03.928Z (3 months ago)
- Topics: jupyter-notebook, pandas-python, pyspark, spark-sql, sparksession, sql, structured-query-language
- Language: Jupyter Notebook
- Homepage: https://extension.berkeley.edu/search/publicCourseSearchDetails.do?method=load&courseId=35106003
- Size: 44.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Using PySpark to analyze large data sets
# Data prep
- Imported the tools needed for the analysis.
- Created a Spark session to enable the work.
- Read the AWS data file and formatted it as a DataFrame.
# Using a temp view
- Created a temp view to improve processing time of the large data set.
- Created 4 SQL queries to answer the following questions:
- What is the average price for a four-bedroom house sold for each year? Round off your answer to two decimal places.
- What is the average price of a home for each year the home was built, that has three bedrooms and three bathrooms? Round off your answer to two decimal places.
- What is the average price of a home for each year the home was built, that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.
- What is the average price of a home per "view" rating, for ratings with an average home price greater than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.

![image](https://github.com/user-attachments/assets/d3226573-d135-4d66-accf-235a2992a87e)
![image](https://github.com/user-attachments/assets/142fec24-1ac2-4640-b5a5-baaf02667679)
# Compare options to decrease run time
Ran the same query three different ways to compare run times.
- Uncached, with a run time calculation added to the fourth query above.
  - Time: 1.2482097148895264 seconds
- Cached the temporary table and verified the cache.
  - Time: 0.529306173324585 seconds
- Partitioned the data.
  - Time: 0.9826006889343262 seconds

![image](https://github.com/user-attachments/assets/54084d54-5db6-4f6b-933e-e9678986d266)
# Conclusion
We can see that caching the temporary table significantly decreased the run time: the cached query ran in less than half the time of the uncached one.
The expectation was that partitioning the data would speed the query up further, but that was not the case here.
Possible reasons:
- Small data set (33,000 rows).
- The data was partitioned on 'date built', but the query was grouping by view rating and filtering on price.

Note that each time the three test queries are run, the run times vary slightly. This analysis is based on the output at the time the notebook was saved and downloaded from Colab; if the queries are run again, the times noted above will differ.