https://github.com/fbarffmann/home_sales
Analyzed 25,000+ home sales using PySpark and SparkSQL. Identified pricing trends by year built, home features, and view rating. Optimized query run-time by 70% using caching.
aws big-data data-analysis home-sales parquet pyspark python spark spark-sql sql
- Host: GitHub
- URL: https://github.com/fbarffmann/home_sales
- Owner: fbarffmann
- Created: 2024-09-25T21:37:33.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-04-13T17:38:08.000Z (11 days ago)
- Last Synced: 2025-04-13T17:39:27.321Z (11 days ago)
- Topics: aws, big-data, data-analysis, home-sales, parquet, pyspark, python, spark, spark-sql, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 2.48 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Home Sales Analysis with PySpark
Built a scalable data analysis pipeline using PySpark to explore pricing trends in home sales across King County, Washington. Leveraged SparkSQL for querying and partitioned the dataset to optimize performance on large-scale data.
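The loading and partitioning step could be sketched roughly as follows. This is a minimal illustration, not necessarily how the notebook does it: the function names are hypothetical, an existing `SparkSession` is assumed to be passed in, and `date_built` is an assumed name for the year-built column.

```python
def load_home_sales(spark, csv_path="Resources/home_sales.csv", view_name="home_sales"):
    """Read the raw CSV and expose it to SparkSQL as a temporary view."""
    # header/inferSchema let Spark pick up column names and numeric types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.createOrReplaceTempView(view_name)
    return df


def write_partitioned(df, out_dir="home_sales_partitioned", partition_col="date_built"):
    """Write the data as Parquet, one subdirectory per value of partition_col."""
    # "date_built" is an assumed column name for the year a home was built.
    df.write.mode("overwrite").partitionBy(partition_col).parquet(out_dir)
    return out_dir
```

With a session from `SparkSession.builder.appName("home_sales").getOrCreate()`, calling `write_partitioned(load_home_sales(spark))` would produce a `home_sales_partitioned/` directory that `spark.read.parquet("home_sales_partitioned")` can read back.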
## Tools & Technologies Used
- Python
- PySpark
- SparkSQL
- Parquet File Partitioning
- AWS S3 (Data Source)
- Jupyter Notebooks
## File Structure
```text
.
├── Home_Sales.ipynb # PySpark analysis notebook
├── home_sales_partitioned/ # Partitioned parquet files by year built
└── Resources/
    └── home_sales.csv # Raw home sales dataset
```
## Skills Demonstrated
- Distributed data processing with PySpark
- SQL querying within Spark
- Data partitioning and caching for optimized performance
- Handling large real-world datasets
- Identifying pricing trends from structured data
## Key Findings
- Analyzed over 25,000 home sales in King County, WA.
- The yearly average sale price of 4-bedroom homes ranged from $300,263 to $306,910.
- Homes with 3 beds, 3 baths, 2 floors, and 2,000+ sqft averaged over $600,000 after 2015.
- Homes with a view rating of 4 or higher had an average sale price exceeding $350,000.
- Partitioning data by year built improved query performance by over 70%.
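
How the run-time improvement was measured is not shown here. One way to check such a figure is sketched below, under these assumptions: a temporary view named `home_sales` is already registered, the columns are called `view` and `price`, and the helper names are hypothetical. The sketch times a SparkSQL query before and after caching the view. The same helper could time the query against a view built from the partitioned Parquet files.

```python
import time

# Average price per view rating for ratings of 4 or higher.
# The view name and the `view` / `price` column names are assumptions.
VIEW_QUERY = """
SELECT `view`, ROUND(AVG(price), 2) AS avg_price
FROM home_sales
WHERE `view` >= 4
GROUP BY `view`
ORDER BY `view`
"""


def time_query(spark, query):
    """Run a SparkSQL query to completion and return (rows, seconds)."""
    start = time.perf_counter()
    rows = spark.sql(query).collect()  # collect() forces execution
    return rows, time.perf_counter() - start


def compare_cached(spark, query, table="home_sales"):
    """Return (uncached_seconds, cached_seconds) for the same query."""
    _, uncached = time_query(spark, query)
    spark.catalog.cacheTable(table)  # lazy: only marks the view for caching
    time_query(spark, query)         # this run materializes the cache
    _, cached = time_query(spark, query)
    return uncached, cached


def percent_faster(before, after):
    """Run-time reduction as a percentage of the original run time."""
    return 100.0 * (before - after) / before
```

A call such as `percent_faster(*compare_cached(spark, VIEW_QUERY))` would then report the reduction for one run. Timings on a small dataset vary from run to run, so averaging several runs gives a more reliable number.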