https://github.com/guochenmeinian/csci-ga.2437
Data Analytics Project using Big Data Tech Stacks
apache-spark apache-zeppelin
- Host: GitHub
- URL: https://github.com/guochenmeinian/csci-ga.2437
- Owner: guochenmeinian
- Created: 2024-11-24T08:06:59.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T12:35:14.000Z (over 1 year ago)
- Last Synced: 2025-03-31T05:28:39.911Z (about 1 year ago)
- Topics: apache-spark, apache-zeppelin
- Language: Jupyter Notebook
- Homepage:
- Size: 5.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Real Estate Data Analytics
This repository contains my project work for the [CSCI-GA.2437](https://cs.nyu.edu/courses/fall24/CSCI-GA.2437-001) **Big Data Application Development** course at NYU.
## Project Overview
This project analyzes a large NYC real estate dataset using Apache Spark, Scala, and a Zeppelin notebook. The goal is to clean and transform the data, cluster properties into segments, and analyze spatial and temporal patterns to uncover market trends and neighborhood dynamics.

## Tech Stack
- HDFS
- Apache Spark (MapReduce)
- Apache Zeppelin
- Scala
## Structure
- `data/`: small samples of the dataset for preview
- `figures/`: visualizations created with Apache Zeppelin / Python Pandas
- `code.ipynb`: Zeppelin notebook for data exploration and modeling
- `slides.pdf`: our presentation slides
- `report.pdf`: our final report
## Analytics
- **Utilized Apache Spark, Scala, and Zeppelin Notebook** to process and analyze a large NYC real estate dataset, implementing scalable workflows for data cleaning and transformation.
- Built a **KMeans clustering** pipeline with dimensionality reduction to segment properties into distinct clusters.
- Calculated growth rates and analyzed spatial and temporal patterns to uncover insights into seasonal market behaviors and neighborhood dynamics.
- Visualized results with **Zeppelin** and **Pandas/Matplotlib**: bar charts for sale prices, scatter plots for clustering results, and pie charts for transaction volumes to highlight trends.
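
The clustering step described above can be sketched with Spark ML roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`sale_price`, `gross_sqft`, `year_built`), the toy rows, the choice of PCA as the dimensionality-reduction step, and the values of `k` are all assumptions made for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

// In Zeppelin, getOrCreate() reuses the notebook's existing session;
// the local master only applies when run standalone.
val spark = SparkSession.builder()
  .appName("kmeans-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy rows standing in for cleaned sales records (hypothetical columns and values).
val sales = Seq(
  (450000.0, 900.0, 1995.0),
  (470000.0, 950.0, 1998.0),
  (1200000.0, 2100.0, 2010.0),
  (1250000.0, 2200.0, 2012.0),
  (300000.0, 600.0, 1960.0),
  (310000.0, 650.0, 1965.0)
).toDF("sale_price", "gross_sqft", "year_built")

// Combine the numeric columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("sale_price", "gross_sqft", "year_built"))
  .setOutputCol("raw_features")

// Standardize so that large-valued columns do not dominate distances.
val scaler = new StandardScaler()
  .setInputCol("raw_features")
  .setOutputCol("scaled_features")
  .setWithMean(true)
  .setWithStd(true)

// Reduce to two principal components (assumed technique; the README only
// says "dimensionality reduction").
val pca = new PCA()
  .setInputCol("scaled_features")
  .setOutputCol("pca_features")
  .setK(2)

// Cluster in the reduced space; k = 3 is an arbitrary choice for the toy data.
val kmeans = new KMeans()
  .setFeaturesCol("pca_features")
  .setPredictionCol("cluster")
  .setK(3)
  .setSeed(42L)

val model = new Pipeline()
  .setStages(Array(assembler, scaler, pca, kmeans))
  .fit(sales)

val clustered = model.transform(sales)
clustered.select("sale_price", "gross_sqft", "cluster").show()
```

Keeping the two-dimensional `pca_features` column alongside `cluster` is convenient for the kind of scatter plot mentioned above, since each point can be drawn at its two component values and colored by cluster.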
## Results
Below are some selected results showcasing the figures generated during the analysis.
### Bar Chart: Sale Prices by Category

### Bar Chart: Sale Prices by Neighborhood

### Growth Analysis: Temporal Patterns
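
As a rough illustration of the growth-rate calculation mentioned under Analytics, the sketch below computes period-over-period growth as `(current - previous) / previous` in plain Scala. The monthly figures and the names `medianPriceByMonth` and `growthRates` are hypothetical, and the project itself may define growth differently (for example year over year, or on transaction counts).

```scala
// Hypothetical monthly median sale prices keyed by "yyyy-MM" (made-up values).
val medianPriceByMonth: Seq[(String, Double)] = Seq(
  "2023-01" -> 500000.0,
  "2023-02" -> 525000.0,
  "2023-03" -> 504000.0
)

// Period-over-period growth: (current - previous) / previous.
// Pairs whose previous value is zero are skipped to avoid division by zero.
def growthRates(series: Seq[(String, Double)]): List[(String, Double)] =
  series.sortBy(_._1).sliding(2).collect {
    case Seq((_, prev), (month, curr)) if prev != 0.0 =>
      month -> (curr - prev) / prev
  }.toList

val rates = growthRates(medianPriceByMonth)
rates.foreach { case (month, r) => println(f"$month: ${r * 100}%.1f%%") }
```

With these toy values, February shows 5% growth over January and March shows a 4% decline from February. On the full dataset the same idea would more likely be expressed with a Spark window function over a month column than with local collections.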


### Scatter Plot: KMeans Clustering Results

