https://github.com/dvarshith/yelp-business-analysis
Big Data analysis on Yelp reviews/businesses for Arizona. Using Hadoop, Spark, PySpark.
arizona-state-university big-data big-data-analytics data-analysis hadoop pyspark spark yelp
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/dvarshith/yelp-business-analysis
- Owner: dvarshith
- License: mit
- Created: 2025-02-12T04:07:58.000Z (9 days ago)
- Default Branch: main
- Last Pushed: 2025-02-12T04:28:22.000Z (9 days ago)
- Last Synced: 2025-02-12T05:24:41.249Z (9 days ago)
- Topics: arizona-state-university, big-data, big-data-analytics, data-analysis, hadoop, pyspark, spark, yelp
- Language: Jupyter Notebook
- Homepage:
- Size: 686 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Yelp Dataset Analysis for Arizona Businesses
[Hadoop](https://hadoop.apache.org/)
[Spark](https://spark.apache.org/)
[Python](https://www.python.org/)
## Overview
This repository contains my **Yelp dataset** analysis project. The goal is to perform:
1. **Business-level analysis** (Milestone 1) – focusing on attributes, ratings, locations, etc.
2. **User-level analysis** (Milestone 2) – focusing on user behavior, sentiment, and user influence.

We use **Apache Hadoop** for distributed file storage and **Apache Spark** (PySpark) for data processing.
The dataset is the [Yelp Academic Dataset](https://www.yelp.com/dataset) filtered to **Arizona** (AZ) businesses.
## Requirements & Setup
### 1. Virtual Machine (Provided by Course)
- A pre-configured VM (Ubuntu 22.04) is available with Hadoop, Spark, and PySpark already installed.
- **Username/Password**: `dps`
- For setup instructions, see `docs/VM-setup.md` or the course instructions for VirtualBox/UTM usage.
### 2. Hadoop & Spark
- **Hadoop**: v3.x
  - Start services:
    ```bash
    hdfs namenode -format # first time only
    start-dfs.sh
    start-yarn.sh
    ```
  - Web UIs:
    - HDFS: [http://localhost:9870](http://localhost:9870)
    - YARN: [http://localhost:8088](http://localhost:8088)
- **Spark**: v3.x
  - Interactive shell:
    ```bash
    spark-shell
    ```
  - PySpark:
    ```bash
    pyspark
    ```
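
Once the daemons are started, one quick way to confirm that the two web UIs listed above are reachable is to probe their ports. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` and the one-second timeout are illustrative assumptions, and the ports are the ones given in this README.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Ports taken from the Web UIs list in this README.
SERVICES = {"HDFS NameNode UI": 9870, "YARN ResourceManager UI": 8088}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {status}")
```

A reachable port only shows that something is listening there; it does not prove the service is healthy, so the web UIs remain the better place to check cluster state.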
### 3. Python / PySpark
- **Python** 3.8+ recommended
- `pyspark` installed in the VM
- Optionally Jupyter Notebook for an interactive environment:
```bash
jupyter notebook
# or: pyspark
```
## Dataset
### Yelp Academic Dataset:
- [Official Link](https://www.yelp.com/dataset) – not included in this repo due to size.
- We filter data for Arizona (`state = 'AZ'`).
- `business_id` & `user_id` are common across the `business`, `review`, `user`, `checkin`, `tip` JSON files.
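
To make the role of the shared keys concrete, here is a small plain-Python illustration (not Spark, and not the project's code) of how filtering to `state = 'AZ'` and then following `business_id` links reviews and their authors back to Arizona businesses. The sample records and helper names are invented for the example; the field names `business_id`, `user_id`, and `state` come from the text above, and the one-JSON-object-per-line layout matches the Yelp files.

```python
import json

# Invented sample records in JSON Lines form (one JSON object per line).
BUSINESS_LINES = """\
{"business_id": "b1", "name": "Cactus Cafe", "state": "AZ"}
{"business_id": "b2", "name": "Harbor Grill", "state": "CA"}
"""
REVIEW_LINES = """\
{"review_id": "r1", "user_id": "u1", "business_id": "b1", "stars": 5}
{"review_id": "r2", "user_id": "u2", "business_id": "b2", "stars": 3}
{"review_id": "r3", "user_id": "u2", "business_id": "b1", "stars": 4}
"""


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def filter_by_state(businesses, state="AZ"):
    """Keep only businesses whose `state` field equals the given code."""
    return [b for b in businesses if b.get("state") == state]


def reviews_for(businesses, reviews):
    """Keep only reviews whose `business_id` refers to one of the businesses."""
    ids = {b["business_id"] for b in businesses}
    return [r for r in reviews if r["business_id"] in ids]


az_businesses = filter_by_state(parse_json_lines(BUSINESS_LINES))
az_reviews = reviews_for(az_businesses, parse_json_lines(REVIEW_LINES))
az_user_ids = sorted({r["user_id"] for r in az_reviews})
print(az_user_ids)  # ['u1', 'u2']
```

The same two steps (filter on `state`, then match on `business_id`) are what a Spark filter followed by a join would express at full dataset scale.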
### Data Files
- `yelp_academic_dataset_business.json`
- `yelp_academic_dataset_user.json`
- `yelp_academic_dataset_review.json`
- `yelp_academic_dataset_checkin.json`
- `yelp_academic_dataset_tip.json`
## How to Run
- Clone Repo:
```bash
git clone https://github.com//yelp-arizona-business-analysis.git
cd yelp-arizona-business-analysis
```
- **Launch VM** (if using course-provided OVA/UTM).
- Start Hadoop & Spark (in the VM):
```bash
hdfs namenode -format # first time only
start-dfs.sh
start-yarn.sh
pyspark
```
- Open the Notebook:
```bash
cd Milestone1-Business
jupyter notebook Project1Milestone1.ipynb
```
- Run cells to see queries & analysis. Similarly for `Milestone2-User`.
## Milestone 1: Business-Level Analysis
- **Objective**: Analyze AZ businesses, focusing on attributes, ratings, categories, and location patterns.
- **Approach**:
  - Convert JSON to Parquet or another suitable Spark format.
  - Filter to `state = 'AZ'`.
  - Perform SQL-like queries in Spark (e.g., `spark.sql("SELECT ... FROM ... WHERE ...")`).
  - Generate graphs & insights.
- **Queries**:
  - At least 5 in total, of which at least 3 are complex queries combining multiple datasets.
  - Example: “Top 10 highest-rated businesses in the ‘Restaurants’ category within Phoenix.”
## Milestone 2: User-Level Analysis
- **Objective**: Analyze user behavior (reviews, tips, sentiment, influence).
- **Approach**:
  - Focus on users who reviewed the AZ businesses from Milestone 1.
  - Possibly do sentiment analysis on `review.txt` or `tip.txt`.
  - Check user attributes (average stars, friend count, compliment counts).
- **Queries**:
  - 10 total, 6 of which combine multiple datasets.
  - Example: “Which users have the most influence (highest fans or compliment counts) in a specific business category?”
## Results
- Data filtering & approach
- Spark queries (with code snippets, no direct copy from provided notebooks)
- Graphs & figures
- Key insights from business-level and user-level analysis
## Acknowledgments
- Dataset, test cases, etc. provided by Dr. Samira Ghayekhloo from Arizona State University.
## License
This project is released under the `MIT License`. That means you’re free to use, modify, and distribute the code, but you do so at your own risk.
## Contact
- Author: Varshith Dupati
- GitHub: @dvarshith
- Email: [email protected]
- Issues: Please open an issue on this repo if you have questions or find bugs.