https://github.com/vbalalian/three-gits
Group analytics project for a predictive analytics course. Using the Yelp open dataset to predict restaurant success.
bigquery dbt predictive-analytics python regression sentiment-analysis sklearn sql vader-sentiment-analysis yelp-dataset
- Host: GitHub
- URL: https://github.com/vbalalian/three-gits
- Owner: vbalalian
- Created: 2025-11-19T20:06:54.000Z
- Default Branch: main
- Last Pushed: 2025-12-04T23:04:05.000Z
- Last Synced: 2025-12-08T06:56:14.741Z
- Language: Jupyter Notebook
- Homepage:
- Size: 1.63 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Group Analytics Project
**Course:** BANA 320 - Predictive Analytics
**Dataset:** [Yelp Open Dataset](https://business.yelp.com/data/resources/open-dataset/)
**Group Name:** Three Gits
**Team:** Vincent Balalian, Sameer Patel, Arish Patel
[CD workflow](https://github.com/vbalalian/three-gits/actions/workflows/cd.yml)
## Overview
**Problem Statement:** Does adding sentiment features from the first 90 days of reviews significantly improve the accuracy of predicting a restaurant's 1-year Yelp rating, compared to using non-text features alone?
**Target Variable:** Average star rating 12 months after first review
**Result:** No - sentiment features did not significantly improve prediction accuracy when geographic and user-based context features were already included.
## Data Pipeline
Raw Yelp JSON data (hosted in Google Cloud Storage) is transformed via **dbt + BigQuery**:
1. **Filtering:** Business dataset filtered to restaurants only
2. **Qualification criteria:**
   - Minimum 1 year of review history
   - At least 3 reviews in first 90 days
   - At least 10 reviews in first year
3. **Feature aggregation:** Check-ins, user metrics, and zip-code comparisons aggregated to restaurant level
4. **Time windowing:** Separate datasets for 90-day (features) and 365-day (target) review periods
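The qualification criteria and time windows above can be sketched in plain Python. This is an illustration only, not the project's dbt SQL: the function name, its inputs, the `snapshot_date` parameter, and the choice of half-open windows are all assumptions, as is the reading of "1 year of review history" as "the dataset extends at least 365 days past the first review".

```python
from datetime import date, timedelta

# Thresholds taken from the qualification criteria listed above.
MIN_REVIEWS_90D = 3
MIN_REVIEWS_365D = 10


def qualifies(review_dates, snapshot_date):
    """Return True if a restaurant meets all three qualification criteria.

    review_dates: list of datetime.date, one per review (hypothetical input).
    snapshot_date: last date covered by the dataset (hypothetical input).
    """
    if not review_dates:
        return False
    first = min(review_dates)
    end_90 = first + timedelta(days=90)    # end of the feature window
    end_365 = first + timedelta(days=365)  # end of the target window

    # Criterion 1 (assumed reading): a full year after the first review is observable.
    if snapshot_date < end_365:
        return False

    # Criteria 2 and 3: review counts inside each half-open window.
    n_90 = sum(1 for d in review_dates if d < end_90)
    n_365 = sum(1 for d in review_dates if d < end_365)
    return n_90 >= MIN_REVIEWS_90D and n_365 >= MIN_REVIEWS_365D


# Made-up review histories for illustration.
start = date(2019, 1, 1)
busy = [start + timedelta(days=10 * i) for i in range(12)]  # 12 reviews over 110 days
quiet = [start, start + timedelta(days=200)]                # only 2 reviews

print(qualifies(busy, date(2021, 1, 1)))   # True
print(qualifies(quiet, date(2021, 1, 1)))  # False
```

The same two window boundaries (`end_90`, `end_365`) would also separate feature reviews from target reviews in step 4.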
## Feature Categories
| Category | Description | Count |
|----------|-------------|-------|
| Early Review Metrics | Review count & average rating (first 90 days) | 2 |
| Check-in Patterns | Day-of-week and time-of-day distributions | 11 |
| User Characteristics | Reviewer reputation, engagement, experience | 10 |
| Zip Code Context | Restaurant metrics vs. local averages | 4 |
| **Sentiment (VADER)** | Positive, negative, neutral, compound scores | 4 |
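As a rough sketch of how four sentiment features per restaurant might be produced: VADER's `polarity_scores` returns a dict with the keys `pos`, `neg`, `neu`, and `compound` for each text, so one option is to average each score over a restaurant's first-90-day reviews. The README does not state the aggregation used, so the mean, the function name, and the `avg_` feature names below are assumptions. The scores in the example are made up, not real VADER output.

```python
from statistics import mean

# Keys present in the dict returned by VADER's polarity_scores().
VADER_KEYS = ("pos", "neg", "neu", "compound")


def aggregate_sentiment(review_scores):
    """Average per-review VADER scores into four restaurant-level features.

    review_scores: list of dicts, one per review, each holding the four VADER keys.
    Averaging is an assumed aggregation, not confirmed by the project.
    """
    if not review_scores:
        raise ValueError("need at least one scored review")
    return {f"avg_{k}": mean(s[k] for s in review_scores) for k in VADER_KEYS}


# Illustrative (made-up) scores for two early reviews of one restaurant.
scores = [
    {"pos": 0.5, "neg": 0.0, "neu": 0.5, "compound": 0.75},
    {"pos": 0.25, "neg": 0.25, "neu": 0.5, "compound": 0.25},
]
features = aggregate_sentiment(scores)
print(features)
```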
## Methodology
**Phase 1:** Train models using all features *except* sentiment
**Phase 2:** Add sentiment features derived from VADER analysis of first-90-day reviews
**Comparison:** Paired one-tailed t-test on absolute prediction errors (α = 0.05)
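The comparison step can be illustrated with a standard-library sketch of a paired one-tailed t-test. The direction of the alternative (Phase 2 errors are smaller than Phase 1 errors) is an assumption inferred from the problem statement, and the error values below are made up. In practice `scipy.stats.ttest_rel` with its `alternative` argument does the same job; the hand-rolled tail probability here only keeps the example dependency-free.

```python
import math
from statistics import mean, stdev


def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t) for Student's t, via Simpson's rule."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3   # integral of the pdf from 0 to |t|
    upper = 0.5 - area     # P(T > |t|)
    return upper if t >= 0 else 1 - upper


def paired_one_tailed_t(errors_base, errors_sent):
    """Test H1: mean(errors_base - errors_sent) > 0 (sentiment lowers error).

    Returns (t statistic, one-tailed p-value). Assumes the differences are not
    all identical, since a zero standard deviation makes t undefined.
    """
    diffs = [b - s for b, s in zip(errors_base, errors_sent)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, t_sf(t_stat, n - 1)


# Made-up absolute errors for six restaurants, without and with sentiment.
errors_base = [0.50, 0.40, 0.60, 0.30, 0.55, 0.45]
errors_sent = [0.49, 0.41, 0.60, 0.29, 0.56, 0.44]

t_stat, p_value = paired_one_tailed_t(errors_base, errors_sent)
ALPHA = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {p_value < ALPHA}")
```

Pairing matters here because both phases predict the same restaurants, so the test works on per-restaurant error differences rather than on two independent samples.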
### Models Tested
- Support Vector Regression (linear kernel)
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor
- Stacked Ensemble (average of all four)
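The ensemble in the last bullet is described as an average of the four models' predictions. A minimal sketch of that equal-weight average follows; the function name, dict keys, and prediction values are hypothetical. Note that this differs from meta-learner stacking (for example scikit-learn's `StackingRegressor`), which fits a second model on top of the base predictions.

```python
def average_ensemble(predictions_by_model):
    """Average per-restaurant predictions across models with equal weights.

    predictions_by_model: dict mapping a model name to a list of predictions,
    all in the same restaurant order (hypothetical input format).
    """
    models = list(predictions_by_model.values())
    n = len(models[0])
    if any(len(p) != n for p in models):
        raise ValueError("all models must predict the same restaurants")
    # zip(*models) groups the predictions for each restaurant together.
    return [sum(col) / len(models) for col in zip(*models)]


# Made-up 1-year rating predictions for two restaurants.
preds = {
    "svr": [3.5, 4.0],
    "linear": [3.7, 4.2],
    "random_forest": [3.4, 3.9],
    "xgboost": [3.8, 4.3],
}
ensemble = average_ensemble(preds)
print([round(p, 2) for p in ensemble])  # [3.6, 4.1]
```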
## Key Findings
1. **Geographic context subsumes sentiment signal** - Zip-code comparison features (rating vs. local average, etc.) already capture competitive positioning that sentiment would otherwise indicate.
2. **Early rating is the dominant predictor** - The 90-day average rating had the highest mutual information with 1-year rating by a wide margin.
3. **Reviewer quality correlates with outcomes** - User characteristics (average stars given, engagement metrics) ranked among top predictive features.
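To make the first finding concrete, here is one way a "rating vs. local average" feature could be computed. The README does not give the exact definition, so this sketch assumes a simple difference between a restaurant's 90-day average rating and the mean of that rating over all restaurants in the same zip code (the restaurant itself included). The field names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rating_vs_zip_average(restaurants):
    """Return {business_id: early rating minus the mean early rating in its zip code}.

    restaurants: list of dicts with hypothetical keys
    "business_id", "zip", and "avg_rating_90d".
    """
    by_zip = defaultdict(list)
    for r in restaurants:
        by_zip[r["zip"]].append(r["avg_rating_90d"])
    zip_mean = {z: mean(v) for z, v in by_zip.items()}
    return {
        r["business_id"]: r["avg_rating_90d"] - zip_mean[r["zip"]]
        for r in restaurants
    }


# Made-up restaurants in two zip codes.
sample = [
    {"business_id": "a", "zip": "93101", "avg_rating_90d": 4.5},
    {"business_id": "b", "zip": "93101", "avg_rating_90d": 3.5},
    {"business_id": "c", "zip": "93105", "avg_rating_90d": 4.0},
]
relative = rating_vs_zip_average(sample)
print(relative)  # {'a': 0.5, 'b': -0.5, 'c': 0.0}
```

A feature like this already encodes how a restaurant compares with its neighbors, which is consistent with the finding that sentiment added little once such context was included.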
## Tools & Technologies
- **Data Warehouse:** Google BigQuery
- **Transformation:** dbt
- **Analysis:** Python (pandas, scikit-learn, XGBoost, VADER)
- **Environment:** Google Colab
- **CI/CD:** GitHub Actions
## Analysis
The complete analysis notebook is available [here](./analysis/bana320_three_gits.ipynb). It includes all code, visualizations, and statistical test results.
> **Note:** The data pipeline requires access to a private Google Cloud project. The notebook is provided for review purposes; reproducing it would require setting up your own BigQuery environment with the Yelp Open Dataset.
*BANA 320 - Fall 2025*