{"id":33330259,"url":"https://github.com/vbalalian/three-gits","last_synced_at":"2026-05-13T01:02:31.915Z","repository":{"id":325233179,"uuid":"1100076242","full_name":"vbalalian/three-gits","owner":"vbalalian","description":"Group analytics project for a predictive analytics course. Using the Yelp open dataset to predict restaurant success.","archived":false,"fork":false,"pushed_at":"2025-12-04T23:04:05.000Z","size":1711,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-08T06:56:14.741Z","etag":null,"topics":["bigquery","dbt","predictive-analytics","python","regression","sentiment-analysis","sklearn","sql","vader-sentiment-analysis","yelp-dataset"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vbalalian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-19T20:06:54.000Z","updated_at":"2025-12-04T23:04:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/vbalalian/three-gits","commit_stats":null,"previous_names":["vbalalian/three-gits"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/vbalalian/three-gits","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vbalalian%2Fthree-gits","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vbalalian%2Fthree-gits/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vbalalian%2Fthree-gits/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vbalalian%2Fthree-gits/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vbalalian","download_url":"https://codeload.github.com/vbalalian/three-gits/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vbalalian%2Fthree-gits/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32963176,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-12T23:30:32.555Z","status":"ssl_error","status_checked_at":"2026-05-12T23:30:18.191Z","response_time":102,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","dbt","predictive-analytics","python","regression","sentiment-analysis","sklearn","sql","vader-sentiment-analysis","yelp-dataset"],"created_at":"2025-11-20T18:01:32.451Z","updated_at":"2026-05-13T01:02:31.907Z","avatar_url":"https://github.com/vbalalian.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Group Analytics Project\n\n**Course:** BANA 320 - Predictive Analytics  \n**Dataset:** [Yelp Open Dataset](https://business.yelp.com/data/resources/open-dataset/)  \n**Group Name:** Three Gits  \n**Team:** Vincent Balalian, Sameer Patel, Arish Patel  \n\n[![CD](https://github.com/vbalalian/three-gits/actions/workflows/cd.yml/badge.svg)](https://github.com/vbalalian/three-gits/actions/workflows/cd.yml)\n\n## Overview\n\n**Problem Statement:** Does adding sentiment features from the first 90 days of reviews significantly improve the accuracy of predicting a restaurant's 1-year Yelp rating, compared to using non-text features alone?\n\n**Target Variable:** Average star rating 12 months after first review\n\n**Result:** No - sentiment features did not significantly improve prediction accuracy when geographic and user-based context features were already included.\n\n## Data Pipeline\n\nRaw Yelp JSON data (hosted in Google Cloud Storage) is transformed via **dbt + BigQuery**:\n\n1. **Filtering:** Business dataset filtered to restaurants only\n2. **Qualification criteria:**\n   - Minimum 1 year of review history\n   - At least 3 reviews in first 90 days\n   - At least 10 reviews in first year\n3. **Feature aggregation:** Check-ins, user metrics, and zip-code comparisons aggregated to restaurant level\n4. **Time windowing:** Separate datasets for 90-day (features) and 365-day (target) review periods\n\n## Feature Categories\n\n| Category | Description | Count |\n|----------|-------------|-------|\n| Early Review Metrics | Review count \u0026 average rating (first 90 days) | 2 |\n| Check-in Patterns | Day-of-week and time-of-day distributions | 11 |\n| User Characteristics | Reviewer reputation, engagement, experience | 10 |\n| Zip Code Context | Restaurant metrics vs. local averages | 4 |\n| **Sentiment (VADER)** | Positive, negative, neutral, compound scores | 4 |\n\n## Methodology\n\n**Phase 1:** Train models using all features *except* sentiment  \n**Phase 2:** Add sentiment features derived from VADER analysis of first-90-day reviews  \n**Comparison:** Paired one-tailed t-test on absolute prediction errors (α = 0.05)\n\n### Models Tested\n- Support Vector Regression (linear kernel)\n- Linear Regression\n- Random Forest Regressor\n- XGBoost Regressor\n- Stacked Ensemble (average of all four)\n\n## Key Findings\n\n1. **Geographic context subsumes sentiment signal** - Zip-code comparison features (rating vs. local average, etc.) already capture competitive positioning that sentiment would otherwise indicate.\n\n2. **Early rating is the dominant predictor** - The 90-day average rating had the highest mutual information with 1-year rating by a wide margin.\n\n3. **Reviewer quality correlates with outcomes** - User characteristics (average stars given, engagement metrics) ranked among top predictive features.\n\n## Tools \u0026 Technologies\n\n- **Data Warehouse:** Google BigQuery\n- **Transformation:** dbt\n- **Analysis:** Python (pandas, scikit-learn, XGBoost, VADER)\n- **Environment:** Google Colab\n- **CI/CD:** GitHub Actions\n\n## Analysis\n\nThe complete analysis notebook is available [here](./analysis/bana320_three_gits.ipynb). It includes all code, visualizations, and statistical test results.\n\n\u003e **Note:** The data pipeline requires access to a private Google Cloud project. The notebook is provided for review purposes; reproducing it would require setting up your own BigQuery environment with the Yelp Open Dataset.\n\n*BANA 320 - Fall 2025*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvbalalian%2Fthree-gits","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvbalalian%2Fthree-gits","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvbalalian%2Fthree-gits/lists"}