https://github.com/lamdav/hadoopyelpproject
Repository for our Hadoop class project
hadoop hive java java-8 pig python yelp
- Host: GitHub
- URL: https://github.com/lamdav/hadoopyelpproject
- Owner: lamdav
- Created: 2017-01-20T19:46:50.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-02-16T07:50:26.000Z (over 8 years ago)
- Last Synced: 2025-02-24T03:32:56.693Z (7 months ago)
- Topics: hadoop, hive, java, java-8, pig, python, yelp
- Language: Shell
- Size: 990 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Team RAD: Hadoop-Yelp-Project
## Collaborators:
**R**unzhi Yang, **A**dam Finer, **D**avid Lam

## Feature Goals:
- [x] Feature 1: We want to reaggregate each business's review score with an emphasis on the most recent reviews (see the sketch after this list).
- [x] Feature 2: We want to find the average ratings for all the businesses in each city over a given time interval.
- [x] Feature 3: We want to find the highest rated business over a given time interval.
- [x] Feature 4: We want to find a leaderboard of active users within a given time interval.
- [x] Feature 5: For the above two features, we want to be able to filter the businesses being analyzed by attributes such as distance from a location, city, and business type.
- [x] Feature 6: We want to summarize long reviews of ~~500~~ 1000 characters or more with the Sumy Python library.
- [x] Feature 7: We will write a MapReduce job that runs a Python process to execute Sumy summarizations.
- [x] Feature 8: Create a dynamic rating for each business that is updated every time it is reviewed and declines every month; then plot the business rating over time using a Zeppelin notebook.
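
Below is a minimal sketch, in plain Python, of the recency-weighted reaggregation idea behind Feature 1. The exponential-decay weighting, the half-life, and the `(stars, review_date)` layout are illustrative assumptions, not the project's actual Hive UDF logic.

```python
from datetime import date

def reaggregate_score(reviews, as_of=date(2017, 1, 1), half_life_months=12):
    """Recency-weighted average of review stars.

    `reviews` is an iterable of (stars, review_date) pairs; newer reviews
    get exponentially more weight (hypothetical scheme for illustration).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for stars, review_date in reviews:
        age_months = ((as_of.year - review_date.year) * 12
                      + (as_of.month - review_date.month))
        weight = 0.5 ** (max(age_months, 0) / half_life_months)
        weighted_sum += stars * weight
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# Example: a recent 5-star review outweighs two old 2-star reviews.
reviews = [(5, date(2016, 12, 20)), (2, date(2013, 5, 1)), (2, date(2012, 8, 9))]
print(round(reaggregate_score(reviews), 2))
```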

## Milestone:

### Milestone 01:
- Runzhi:
- [x] Research how to run queries on ~~HBase~~ Hive.
- Adam:
- [x] Write a script to move data to the cluster and run jobs on the cluster without SSH (see the sketch after this milestone).
- David:
- [x] Research and write Pig scripts that communicate with ~~HBase~~ Hive.
- All:
- [x] Discuss how to accomplish aggregate scoring once the data is in the cluster.
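
As a rough illustration of the "move data and run jobs without SSH" task, here is a minimal Python driver sketch. It assumes the standard `hdfs` and `pig` CLIs are available on a client machine already configured against the cluster; the file names, HDFS paths, and Pig parameters are hypothetical.

```python
import subprocess

# Hypothetical local dataset, HDFS target, and Pig script names.
LOCAL_DATA = "yelp_academic_dataset_review.json"
HDFS_DIR = "/user/team_rad/yelp"
PIG_SCRIPT = "load_reviews.pig"

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Copy the raw Yelp data into HDFS from the client machine.
run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR])
run(["hdfs", "dfs", "-put", "-f", LOCAL_DATA, HDFS_DIR])

# Launch the Pig job; -useHCatalog lets the script read/write Hive tables.
run(["pig", "-useHCatalog", "-param", f"input={HDFS_DIR}/{LOCAL_DATA}", PIG_SCRIPT])
```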

### Milestone 02:

- Runzhi:
- [x] Begin working with Hive reaggregation of review score.
- [ ] ~~Partition Tables properly.~~ (Moved to next Milestone due to bug encountered)
- Adam:
- [x] Begin working on feature 2: find the average ratings for all the businesses in each city over a given time interval.
- David:
- **Proposed:**
- [x] Begin working with Hive data to reaggregate review score with weightings.
- **Extras:**
- [x] Reworked logistics of review score reaggregation.
- [x] Finished the UDF for Hive review score reaggregation.
- [x] Updated Pig scripts to remove rows with any null values.

### Milestone 03:
- Runzhi:
- [x] Start feature 4: We want to find a leaderboard of active users within a given time interval.
- [x] Partition Tables properly.
- Adam:
- [x] Finish feature 2: find the average ratings for all the businesses in each city over a given time interval.
- [x] Start feature 3: We want to find the highest rated business over a given time interval.
- David:
- **Proposed:**
- [x] Investigate geo-fencing for Feature 5: filtering the businesses being analyzed by attributes such as distance from a location, city, and business type.
- [x] Start on feature 6: summarize long reviews of ~~500~~ 1000 characters or more with the Sumy Python library.
- **Extras:**
- [x] Complete feature 6: summarize long reviews of ~~500~~ 1000 characters or more with the Sumy Python library.
- [x] Complete feature 7: write a MapReduce job that runs a Python process to execute Sumy summarizations (see the sketch after this milestone).
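
A minimal sketch of how the Sumy summarization step (features 6 and 7) could run as a Hadoop Streaming mapper, assuming one review JSON object per line on stdin; the `text`/`review_id` field names and the LexRank summarizer choice are assumptions for illustration.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: summarize long reviews with Sumy.
import json
import sys

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

MIN_CHARS = 1000          # only summarize reviews of 1000+ characters
SUMMARY_SENTENCES = 3     # assumed summary length

summarizer = LexRankSummarizer()

for line in sys.stdin:
    review = json.loads(line)            # assumes one review JSON per line
    text = review.get("text", "")
    if len(text) < MIN_CHARS:
        continue
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summary = " ".join(str(s) for s in summarizer(parser.document, SUMMARY_SENTENCES))
    print(f"{review.get('review_id')}\t{summary}")
```

On the cluster, a mapper like this could be launched via Hadoop Streaming (for example with `-mapper`, `-file`, and `-numReduceTasks 0`), though the exact invocation used by the project is not shown here.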

### Milestone 04:

- Runzhi:
- [x] Partition the reviews table based on month and year.
- [x] New feature: add points to a business every time it is reviewed, with the points declining every day (see the sketch after this milestone).
- Adam:
- [x] Select a business over a time interval and plot its points over that period.
- [x] Start on feature 5: filter the businesses being analyzed by attributes such as distance from a location, city, and business type.
- David:
- [x] Determine where to export the data (MySQL, Heroku Postgres, etc.).
- [x] Start writing Sqoop scripts to export the data to an external database.
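
A minimal sketch, in plain Python, of the dynamic points idea from this milestone: each review adds points and the accumulated points decay every day. The decay rate and points-per-review values are assumptions for illustration, not the project's actual scheme.

```python
from datetime import date, timedelta

DAILY_DECAY = 0.99      # assumed: points lose 1% of their value each day
POINTS_PER_REVIEW = 10  # assumed: flat number of points added per review

def points_over_time(review_dates, start, end):
    """Return a list of (date, points) samples, one per day.

    Points jump when a review arrives and decay multiplicatively each day.
    """
    points = 0.0
    series = []
    day = start
    while day <= end:
        points *= DAILY_DECAY
        points += POINTS_PER_REVIEW * review_dates.count(day)
        series.append((day, round(points, 2)))
        day += timedelta(days=1)
    return series

# Example: two reviews a week apart; points spike and then decay between them.
series = points_over_time([date(2016, 1, 1), date(2016, 1, 8)],
                          date(2016, 1, 1), date(2016, 1, 14))
print(series[-1])
```

The resulting per-day series is the kind of data the Zeppelin notebook could plot over a chosen time interval.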

### Milestone 05:

- Runzhi, Adam, David:
- [x] Play with Oozie and see how it can work in our project.