Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rajeshthallam/mids-w251-final-project
Repo for final project of "MIDS W251 - Scaling up Really Big Data" course
https://github.com/rajeshthallam/mids-w251-final-project
Last synced: about 1 month ago
JSON representation
Repo for final project of "MIDS W251 - Scaling up Really Big Data" course
- Host: GitHub
- URL: https://github.com/rajeshthallam/mids-w251-final-project
- Owner: RajeshThallam
- Created: 2015-07-05T23:34:41.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-10-16T04:19:15.000Z (about 9 years ago)
- Last Synced: 2023-04-09T12:18:30.597Z (over 1 year ago)
- Language: Python
- Size: 1.88 MB
- Stars: 1
- Watchers: 4
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# MIDS W251 FINAL PROJECT
Repo for final project of "MIDS W251 - Scaling up Really Big Data" course## Wikipedia Trending Pages Analysis
### Team
- Amir Zai
- Rajesh Thallam
- Shelly Stanley### Abstract
Our project, “Wikipedia Trending Pages Analysis”, is an analytics tool tracking trending pages in Wikipedia. Wikipedia is one of the top most visited websites on the Internet with millions of articles, 18 billion page views and 500 million unique visitors per months. We gathered and analyzed traffic data and developed a system to (i) display the top 10 trending pages on Wikipedia in last 24 hours (ii) display top 10 trending pages on Wikipedia in last 30 days (iii) predicting page view traffic for the trending pages in last 24 hours and (iv) search and view past trends for a page or article based on users interest. The tool is built on Apache Spark platform hosted on HDFS and Hive with the end results visualized on Kibana hosted on ElasticSearch index. The project has successfully demonstrated how to scale up a data pipeline to process, store and analyze batches and streams of large data sets of volumes in terabytes and analyze traffic for approximately 3.5 million articles.**Keywords**: Wikipedia; Predictions; Trend Analysis; Apache Spark; Apache Hive; Elastic Search; Kibana; SoftLayer.
### Data Set
https://aws.amazon.com/datasets/4182
http://dumps.wikimedia.org/other/pagecounts-raw/2015/
https://dumps.wikimedia.org/enwiki/latest/