https://github.com/qubole/demotrends
Code required to setup the demo trends website (http://demotrends.qubole.com)
- Host: GitHub
- URL: https://github.com/qubole/demotrends
- Owner: qubole
- License: apache-2.0
- Created: 2013-08-06T17:03:34.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2016-09-26T08:26:12.000Z (over 9 years ago)
- Last Synced: 2024-04-17T22:49:27.309Z (about 2 years ago)
- Language: Ruby
- Size: 499 KB
- Stars: 6
- Watchers: 6
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DemoTrends (http://demotrends.qubole.com)
A Big Data app that displays the topics that are trending on Wikipedia.
There are two main parts:
1. Webapp in Ruby on Rails.
2. Data pipeline hosted in *Qubole Data Service*
You can read more about DemoTrends in this [blog post](https://www.qubole.com/blog/big-data/build-a-data-pipeline-with-qubole).
## Quick Start
1. Register for a [Trial Plan](http://www.qubole.com/try) in Qubole
2. [Obtain the API key](http://www.qubole.com/qds-api-reference/authentication/)
3. Run the commands in the *commands* directory
## Webapp
Code required to set up the demo trends website (http://demotrends.qubole.com).
#### Set up
1. Create the database: `./webapp/script/init-mysql.sh`
2. Run the migrations: `rake db:migrate`
#### Populate Data in db
1. Using sample data: `rake db:seed`. This inserts one row into each table.
2. Using the SQL dump: you can also populate the database from the SQL dump file, which contains processed data for 30 June 2013 to 13 August 2013.
`sudo mysql trend < webapp/db/sqldump/mysqldump_13AUG13.sql`
#### Start the webapp
1. Run `./webapp/script/restart_server.sh`
## Data Pipeline
### Hive
Directory contains two UDFs required by the data pipeline:
1. collect_all - A JAR UDF
2. hive_trend_mapper - A Python UDF
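
Python UDFs in Hive are typically streaming scripts invoked through `TRANSFORM`: they read tab-separated rows from standard input and write tab-separated rows to standard output. The skeleton below illustrates that contract only. It is not the code of `hive_trend_mapper`; the two-column minimum and the pass-through behavior are assumptions made for illustration.

```python
import sys


def transform(lines):
    """Yield one tab-separated output row per well-formed input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Assumption: rows with fewer than two columns are malformed.
            continue
        # Hypothetical pass-through; a real mapper would compute its
        # output columns from `fields` here.
        yield "\t".join(fields)


def main():
    # Hive pipes the selected columns to stdin, one row per line.
    for row in transform(sys.stdin):
        print(row)
```

A script built this way would call `main()` at the bottom of the file and be registered with Hive's `ADD FILE` before being referenced in a `TRANSFORM ... USING` clause.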
### Scripts
Directory contains scripts that are run in a *Shell Command*.
1. pagecount_dump.py - A script that downloads one day's *pagecounts* data from the Wikimedia website.
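
As a rough illustration of what "one day's pagecounts data" covers, the sketch below derives the monthly index URL and the 24 hourly file-name prefixes for a given date. The base URL and naming scheme are assumptions based on the public Wikimedia dumps archive, not values read from `pagecount_dump.py`, and both helper names are hypothetical.

```python
from datetime import date

# Assumption: location of the archived hourly pagecounts dumps.
BASE_URL = "https://dumps.wikimedia.org/other/pagecounts-raw"


def month_index_url(day):
    """Return the index page that lists the dump files for `day`'s month."""
    return f"{BASE_URL}/{day:%Y}/{day:%Y-%m}/"


def hourly_prefixes(day):
    """Return the 24 file-name prefixes (one per hour) for `day`."""
    # Only the prefix is built; the remainder of each file name is not
    # assumed here and would be read from the index page.
    return [f"pagecounts-{day:%Y%m%d}-{hour:02d}" for hour in range(24)]
```

A downloader could fetch the index page, keep the links that start with one of these prefixes, and download each match.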
### Commands
Directory contains all the commands to process one day's worth of data.
The order of the commands matters. Each filename starts with a number that gives its position in the execution sequence.
Run the scripts using the [Qubole Python SDK](https://github.com/qubole/qds-sdk-py).
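
The numeric-prefix convention can be resolved programmatically. The following is a minimal sketch, assuming each command file name begins with an integer; the helper name `ordered_commands` is hypothetical, and submitting each file through the SDK is left out.

```python
import re
from pathlib import Path


def ordered_commands(directory):
    """Return file names in `directory` sorted by their leading number."""
    numbered = []
    for path in Path(directory).iterdir():
        match = re.match(r"(\d+)", path.name)
        # Files without a numeric prefix are skipped rather than guessed at.
        if path.is_file() and match:
            numbered.append((int(match.group(1)), path.name))
    # Sorting on the integer keeps 10 after 2, unlike a plain string sort.
    return [name for _, name in sorted(numbered)]
```

Calling `ordered_commands("commands")` from the repository root would list the files in the order they should be run.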
### Airflow
If you want to use [Apache Airflow](https://github.com/apache/incubator-airflow) to manage the pipeline, see the `airflow` folder.