
# YELP Data Processing Project

This is an academic Yelp data processing project consisting of `pipeline`, `processor` and `data serving layer` parts, based on the SMACK concept. The processor part is developed as a `Spark` project that runs against `Cassandra`, while the pipeline part is developed with Python's `Luigi` and `GenericLuigi`, an open source project that improves Luigi's functionality with things like task re-usability and defining a Luigi pipeline flow as JSON. I authored the `GenericLuigi` project. The data serving layer is `Jupyter`, together with some prepared `Jupyter` notebooks containing interesting queries.

This project makes it easy to process the YELP academic dataset, requiring only minimal code changes when a schema is updated or a new one comes up.
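
For orientation, the overall flow the platform automates can be sketched with a few manual steps; this is only illustrative, and the paths and the Spark class name below are assumptions rather than the project's actual values.

```bash
# Rough, illustrative outline of the flow (paths and class name are assumptions):

# 1. Pipeline (Luigi): extract the raw Yelp JSON files from the tar archive
mkdir -p ./data && tar -xf yelp_dataset.tar -C ./data

# 2. Processor (Spark): read the extracted JSON and write tabular data into Cassandra
spark-submit --class com.example.YelpProcessor \
  ./processor/processor-assembly.jar ./data

# 3. Data serving layer (Jupyter): query the Cassandra tables from the prepared notebooks
jupyter notebook
```

In the actual platform these steps are wired together as a Luigi flow and run inside Docker, as described in the Running section below.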

-----------------------------------------------------------------

| ![yelp_logo](https://user-images.githubusercontent.com/1279644/28669732-02ab458c-72de-11e7-9feb-92e2d128a4d9.png) | ![spark-logo](https://user-images.githubusercontent.com/1279644/28669729-02a2157a-72de-11e7-94d6-4597202a2d45.png) | ![cassandra-logo](https://user-images.githubusercontent.com/1279644/28669731-02a7374e-72de-11e7-8ed5-d2e1c8e4557f.png) | ![jupyter-logo](https://user-images.githubusercontent.com/1279644/28708360-acb2cb30-7384-11e7-9c7f-fe5535c29278.png) | ![luigi-logo](https://user-images.githubusercontent.com/1279644/28669730-02a6311e-72de-11e7-9786-e891ec82d930.png) | ![docker-logo](https://user-images.githubusercontent.com/1279644/28669850-aaa53608-72de-11e7-8db7-408b16a2b174.png)|
|:-:|-|-|-|-|-|

## Demo

The **bigger version of the sample data** has already been processed by the deployed demo platform.

A password is required to log in. You can ask me for the login password if you would like to test it.

[http://yelp.ahmetdal.org](http://yelp.ahmetdal.org)

**Note:** If it is down, please let me know.

## Running

**Note:** This documentation is for a local environment. If it needs to run in a production-like setup, such as submitting Spark jobs on `Mesos`, `YARN` or similar, or against some other production `Cassandra` cluster, the entire platform can simply be configured to do this.

```bash
git clone https://github.com/javrasya/yelp-data.git
cd ./yelp-data/
chmod 777 ./*.sh
```

-----------------------------------------------------------------
### Important Note:

My local environment was not powerful enough to process the real, large Yelp academic data, so I sampled the data for my unit and integration test cases. You may face issues such as slowness (writing into Cassandra running in a Docker container may be slow and may cause timeouts at `LOCAL_QUORUM`). The bigger sample data can also be used.
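
If you do hit those `LOCAL_QUORUM` timeouts against the dockerized Cassandra, one knob worth trying is the write consistency level of the Spark Cassandra Connector. The sketch below is only an illustration: `spark.cassandra.output.consistency.level` is a standard connector property, but the class name and jar path are assumptions, not the project's own submit command.

```bash
# Sketch: relax the connector's write consistency for a single-node, dockerized Cassandra.
# The class name and jar path are placeholders; only the --conf line is the point.
spark-submit \
  --conf spark.cassandra.output.consistency.level=LOCAL_ONE \
  --class com.example.YelpProcessor \
  ./processor/processor-assembly.jar
```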

-----------------------------------------------------------------

### Starting Platform

With the pipeline execution, it will extract the tar file and then trigger the `Spark` job that processes the data and converts it into tabular format in `Cassandra`. **The location of the Yelp tar file on the hosting machine can be given as the first parameter to the `start-all.sh` script.** I suggest using the sampled data in a local environment by running `start-all.sh` without an argument. There are two sample data sets, a `smaller` and a `larger` one.

```bash
#-----------------------------------------------------------------
# If you do not give a tar file, the bigger sample tar file will be used.
#-----------------------------------------------------------------

./start-all.sh
```

or

```bash
#-----------------------------------------------------------------
# Pass the location of a tar file as the first argument, e.g.
#
# ./start-all.sh $PWD/processor/src/main/resources/sample_data.tar
# ./start-all.sh $PWD/processor/src/main/resources/sample_data_bigger.tar
#-----------------------------------------------------------------

./start-all.sh $PWD/processor/src/main/resources/sample_data.tar
```
-----------------------------------------------------------------

### Starting Data Processing

```bash
# This will extract tar file and convert json data into tabular format in Cassandra.

./start-data-processing.sh
```

Even if you trigger the flow more than once, the tar extractor won't be executed more than once, thanks to the idempotency feature of `Luigi`.
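
A quick way to see this in practice; nothing below is project-specific beyond the script itself.

```bash
./start-data-processing.sh   # first run: extracts the tar and runs the Spark job
./start-data-processing.sh   # second run: the extract task is already complete, so Luigi skips it
```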

If, during the `Spark` data write, the `Cassandra` container exits suspiciously without an error (which I also faced and couldn't figure out), you can use the sampled data to check whether the entire platform is working or not. When you face such an issue, try to start the Cassandra container again.

```bash
docker start cassandra
```
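
Before restarting, a couple of standard Docker commands can help show why the container stopped; the container name `cassandra` is taken from the command above.

```bash
docker ps -a --filter name=cassandra   # did the container exit, and with which status code?
docker logs --tail 50 cassandra        # last log lines; an abrupt stop without errors often points to memory limits
```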

-----------------------------------------------------------------

## Executing Example Queries

This is provided by introducing `Jupyter` here. Open the [Jupyter web interface (http://localhost:8888)](http://localhost:8888/); to execute an example, just click on it and then run it. The exact URL is given in the logs generated by the `start-all.sh` script, as seen below.

![screen shot 2017-07-28 at 13 41 32](https://user-images.githubusercontent.com/1279644/28714087-7de6a3e2-739a-11e7-9f41-c064c0e1df2d.png)
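
If the logs have already scrolled away, the tokenised URL can usually be recovered from inside the running container; the container name below is an assumption, so substitute whatever `docker ps` shows for the platform container.

```bash
# List the running notebook servers (and their token URLs) inside the platform container.
docker exec yelp-data-platform jupyter notebook list
```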

### Examples
#### Positive Words by Business:

This example finds positive words, ignoring stop words (in English and German), and orders them by how positive they are.

![screen shot 2017-07-28 at 10 05 38](https://user-images.githubusercontent.com/1279644/28706408-d79a7eb8-737c-11e7-8dbe-d0e91c3dda9c.png)

#### Bad Things are Happening for Some Businesses:

This example detects that bad things are happening for a business by checking for sequential low star ratings in its reviews. To be clear, it flags businesses with at least 3 low-star reviews (fewer than 3 stars) within 5 consecutive reviews, so not necessarily strictly in a row.

![screen shot 2017-07-28 at 10 06 44](https://user-images.githubusercontent.com/1279644/28706336-894a594a-737c-11e7-9058-76b723edcaa3.png)
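
Both examples work on the review data that the Spark job wrote into Cassandra. To eyeball that data outside Jupyter, `cqlsh` inside the Cassandra container works too; the keyspace and table names below are assumptions, not the project's actual schema, so check the keyspaces first.

```bash
# Peek at the Cassandra data directly (keyspace/table names are assumptions):
docker exec -it cassandra cqlsh -e "DESCRIBE KEYSPACES;"
docker exec -it cassandra cqlsh -e "SELECT business_id, stars FROM yelp.review LIMIT 5;"
```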

## Building (OPTIONAL)

### Building Processor Fat Jar

**Note:** There is no need to build the fat jar, because it already exists pre-built in the repository.

This phase ideally requires `sbt` to be installed, but the fat jar is included in this git repository so it can be tested easily without `sbt`. Just in case:

```bash
cd /path/to/yelp_data/processor/
sbt clean assembly
```

(By the way, all test cases are supposed to pass.) If some test cases are broken, skip them:

```bash
cd /path/to/yelp_data/processor/
sbt 'set test in assembly := {}' clean assembly
```
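
After the assembly, the fat jar ends up under the usual sbt target directory; the exact file name depends on the Scala version and project name, so treat the pattern below as an assumption and check the `sbt assembly` output for the real path.

```bash
# Locate the freshly built fat jar (the path pattern is an assumption):
find processor/target -name '*assembly*.jar'
```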

### Building Docker Image

**Note:** There is no need to build the Docker image, because it already exists on `DockerHub` and every git commit triggers an automated Docker build.

Because it is based on the `Jupyter-AllSpark` image, its size is **a bit large** (2 GB on `DockerHub`). This image contains `Jupyter`, `Apache Toree`, `Java`, `Spark`, `PySpark` and the platform code (which consists of `processor` and `pipeline`).

```bash
docker build -t yelp-data-platform .
docker tag yelp-data-platform ahmetdal/yelp-data-platform
docker push ahmetdal/yelp-data-platform
```
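
A quick local sanity check after building; these are standard Docker commands and the tag matches the one used above.

```bash
docker images ahmetdal/yelp-data-platform    # confirm the image exists locally and check its size
docker history ahmetdal/yelp-data-platform   # see which layers contribute most to that size
```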