Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ashirwadpradhan/yasmss
A Rest Client for converting SQL query template to MapReduce and Spark jobs
https://github.com/ashirwadpradhan/yasmss
mapreduce pyspark spark sql-query
Last synced: 4 days ago
JSON representation
A Rest Client for converting SQL query template to MapReduce and Spark jobs
- Host: GitHub
- URL: https://github.com/ashirwadpradhan/yasmss
- Owner: AshirwadPradhan
- License: mit
- Created: 2019-09-01T09:16:00.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-11-11T07:52:19.000Z (about 2 years ago)
- Last Synced: 2024-11-16T07:49:55.506Z (2 months ago)
- Topics: mapreduce, pyspark, spark, sql-query
- Language: Python
- Size: 113 KB
- Stars: 0
- Watchers: 5
- Forks: 1
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ## Yet Another SQL to MapReduce and Spark Segregator (yasmss)
## INDEX
| Sr. No | Contents |
| ----------- | ----------- |
| 1.0 | Problem Statement Overview |
| 1.1 | Description of the Input dataset
| 2.0 | Solution Description
| 3.0 | System Design
| 5.0 | Implementation
| 5.1 | Prerequisite Environment
| 5.2 | System Modules
| 5.3 | GROUPBY Logic (Mapreduce & Spark)
| 5.4 | JOIN Logic (Mapreduce & Spark)
| 6.0 | User Manual## 1.0 Problem Statement Overview
The aim of the project is to develop a REST-based service for the given dataset. This service will accept a query in the form of one of the two templates provided below.
The two standard templates are:
### Join Template
```sql
SELECT*FROM
INNERJOIN
ON
WHERE
Group By Template
SELECT,FUNC(COLUMN1)
FROM
GROUPBY
HAVING FUNC(COLUMN1) > X
```
*Here FUNC can be COUNT, MAX, MIN, SUM*The service translates the query into **MapReduce jobs** and also into **Spark job**. It should run these two jobs separately and return the following in a **JSON object**:
- [x] Time taken for Hadoop MapReduce execution.
- [x] Time taken for Spark execution.
- [x] Input and output of map and reduce tasks in a chain as they are applied on the data.
- [x] Spark transformations and actions in the order they are applied.
Result of the query.## 1.1 Description of the Input dataset
“Users” file stores the demographic information about the users.
The “zipcodes” file contains the city and state information for the zipcodes. Zipcode is referenced by users file.
“Movies” file stores the information about movies. The last 19 fields in movies table are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres.
“Rating” file contains the ratings given by users for movies. Userid and movieid refer to the users and movies files respectively.Users: userid | age | gender | occupation | zipcode
Zipcodes: zipcode | zipcodetype | city | state
Movies table: movieid | title | releasedate | unknown | Action | Adventure | Animation |Children | Comedy | Crime | Documentary | Drama | Fantasy |Film_Noir | Horror | Musical | Mystery | Romance | Sci_Fi |Thriller | War | Western |
Rating: userid | movieid | rating | timestamp
## 2.0 Solution Description
Refer to section [3.0 System Design](#) along with this description.
1. User submits query.
2. Core sends the string query to the Parser.
3. The Parser returns query object to the Core.
4. Core sends the parsed object to the Driver.
5. Driver sends the parsed query object to the MR Mapper and receives the TSV.
6. Driver also sneds the parsed query object to the Spark Mapper and receives the RDD.
7. TSV & RDD from Driver is sent to MR Result and Spark Result respectively.
8. JSON String from MR Result and Spark Result is passed to the Driver.
9. Driver passes the JSON String to the Core.
10. Core sends response to client as a JSON string.## 3.0 System Design
![Data flow](yasmss.jpg)
## 4.0 IMPLEMENTATION
## 4.1 Prerequisite Environment
1. LINUX MACHINE
2. JAVA 1.8_221
3. Hadoop 3.1.2 - HDFS & MapReduce
4. Spark 2.4.3
5. Python 3.6 & later
6. OpenSSH## 4.2 System Modules
1. Core:
Handles REST API request from user and sends to server
It’s the only module that talks to the parser
After receiving the parsed query from the parser it delegates the further processing to the driver.
Gives the final output to the client in the form of JSON string.
2. Parser:
Parses the input query and returns a JSON object3. Driver:
Driver takes over the control of the process after receiving the JSON object.
It manages the MR Mapper, Spark Mapper and their results.4. MR Mapper:
Takes each row input from the file through stdin (hadoop streaming) , identifies the key, value pair and pipes it to the reducer.
Reducer does the JOIN/GROUPBY operation, filters according to WHERE condition and outputs the result.5. Spark Mapper:
Reads the file from HDFS and converts it into dataframe.
Applies transformations on the dataframe like JOIN, FILTER, GROUPBY to get the desired functionality.6. MR Result:
Reads the output file from the HDFS and converts it to JSON string and outputs the result to the driver.7. Spark Result:
Takes the transformed dataframe and converts it record by record into JSON string.## 4.3 GROUPBY LOGIC
### MAP REDUCE LOGIC:
The GROUPBY mapper takes the comma separated values from the input file line by line through stdin. It also gets the parsed query object from the Driver. Based on the query it produces the key value pairs. Here the key is the SELECT columns (which may be one or more) and the value is the aggregate column. It pipes the key-value pair to the reducer.The reducer takes the key-value pairs generated by the mapper and groups the pairs having the same keys and filters the data according to the having condition specified in the query. It outputs the data in the form of tab separated values.
### SPARK LOGIC
The GROUPBY Spark mapper takes the comma separated values from the input file line by line through stdin. It also gets the parsed query object from the Driver. Based on the condition the spark mapper starts a spark job and read the csv file.As per the "aggregate" function and the "where" condition the spark mapper generates the output data.
## 4.4 JOIN LOGIC
### MAP REDUCE LOGIC:
The JOIN mapper takes the comma separated values from the 2 input files line by line through stdin. It also gets the parsed query object from the Driver. Based on the query it produces the key value pairs. It appends key “1” if the data is coming from table 1 otherwise appends 2 otherwise. It pipes the key-value pair to the reducer.The reducer takes the key-value pairs generated by the mapper and checks for the same values in join column indexes. If the values are same then it checks for the ‘where’ clause and if condition is satisfied then outputs the row values.
### SPARK LOGIC
The JOIN Spark mapper takes the comma separated values from the 2 input files line by line through stdin. It also gets the parsed query object from the Driver.Based on the query it distinguishes from_table and the join_table and retrieves data as required. It then performs join operation on those retrieved tables on the provided "KEY" or the "JOIN COLUMN" and based on the "where" clause provides the required rows.
## 5.0 USER MANUAL
This is built and tested on Hadoop 3.1.2 and Spark 2.4.3 which need Java 1.8.0_221 as dependency.
This is built and tested on Python 3.6 and later.
This project is only supported on Linux environment until now. (Windows support is on the way)>#### Download:
>[Hadoop 3.1.2](https://hadoop.apache.org/release/3.1.2.html)
>[Spark 2.4.3](https://spark.apache.org/downloads.html)
[Java 1.8.0](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
[Python](https://www.python.org/downloads/)#### USING THIS PROJECT:
After all the pre-requisites mentioned above is installed on the machine then run the following commands on the terminal of your choice.1. Clone this Repository `git clone https://github.com/AshirwadPradhan/yasmss.git`
2. `cd` into the directory `yasmss`
3. Then run `pip3 install -r requirements.txt`
4. Then setup the following config files inside the project:
>4.1 `cd yasmss/` Open `config.yaml` in your favoraite code editor and modify the following config without the `{}`
```yaml
pathconfig:
host_ip_port : {insert path here For ex - https://localhost:9000}
hadoop_streaming_jar: {insert path here}
input_dir: {insert path here}
parent_output_dir: {insert path here}
child_output_dir: {insert path here}
```
>4.2 `cd` into `schema` directory
>4.3 Open `schemas.yaml` in your favoraite code editor. This file contains schema of all the tables to be used. Add the desired table schema in the following format:
```yaml
users:
userid : IntegerType
age: IntegerType
gender: StringType
occupation: StringType
zipcode: StringType
```
5. Now `cd ..` into the parent directory `yasmss`
6. Run `python core.py` and it will deploy the application on the localhost.
7. Go the address prompted by the flask using your browser and submit your query to receive the JSON reponse.#### Steps to contribute to this project:
- [x] Clone the repo
- [x] Make sure the repo is working locally
- [x] Ensure you are on the master branch
- [x] Make a branch of your own
- [x] Make changes in the the local copy and write good commit messages
- [x] Write test for your changes
- [x] Ensure that the test cases are passing
- [x] Push the changes to the origin remote and create a PR
- [x] Go to the fork of your project in the browser
- [x] Compare and pull request your branch
- [x] Ensure no merge conflicts are there
- [x] Ensure the base fork: is the correct repo and branch
- [x] Most Important : Add a description and check the diff
- [x] Now click on Create pull request