https://github.com/kshru9/hive-data-consumption-app
This application can be used to query hive data warehouse in a simplified manner
angularjs apache-hive docker-compose hiveql hiveserver2 html-css-javascript scheduler spring-boot typescript
- Host: GitHub
- URL: https://github.com/kshru9/hive-data-consumption-app
- Owner: kshru9
- Created: 2021-08-12T07:03:37.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2021-08-12T07:27:38.000Z (almost 5 years ago)
- Last Synced: 2025-02-28T08:57:43.121Z (over 1 year ago)
- Topics: angularjs, apache-hive, docker-compose, hiveql, hiveserver2, html-css-javascript, scheduler, spring-boot, typescript
- Language: Java
- Homepage:
- Size: 1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Hive Data Consumption Application
# Overview
The aim is to develop a web application, backed by REST APIs, for consuming data from a Hive data warehouse.
The app queries the Hive data warehouse using the requested columns, filters and limit, and returns the results to the user as a file.
# [Table of Contents](#table-of-contents)
- [Functionalities](#functionalities)
- [Architecture](#architecture)
- [Client](#client)
- [Server](#server)
  * [Apis and its description](#apis-and-its-description)
    + [Form Submission API](#form-submission-api)
    + [Status API](#status-api)
    + [Get Databases API](#get-databases-api)
    + [Get Tables API](#get-tables-api)
    + [Get Columns API](#get-columns-api)
  * [Support methods](#support-methods)
    + [Validation](#validation)
    + [Task Scheduler](#task-scheduler)
- [Technologies used](#technologies-used)
- [How to run?](#how-to-run-)
- [Swagger Documentation](#swagger-documentation)
- [Future Goals](#future-goals)
- [Acknowledgement](#acknowledgement)
# Functionalities
Query a single Hive table by column names, apply filters with a `WHERE` clause, and add a limit.
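To make the shape of such a query concrete, here is a minimal sketch of how the form fields could be assembled into a single-table HiveQL statement. `HiveQueryBuilder` and `build` are hypothetical names, not taken from this repository, and joining multiple filters with `AND` is an assumption. The filters are assumed to have been validated already.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: assembles a single-table HiveQL SELECT from form fields. */
final class HiveQueryBuilder {
    // Accept only plain identifiers so that names cannot smuggle in extra SQL.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static String build(String db, String table, List<String> columns,
                        List<String> filters, int limit) {
        requireIdentifier(db);
        requireIdentifier(table);
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("At least one column is required");
        }
        for (String column : columns) {
            requireIdentifier(column);
        }
        if (limit <= 0) {
            throw new IllegalArgumentException("Limit must be positive");
        }

        StringBuilder sql = new StringBuilder("SELECT ")
                .append(String.join(", ", columns))
                .append(" FROM ").append(db).append('.').append(table);
        if (!filters.isEmpty()) {
            // Assumption: each filter is an already validated condition such as
            // "age > 30", and multiple filters are combined with AND.
            sql.append(" WHERE ").append(String.join(" AND ", filters));
        }
        return sql.append(" LIMIT ").append(limit).toString();
    }

    private static void requireIdentifier(String name) {
        if (name == null || !IDENTIFIER.matcher(name).matches()) {
            throw new IllegalArgumentException("Invalid identifier: " + name);
        }
    }
}
```

For example, calling `build` with database `default`, table `medicare_demographic`, columns `name` and `age`, one filter `age > 30` and limit `100` yields `SELECT name, age FROM default.medicare_demographic WHERE age > 30 LIMIT 100`.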
# Architecture

# Server
## Apis and its description
We have multiple REST APIs for different functions.
### Form Submission API
- It is a POST API
- `Aim`: To save the details of the form submitted by the user, validate them, and return to the user a UUID and the location of the file in which the query results will be stored
- Can be hit using `localhost:8080/api/save`. It is called when the user submits the form
- All the information submitted through the form is sent as JSON in the request body, as shown below
{
"columns": ["name", "age"],
"filters": ["name"],
"limit": "100",
"table": "medicare_demographic",
"db": "default"
}
- A UUID is assigned to the submitted `Request`
- A response `Valid` is added to this `Request`, and the result is stored as a `GetResponse`
- An example of a `GetResponse` object is shown below.
{
response:Valid,
columns:[name, age],
filters:[name],
limit:100,
table:medicare_demographic,
database:default
}
- The above `GetResponse` is stored in a map (`RequestMap`), keyed by the UUID. This map holds the up-to-date status of the Hive query from the backend
- An example of `RequestMap` is shown below
{
2a5c211d-5b24-43ac-b1f4-362d3b3abe1d :
{
response:Valid,
columns:[name, age],
filters:[name],
limit:100,
table:medicare_demographic,
database:default
}
}
- Validate the `Request` and return the UUID, file location and response to the user
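The bookkeeping described above can be sketched in plain Java. This is only an illustration under assumptions: the class name `RequestStore`, its method names, and the `Unknown request` fallback are invented here, and the actual Spring Boot controller in the repository may be organised differently.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** The submitted request plus its current status, mirroring the GetResponse fields. */
final class GetResponse {
    volatile String response; // "Valid", an error message, or a later query status
    final List<String> columns;
    final List<String> filters;
    final String limit;
    final String table;
    final String database;

    GetResponse(String response, List<String> columns, List<String> filters,
                String limit, String table, String database) {
        this.response = response;
        this.columns = columns;
        this.filters = filters;
        this.limit = limit;
        this.table = table;
        this.database = database;
    }
}

/** Hypothetical store playing the role of the UUID-keyed request map. */
final class RequestStore {
    private final Map<String, GetResponse> requestMap = new ConcurrentHashMap<>();

    /** Registers a request and returns the UUID assigned to it. */
    String save(GetResponse request) {
        String uuid = UUID.randomUUID().toString();
        requestMap.put(uuid, request);
        return uuid;
    }

    /** Current status for a UUID; the fallback text is an invented placeholder. */
    String status(String uuid) {
        GetResponse entry = requestMap.get(uuid);
        return entry == null ? "Unknown request" : entry.response;
    }
}
```

A concurrent map is used in the sketch because the map is read by status lookups while a background task updates it.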
### Status API
- It is a GET API
- `Aim`: To return the current status of the query from the hashmap
- Can be hit using `localhost:8080/api/status/{UUID}`. It is called when the user clicks the refresh button
### Get Databases API
- It is a GET API
- `Aim`: To return the list of databases collected from the Hive warehouse
- Can be hit using `localhost:8080/api/getdbs`
### Get Tables API
- It is a GET API
- `Aim`: To return the list of tables collected from the Hive warehouse for a particular database
- Can be hit using `localhost:8080/api/gettables/{db}`
### Get Columns API
- It is a POST API
- It takes the database name and table name as its payload
- `Aim`: To return the list of columns collected from the Hive warehouse for a particular database and table
- Can be hit using `localhost:8080/api/getcols`
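The three metadata endpoints above map naturally onto standard HiveQL statements. The sketch below shows one way they could be issued over JDBC. `HiveMetadata` and its methods are hypothetical names, and the repository may collect this information differently.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: HiveQL statements behind the metadata endpoints. */
final class HiveMetadata {
    static String showDatabases() {
        return "SHOW DATABASES";
    }

    static String showTables(String db) {
        return "SHOW TABLES IN " + identifier(db);
    }

    static String describeTable(String db, String table) {
        return "DESCRIBE " + identifier(db) + "." + identifier(table);
    }

    // Only plain identifiers are allowed, since names are concatenated into SQL.
    private static String identifier(String name) {
        if (name == null || !name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Invalid identifier: " + name);
        }
        return name;
    }

    /**
     * Runs a statement and collects the first column of every row.
     * For DESCRIBE, partitioned tables may yield extra header rows
     * that a caller would need to filter out.
     */
    static List<String> firstColumn(Connection conn, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Statement statement = conn.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                values.add(rows.getString(1));
            }
        }
        return values;
    }
}
```

With an open HiveServer2 JDBC connection `conn`, `HiveMetadata.firstColumn(conn, HiveMetadata.showTables("default"))` would return the table names of the `default` database.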
## Support methods
### Validation
- `input params`: Array of column names, Array of filters, limit, source name, database name
- It checks that the source and database exist in the data warehouse
- It then checks that the given columns exist in the given database and that the provided limit is valid
- It also runs the data type matching function
- It uses a regex pattern matcher for data type matching and for the left and right clauses of a filter condition
- Thorough checks validate the LHS of each filter and the operator in between, followed by a data type check based on the LHS column; together these make the system resistant to SQL injection
- Appropriate error/valid messages are returned, and the returned message is used to set the response variable that is added to the global hashmap
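As an illustration of the filter checks described above, the following sketch validates one filter condition with regular expressions. It assumes filters have the form `column operator literal` (for example `age > 30`). The class name, the supported operators and types, and the error messages are all assumptions rather than the repository's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical validator for a single filter such as "age > 30". */
final class FilterValidator {
    // LHS column, comparison operator, RHS literal. Longer operators come
    // first so that "<=" is not read as "<" followed by "=".
    private static final Pattern FILTER = Pattern.compile(
            "\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*(<=|>=|<>|!=|=|<|>)\\s*(.+?)\\s*");
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(\\.\\d+)?");
    // A single-quoted literal with no embedded quotes or backslashes.
    private static final Pattern QUOTED = Pattern.compile("'[^'\\\\]*'");

    private static final Set<String> NUMERIC_TYPES = new HashSet<>(Arrays.asList(
            "tinyint", "smallint", "int", "bigint", "float", "double", "decimal"));
    private static final Set<String> TEXT_TYPES = new HashSet<>(Arrays.asList(
            "string", "varchar", "char"));

    /** Returns "Valid" or an error message; columnTypes maps column name to Hive type. */
    static String validate(String filter, Map<String, String> columnTypes) {
        Matcher m = FILTER.matcher(filter);
        if (!m.matches()) {
            return "Invalid filter syntax";
        }
        String column = m.group(1);
        String type = columnTypes.get(column);
        if (type == null) {
            return "Unknown column: " + column;
        }
        // Strip parameters such as "(10,2)" from "decimal(10,2)".
        String baseType = type.toLowerCase().replaceAll("\\(.*", "");
        String rhs = m.group(3);
        if (NUMERIC_TYPES.contains(baseType)) {
            return NUMBER.matcher(rhs).matches() ? "Valid" : "Expected a number for " + column;
        }
        if (TEXT_TYPES.contains(baseType)) {
            return QUOTED.matcher(rhs).matches() ? "Valid" : "Expected a quoted string for " + column;
        }
        return "Unsupported column type: " + type;
    }
}
```

Because the whole right-hand side must match a single literal, inputs such as `name = 'x' OR 1=1` or `age > 30; DROP TABLE t` are rejected instead of being passed on to Hive.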
### Task Scheduler
- It is scheduled to run every 1 second
- It picks up a UUID key whose value is `Valid` from the hashmap, runs the query according to the `Request`, and updates the value to `Started and Running` in the hashmap
- It updates the response in the hashmap to `Complete` on successful query completion or `Failure` on unsuccessful completion
- It writes 'No records found' to the generated file if validation and query execution succeed but no matching records are found
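The polling loop described above can be sketched with the JDK's `ScheduledExecutorService`. This is a simplified illustration: `QueryScheduler` is a hypothetical name, the map here holds only status strings, and `runQuery` stands in for whatever code executes the Hive query and writes the result file.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical poller that moves requests from "Valid" to a final status. */
final class QueryScheduler {
    private final Map<String, String> statusByUuid;
    // Assumed callback: runs the query for a UUID and returns true on success.
    private final Predicate<String> runQuery;

    QueryScheduler(Map<String, String> statusByUuid, Predicate<String> runQuery) {
        this.statusByUuid = statusByUuid;
        this.runQuery = runQuery;
    }

    /** One polling pass over the map. */
    void tick() {
        for (String uuid : statusByUuid.keySet()) {
            // replace(key, old, new) succeeds only if the entry is still "Valid",
            // so a request is never picked up twice.
            if (!statusByUuid.replace(uuid, "Valid", "Started and Running")) {
                continue;
            }
            boolean succeeded;
            try {
                succeeded = runQuery.test(uuid);
            } catch (RuntimeException e) {
                succeeded = false;
            }
            statusByUuid.put(uuid, succeeded ? "Complete" : "Failure");
        }
    }

    /** Runs tick() once per second; the caller shuts the executor down. */
    ScheduledExecutorService start() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::tick, 0, 1, TimeUnit.SECONDS);
        return executor;
    }
}
```

In a Spring Boot application the same `tick` method could instead be annotated with `@Scheduled(fixedRate = 1000)`. Whether this repository does so is not stated here.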
# Technologies used
- Spring Boot
- JDK 8
- Hive Query Language
- Angular
- TypeScript
- HTML/CSS
- Apache Hadoop
- Apache Hive
- Maven
- Spring Boot Swagger UI
# How to run?
- On Windows
  - Run server, using Maven: `.\mvnw spring-boot:run`
  - Run client, using ng: `ng serve`
# Swagger Documentation
- hit `localhost:8080/swagger-ui.html`
# Acknowledgement
Thanks to Ishita and Parnika for contributing the robust validation and query methods and other REST APIs.

Table of contents generated with markdown-toc