Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/narius2030/sakila-datawarehouse-analysis
Implement a Hive data warehouse to store meaningful data, apply Machine Learning like Clustering or Regression for dealing with business problems
apache-hadoop apache-hive data-analysis etl-pipeline hiveql machine-learning statistics
- Host: GitHub
- URL: https://github.com/narius2030/sakila-datawarehouse-analysis
- Owner: Narius2030
- Created: 2024-04-22T17:15:47.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-06-13T15:11:41.000Z (8 months ago)
- Last Synced: 2024-12-14T13:36:25.643Z (about 1 month ago)
- Topics: apache-hadoop, apache-hive, data-analysis, etl-pipeline, hiveql, machine-learning, statistics
- Language: Jupyter Notebook
- Homepage:
- Size: 24.9 MB
- Stars: 1
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Introduction
Data set context: Sakila contains data about movie rentals and payment transactions, along with information about films and customers.

**Goal:** Build a data warehouse with Apache Hive by extracting, transforming, and loading (ETL) the data into Dimension and Fact tables. The warehouse supports analysis of business operations and helps improve business strategies to increase profit.
### 1. Customer Segmentation
Customer segmentation groups customers by shared behavioral characteristics, forming clusters in the data. Marketing campaigns and incentives can then be targeted at the appropriate customer groups.
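Behavioral characteristics of this kind are often summarized per customer as recency, frequency and monetary (RFM) values. The following is a minimal sketch of that computation in plain Python, not the repository's actual pipeline: the `payments` records, the `rfm_features` helper and the snapshot date are all hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical payment records: (customer_id, payment_date, amount).
payments = [
    (1, date(2005, 8, 1), 4.99),
    (1, date(2005, 8, 20), 2.99),
    (2, date(2005, 6, 15), 0.99),
]


def rfm_features(payments, snapshot):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_seen, frequency, monetary = {}, {}, {}
    for customer_id, paid_on, amount in payments:
        # Most recent payment date seen so far for this customer.
        last_seen[customer_id] = max(last_seen.get(customer_id, paid_on), paid_on)
        # Number of payments and total amount spent.
        frequency[customer_id] = frequency.get(customer_id, 0) + 1
        monetary[customer_id] = monetary.get(customer_id, 0.0) + amount
    return {
        cid: ((snapshot - last_seen[cid]).days, frequency[cid], round(monetary[cid], 2))
        for cid in last_seen
    }


# The snapshot date is an assumed "today" used to measure recency.
features = rfm_features(payments, snapshot=date(2005, 9, 1))
```

Each customer ends up with one three-value feature vector, which is the kind of input a clustering model can work on after scaling.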
- Clustering uses the `K-Means` algorithm; the key attributes in this model are `recency`, `frequency` and `monetary`
- The model is evaluated with `SSE` (the sum of squared distances to the cluster centers) and the `Silhouette Coefficient`. These measures guide the choice of the number of clusters so that the data separates more clearly

### 2. Film Inventory Analysis
The goal is to track the inventory quantity and revenue of each film in the film warehouse, so that the import and export of inventory can be adjusted accordingly.
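As a rough illustration of this kind of tracking, the sketch below aggregates copies, rentals and revenue per film in plain Python. It assumes toy rows shaped like Sakila's `inventory`, `rental` and `payment` tables (the values are made up) and a hypothetical `film_inventory_summary` helper; the repository itself builds this analysis in Hive.

```python
from collections import defaultdict

# Toy rows shaped like Sakila tables; all values are made up for illustration.
inventory = [(1, 10), (2, 10), (3, 20)]              # (inventory_id, film_id)
rentals = [(100, 1), (101, 2), (102, 3), (103, 1)]   # (rental_id, inventory_id)
payments = [(100, 2.99), (101, 4.99), (102, 0.99), (103, 2.99)]  # (rental_id, amount)


def film_inventory_summary(inventory, rentals, payments):
    """Return {film_id: {"copies": int, "rentals": int, "revenue": float}}."""
    # Lookup tables that play the role of the joins between the three tables.
    film_of_item = {inv_id: film_id for inv_id, film_id in inventory}
    item_of_rental = {rental_id: inv_id for rental_id, inv_id in rentals}

    summary = defaultdict(lambda: {"copies": 0, "rentals": 0, "revenue": 0.0})
    for _, film_id in inventory:
        summary[film_id]["copies"] += 1
    for _, inv_id in rentals:
        summary[film_of_item[inv_id]]["rentals"] += 1
    for rental_id, amount in payments:
        film_id = film_of_item[item_of_rental[rental_id]]
        summary[film_id]["revenue"] += amount
    # Round revenue to cents to hide floating-point noise.
    return {film: {**s, "revenue": round(s["revenue"], 2)} for film, s in summary.items()}


summary = film_inventory_summary(inventory, rentals, payments)
```

A film with many copies but few rentals and little revenue would be a candidate for reducing stock, and the reverse for increasing it.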
## Integration
```python
from src.hiveconnect import hiveapp

username = '***'  # your Windows username

# Create dimension tables
hiveapp.CreateTableDimRental(username=username)
hiveapp.CreateTableDimCustomer(username=username)
DimInventory = hiveapp.CreateDimInventory(username)
DimDate = hiveapp.CreateTableDimDate(username)

# Load source data to stages (csv) in preprocessing folder
# Load csv stages to dimension tables
hiveapp.LoadData('dimRental.txt', 'dim_rental', username=username)
hiveapp.LoadData('dimCustomer.csv', 'dim_customer', username=username)
Load_data_to_DimInventory = hiveapp.LoadData("dimInventory.csv", "DimInventory", username)
Load_data_to_DimDate = hiveapp.LoadData("dimDate.csv", "DimDate", username)

# Create and Integrate fact table
hiveapp.CreateTableFactSegment(username=username)
hiveapp.IntegrateFactSegment(username=username)

# Create and Integrate data to Fact Inventory Film
Fact_Inventory_Analysis_TextFile = hiveapp.CreateTableFact_Inventory_Analysis_TextFile(username)
Load_data_to_Fact_Inventory_TextFile = hiveapp.LoadData("Fact_Inventory_Analysis.csv", 'Fact_Inventory_Analysis_TextFile', username)
Fact_Inventory_Analysis_ORC = hiveapp.CreateTableFact_Inventory_Analysis_ORC(username)
```

## Connect Apache Hive on Python
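A minimal sketch of querying Hive from Python once the prerequisites described in this section are in place. It assumes the third-party PyHive package and a HiveServer2 instance listening on `localhost:10000`; neither is confirmed by this README, and `src.hiveconnect` may connect differently. The helper names `hive_connection_params` and `run_query` are hypothetical.

```python
def hive_connection_params(username, host="localhost", port=10000, database="default"):
    """Collect keyword arguments for pyhive.hive.Connection (defaults are assumptions)."""
    return {"host": host, "port": port, "username": username, "database": database}


def run_query(sql, params):
    """Run one statement on HiveServer2 and return all result rows."""
    # Imported lazily so the module loads even where PyHive is not installed.
    from pyhive import hive  # third-party package, assumed to be available

    conn = hive.Connection(**params)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


params = hive_connection_params(username="***")  # replace with your OS username
# rows = run_query("SHOW TABLES", params)  # requires a running HiveServer2
```

Keeping the parameters separate from the connection call makes it easy to point the same code at another host or database without touching the query logic.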
> **Note:** Edit Hadoop's `core-site.xml` file, add the proxy-user configuration for your user, then save and close the file
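```xml
<!-- USERNAME is a placeholder: replace it with your own user name -->
<property>
  <name>hadoop.proxyuser.USERNAME.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.USERNAME.groups</name>
  <value>*</value>
</property>
```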
> **Note:** Start HiveServer2 before connecting
```
hive --service hiveserver2 start
```

## Schemas
* Workflow:
![Hive_App-Work-flow](https://github.com/Narius2030/Data-Mining-with-ApacheHive/assets/94912102/d6051d77-679b-4405-8471-5b4b80183381)
* Galaxy Schema of Data Warehouse:
![Hive_App-Hive Architecture](https://github.com/Narius2030/Data-Mining-with-ApacheHive/assets/94912102/81dc04fd-2387-4cce-962f-a5868adc8cab)