https://github.com/benitomartin/de-hotel-reviews
Data Engineering Hotel Reviews
cicd data-engineering dbt gcp jupyter-notebook looker prefect python spark sql terraform
Data Engineering Hotel Reviews
- Host: GitHub
- URL: https://github.com/benitomartin/de-hotel-reviews
- Owner: benitomartin
- Created: 2023-08-06T17:56:58.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-20T18:00:32.000Z (9 months ago)
- Last Synced: 2024-11-08T10:10:04.060Z (about 2 months ago)
- Topics: cicd, data-engineering, dbt, gcp, jupyter-notebook, looker, prefect, python, spark, sql, terraform
- Language: Python
- Homepage:
- Size: 116 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Engineering Hotel Reviews
This is a personal data engineering project based on a hotel reviews [Kaggle](https://www.kaggle.com/datasets/jiashenliu/515k-hotel-reviews-data-in-europe) dataset.
Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉
## Tech Stack and Tools
![Visual Studio Code](https://img.shields.io/badge/Visual%20Studio%20Code-0078d7.svg?style=for-the-badge&logo=visual-studio-code&logoColor=white)
![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)
![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)
![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)
![Anaconda](https://img.shields.io/badge/Anaconda-%2344A833.svg?style=for-the-badge&logo=anaconda&logoColor=white)
![Apache Spark](https://img.shields.io/badge/Apache%20Spark-E25A1C.svg?style=for-the-badge&logo=Apache-Spark&logoColor=white)
![Prefect](https://img.shields.io/badge/Prefect-024DFD.svg?style=for-the-badge&logo=Prefect&logoColor=white)
![dbt](https://img.shields.io/badge/dbt-FF694B.svg?style=for-the-badge&logo=dbt&logoColor=white)
![Linux](https://img.shields.io/badge/Linux-FCC624?style=for-the-badge&logo=linux&logoColor=white)
![Ubuntu](https://img.shields.io/badge/Ubuntu-E95420?style=for-the-badge&logo=ubuntu&logoColor=white)
![Google Cloud](https://img.shields.io/badge/GoogleCloud-%234285F4.svg?style=for-the-badge&logo=google-cloud&logoColor=white)
![Looker Studio](https://img.shields.io/badge/Looker-4285F4.svg?style=for-the-badge&logo=Looker&logoColor=white)
![Terraform](https://img.shields.io/badge/terraform-%235835CC.svg?style=for-the-badge&logo=terraform&logoColor=white)
![Git](https://img.shields.io/badge/git-%23F05033.svg?style=for-the-badge&logo=git&logoColor=white)

* Data Analysis & Exploration: **SQL/Python**
* Cloud: **Google Cloud Platform**
* Data Lake: **Google Cloud Storage**
* Data Warehouse: **BigQuery**
* Infrastructure as Code (IaC): **Terraform**
* Workflow Orchestration: **Prefect**
* Distributed Processing: **Spark**
* Data Transformation: **dbt**
* Data Visualization: **Looker Studio**
* CI/CD: **Git**, **dbt**

## Project Structure
The project has been structured with the following folders and files:
* `.github`: contains the CI/CD files (GitHub Actions)
* `data`: raw dataset, saved parquet files and data processed using Spark
* `dbt`: data transformation and CI/CD pipeline using dbt
* `flows`: workflow orchestration pipeline
* `images`: printouts of results
* `looker`: reports from Looker Studio
* `notebooks`: EDA performed at the beginning of the project to establish a baseline
* `spark`: batch processing pipeline using Spark
* `terraform`: IaC stream-based pipeline infrastructure in GCP using Terraform
* `Makefile`: set of execution tasks
* `.pre-commit-config.yaml`: pre-commit configuration file
* `pre-commit.md`: readme file of the pre-commit hooks
* `pyproject.toml`: linting and formatting configuration
* `requirements.txt`: project requirements

## Project Description
The dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/jiashenliu/515k-hotel-reviews-data-in-europe) and contains various columns with hotel details and reviews from 6 countries (Austria, France, Italy, Netherlands, Spain, UK). To prepare the data, an **Exploratory Data Analysis** was conducted. The following actions are performed using either pandas or Spark to obtain a clean dataset:
* Remove rows with NaN
* Remove duplicates
* Create a new column with the country name

Afterwards, a subset of columns is selected and the final clean data are ingested into a GCP bucket and BigQuery. This is done using **Prefect** (see [flows](./flows) folder), **dbt** (see [dbt](./dbt) folder) or **Spark** (see [spark](./spark) folder).
Prefect Data Ingestion
dbt Data Ingestion
Spark Data Ingestion
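
The three cleaning steps listed above can be illustrated with a minimal, dependency-free Python sketch. It is not the repository's code (the project uses pandas or Spark for this). The column name `Hotel_Address`, the function name `clean_reviews`, the `Country` column, and the idea that the country can be read from the end of the address are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed mapping from the end of an address to a country label.
COUNTRY_SUFFIXES = {
    "Austria": "Austria",
    "France": "France",
    "Italy": "Italy",
    "Netherlands": "Netherlands",
    "Spain": "Spain",
    "United Kingdom": "UK",
}

Row = Dict[str, Optional[str]]


def clean_reviews(rows: List[Row], address_col: str = "Hotel_Address") -> List[Row]:
    """Drop incomplete rows, drop exact duplicates, and add a Country column."""
    cleaned: List[Row] = []
    seen = set()
    for row in rows:
        # 1. Remove rows with missing values (None or "" stands in for NaN here).
        if any(value is None or value == "" for value in row.values()):
            continue
        # 2. Remove exact duplicates, keeping the first occurrence.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        # 3. Create a new column with the country name (None if no suffix matches).
        address = row[address_col]
        country = next(
            (label for suffix, label in COUNTRY_SUFFIXES.items()
             if address.endswith(suffix)),
            None,
        )
        cleaned.append({**row, "Country": country})
    return cleaned


# Made-up sample rows, not taken from the dataset.
sample = [
    {"Hotel_Address": "1 Example Street London United Kingdom", "Hotel_Name": "A"},
    {"Hotel_Address": "1 Example Street London United Kingdom", "Hotel_Name": "A"},
    {"Hotel_Address": "2 Example Road Paris France", "Hotel_Name": None},
    {"Hotel_Address": "3 Example Platz Vienna Austria", "Hotel_Name": "C"},
]
print([row["Country"] for row in clean_reviews(sample)])  # → ['UK', 'Austria']
```

With pandas, the first two steps would typically correspond to `DataFrame.dropna()` and `DataFrame.drop_duplicates()`. See the `notebooks` and `spark` folders for the actual implementation.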
## Visualization
## CI/CD
Finally, to streamline the development process, a fully automated **CI/CD** pipeline was created using both GitHub Actions and dbt:
dbt CI/CD
GitHub Actions CI/CD
## Project Set Up
The Python version used for this project is Python 3.9.
1. Clone the repo (or download it as zip):
```bash
git clone https://github.com/benitomartin/de-hotel-reviews.git
```

2. Create a virtual environment named `main-env` using Conda with Python version 3.9:
```bash
conda create -n main-env python=3.9
conda activate main-env
```

3. Install the project dependencies from `requirements.txt`:
```bash
pip install -r requirements.txt
# or, using the Makefile
make install
```

4. Install Terraform:
```bash
conda install -c conda-forge terraform
```

Each project folder contains a **README.md** file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a **GCP Account**, credentials, and proper **IAM** roles are necessary for the scripts to function correctly. The following IAM roles have been used for this project:
* BigQuery Admin
* BigQuery Data Editor
* BigQuery Job User
* BigQuery User
* Dataproc Administrator
* Storage Admin
* Storage Object Admin
* Storage Object Creator
* Storage Object Viewer
* Viewer

## Best Practices
The following best practices have been implemented:
* :white_check_mark: Makefile
* :white_check_mark: CI/CD pipeline
* :white_check_mark: Linter and code formatter
* :white_check_mark: Pre-commit hooks