https://github.com/nataliabeltranarg/nosql-dataarchitecture-spark

Implementing core components of a data-driven architecture using Spark: Data Management and Data Analysis Backbones with structured zones in a data lake and analytical capabilities

data-science dataarchitecture datalake datamanagement java-8 javajdk pyspark spark

Last synced: 4 months ago

Host: GitHub
URL: https://github.com/nataliabeltranarg/nosql-dataarchitecture-spark
Owner: nataliabeltranarg
Created: 2024-08-14T09:15:42.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-08-14T11:32:32.000Z (10 months ago)
Last Synced: 2025-02-02T01:11:28.514Z (4 months ago)
Topics: data-science, dataarchitecture, datalake, datamanagement, java-8, javajdk, pyspark, spark
Language: Jupyter Notebook
Homepage:
Size: 1.4 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Big Data Management: Data-Driven Architecture Project using Spark
This repository contains the implementation of two critical backbones of a data-driven architecture: the Data Management Backbone and the Data Analysis Backbone. The project involves setting up a structured data lake with defined zones and performing either descriptive or predictive analysis.

## Project Overview
This project focuses on creating a data-driven architecture using Apache Spark. It involves setting up a data lake with structured zones on the local file system, processing raw data, and performing analysis.

## Environment Setup
The following guide walks through setting up PySpark on macOS (for Windows setup, see: https://www.machinelearningplus.com/pyspark/install-pyspark-on-windows/).

**Spark Mac**
1. Open terminal
2. Run the following command (make sure Homebrew is installed first):
```brew install openjdk```
3. To verify the installation, run the following commands:
```
java -version
whereis java
```
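4. Set the `JAVA_HOME` environment variable by adding the following line to your shell profile (e.g., ~/.bashrc or ~/.zshrc):
```
export JAVA_HOME=$(/usr/libexec/java_home)
```
Then reload the profile (use ~/.zshrc if that is your shell profile):
```
source ~/.bashrc
```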
5. Install Apache Spark: ```brew install apache-spark```
-> for path info, run ```brew info apache-spark```
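6. Set the environment variables in your shell profile (replace ***version*** with the installed Spark version; on Apple Silicon, Homebrew installs under /opt/homebrew instead of /usr/local), then reload it:
```
export SPARK_HOME=/usr/local/Cellar/apache-spark/<version>/libexec
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
source ~/.bashrc
```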
7. Install the PySpark Python package and check the version:
```
pip install pyspark
pyspark --version
```
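
As a quick sanity check after the steps above, the following sketch reports which prerequisites are visible to Python. The function names `check_environment` and `spark_smoke_test` are hypothetical helpers and not part of this repository; the smoke test assumes `pyspark` and Java are already installed.

```python
import os
import shutil


def check_environment(env=None):
    """Report which prerequisites from the setup steps are visible to Python."""
    env = os.environ if env is None else env
    return {
        "java_on_path": shutil.which("java", path=env.get("PATH")) is not None,
        "JAVA_HOME": env.get("JAVA_HOME"),
        "SPARK_HOME": env.get("SPARK_HOME"),
        "PYSPARK_PYTHON": env.get("PYSPARK_PYTHON"),
    }


def spark_smoke_test():
    """Start a local Spark session and count the rows of a five-row range."""
    # Imported lazily so check_environment() still works without pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
    try:
        return spark.range(5).count()
    finally:
        spark.stop()
```

Calling `check_environment()` with no arguments inspects the current shell environment; a `None` value for `JAVA_HOME` or `SPARK_HOME` suggests the profile was not reloaded.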

## Data Management Backbone
**Landing Zone**

Stores raw data ingested into the data lake in a structured or semi-structured format. This includes data directly extracted from source systems with minimal transformation.
- *Implementation:* In a real-world scenario this zone would live in a distributed file system; for this project it is implemented on the local file system.
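
A minimal sketch of landing-zone ingestion on a local file system, assuming raw files are simply copied without being parsed. The helper name `ingest_to_landing` is hypothetical and not taken from the repository's notebook.

```python
import shutil
from pathlib import Path


def ingest_to_landing(source_file, landing_root, dataset):
    """Copy a raw file unchanged into <landing_root>/<dataset>/."""
    target_dir = Path(landing_root) / dataset
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 preserves file metadata; the content is not parsed or transformed.
    return Path(shutil.copy2(source_file, target_dir))
```

Keeping the raw copy untouched means later zones can always be rebuilt from the Landing Zone.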

**Formatted Zone**

Stores data in a standardized format according to a canonical data model. Data is potentially enriched and in a consumption-ready form.
- *Implementation:* Implemented using Parquet files for efficient storage and schema enforcement on the local file system.
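
The conversion from raw files to Parquet could look roughly like the sketch below. It assumes the raw data is CSV with a header row, which the README does not state; the folder names come from the repository tree, while `formatted_path` and `landing_to_formatted` are hypothetical helpers. The `spark` argument is an existing `SparkSession`.

```python
from pathlib import Path

# Landing Zone folder -> Formatted Zone folder, as listed in the repository tree.
ZONE_NAMES = {
    "cultural-sites": "CulturalSites",
    "income": "Income",
    "price_opendata": "PriceOpenData",
}


def formatted_path(root, landing_name):
    """Return the Formatted Zone directory for a Landing Zone dataset."""
    return Path(root) / "FormattedZone" / ZONE_NAMES[landing_name]


def landing_to_formatted(spark, root, landing_name):
    """Read one raw dataset and rewrite it as Parquet (assumes CSV input with a header)."""
    source = Path(root) / "LandingZone" / landing_name
    target = formatted_path(root, landing_name)
    df = spark.read.option("header", True).option("inferSchema", True).csv(str(source))
    # Parquet stores the schema with the data, so readers get typed columns back.
    df.write.mode("overwrite").parquet(str(target))
    return target
```

Schema inference is used here only for brevity; an explicit schema is usually the safer choice when the canonical data model is known.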

**Exploitation Zone**

Contains processed and refined data optimized for analysis, such as features and KPIs.
- *Implementation:* Implemented using Parquet and CSV files for efficient storage on the local file system.
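
As an illustration of how a combined dataset such as `Price_Income` might be produced, the sketch below joins two Formatted Zone DataFrames and writes the result in both formats. The helper name, the join keys, and the `parquet`/`csv` subfolders are assumptions, not details confirmed by the repository.

```python
def build_price_income(price_df, income_df, keys, out_dir):
    """Join two PySpark DataFrames and persist the result as Parquet and CSV."""
    # Inner join keeps only rows present in both datasets for the given keys.
    joined = price_df.join(income_df, on=keys, how="inner")
    joined.write.mode("overwrite").parquet(f"{out_dir}/parquet")
    joined.write.mode("overwrite").option("header", True).csv(f"{out_dir}/csv")
    return joined
```

With an active session, the inputs could be loaded via `spark.read.parquet("FormattedZone/PriceOpenData")` and `spark.read.parquet("FormattedZone/Income")`, and `keys` would be whichever columns the two datasets share.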

## Data Analysis Backbone
**Descriptive Analysis and Dashboarding**

***Descriptive Analysis:*** Performed exploratory data analysis (EDA) on the data in the Exploitation Zone to summarize and understand the data.

***Dashboarding:*** Created interactive dashboards using tools like Tableau, Power BI, or Jupyter Notebooks with matplotlib/seaborn.
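
For a small CSV export from the Exploitation Zone, a basic descriptive summary can be sketched with the standard library alone. `summarize_column` is a hypothetical helper that expects a single CSV file with a header row; for larger data, Spark's `DataFrame.describe()` serves a similar purpose.

```python
import csv
import statistics


def summarize_column(csv_path, column):
    """Return count, mean, min, and max for one numeric column of a CSV file."""
    with open(csv_path, newline="") as handle:
        # Skip empty cells so missing values do not break the float conversion.
        values = [float(row[column]) for row in csv.DictReader(handle) if row[column]]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }
```

Note that Spark writes CSV output as a directory of part files, so the path passed here would be one of those part files rather than the output directory itself.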

## How to navigate the repository
```text
├── Documents
│   ├── BigData_Spark_notebook.ipynb
│   └── BigData_Spark_report.pdf
├── LandingZone
│   ├── cultural-sites
│   ├── income
│   └── price_opendata
├── FormattedZone
│   ├── CulturalSites
│   ├── Income
│   └── PriceOpenData
├── ExploitationZone
│   ├── CulturalSites
│   ├── Income
│   ├── Price_Income
│   └── PriceOpenData
└── README.md
```
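
To reproduce the zone layout shown in the tree on another machine, a sketch like the following creates the empty directory skeleton. The helper `create_zones` is hypothetical; the folder names are copied from the tree.

```python
from pathlib import Path

# Zone -> dataset folders, copied from the repository tree.
LAYOUT = {
    "LandingZone": ["cultural-sites", "income", "price_opendata"],
    "FormattedZone": ["CulturalSites", "Income", "PriceOpenData"],
    "ExploitationZone": ["CulturalSites", "Income", "Price_Income", "PriceOpenData"],
}


def create_zones(root):
    """Create every zone/dataset directory under root and return the paths."""
    created = []
    for zone, datasets in LAYOUT.items():
        for dataset in datasets:
            path = Path(root) / zone / dataset
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Running `create_zones(".")` from the repository root is safe to repeat, since existing directories are left as they are.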
