https://github.com/jose-zothner-meyer/retail-data-centralisation
In an effort to become more data-driven, your organisation would like to make its sales data accessible from one centralised location. Your first goal will be to produce a system that stores the current company data in a database so that it can be accessed from one centralised location and act as a single source of truth for sales data.
- Host: GitHub
- URL: https://github.com/jose-zothner-meyer/retail-data-centralisation
- Owner: jose-zothner-meyer
- License: mit
- Created: 2024-11-12T11:49:37.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-01-13T15:54:16.000Z (4 months ago)
- Last Synced: 2025-01-29T01:45:48.635Z (3 months ago)
- Language: Jupyter Notebook
- Size: 5.17 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Multinational Retail Data Centralisation Project
## **Context**
In today's data-driven landscape, the organisation aims to consolidate its sales data into a **single centralised database** to serve as the **single source of truth** for all sales-related analysis. This initiative will improve data accessibility, consistency, and reliability across the organisation, enabling better decision-making and insights.

This project involves creating a system that:
- Centralises retail sales data from multiple sources.
- Cleanses and structures the data for efficient storage and retrieval.
- Implements a **star-schema** database model to enable advanced querying and analysis.

---
## **Aim**
The goal is to explore the end-to-end process of sourcing, cleaning, and centralising retail sales data into a database. By completing this project, I will gain hands-on experience in building a data pipeline that can extract, transform, and load (ETL) data from diverse sources into a structured database; a sketch of such a pipeline follows the objectives below.

### **Objectives**
1. Extract sales data from multiple sources:
- **AWS RDS**, **AWS S3**, and **PDF files**.
2. Clean and transform the data to ensure accuracy and consistency.
3. Push the cleansed data into a database as structured tables.
4. Design and implement a **star-schema** database model for querying.
5. Query the database to perform **sales analysis** and extract actionable insights.
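To make the pipeline's shape concrete, here is a minimal sketch of how the extract, clean, and load steps could be wired together with OOP. The class, method, and table names (`DataExtractor`, `read_rds_table`, `legacy_users`, and so on) are illustrative assumptions, not the repository's actual API.

```python
# Minimal ETL sketch. All class, method, and table names are
# illustrative assumptions, not this repository's actual API.
import pandas as pd
from sqlalchemy import create_engine


class DataExtractor:
    """Pulls raw data out of a source system."""

    def read_rds_table(self, engine, table_name: str) -> pd.DataFrame:
        # Read a whole table from AWS RDS into a DataFrame.
        return pd.read_sql_table(table_name, engine)


class DataCleaning:
    """Applies first-pass cleaning rules before loading."""

    def clean_user_data(self, df: pd.DataFrame) -> pd.DataFrame:
        # Drop fully empty rows and exact duplicates.
        return df.dropna(how="all").drop_duplicates()


class DatabaseConnector:
    """Writes cleaned tables into the centralised database."""

    def upload_to_db(self, df: pd.DataFrame, table_name: str, engine) -> None:
        df.to_sql(table_name, engine, if_exists="replace", index=False)


if __name__ == "__main__":
    # Placeholder connection string and table names.
    engine = create_engine("postgresql://user:password@localhost:5432/sales_data")
    raw = DataExtractor().read_rds_table(engine, "legacy_users")
    clean = DataCleaning().clean_user_data(raw)
    DatabaseConnector().upload_to_db(clean, "dim_users", engine)
```

Keeping extraction, cleaning, and loading in separate classes means each source (RDS, S3, PDF) only needs a new extractor method, while the cleaning and upload logic stays reusable.

---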
## **Skills and Concepts**
### **Key Prerequisites**
- **Object-Oriented Programming (OOP)**:
- Designing classes, methods, and functions for modular and reusable code.
- **Pandas**:
- Reading, cleaning, and transforming data from various sources.
- Pushing data into a database.
- **AWS**:
- Using `boto3` for programmatic interaction with AWS services.
- Configuring AWS resources via the command-line interface (CLI).
- **SQL**:
- Creating database models (e.g., **star-schema**).
- Writing SQL queries for data extraction and analysis.

### **Other Prerequisites**
- **APIs**:
- Extracting store data programmatically.
- **Tabula**:
- Extracting structured data from PDF files (requires a Java installation and configuration); see the sketch after this list.
- **Data Formats**:
- Working with YAML, JSON, and CSV file types.
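As a taste of how these pieces fit together, the sketch below pulls tables out of a PDF with `tabula-py` and pushes the combined frame into PostgreSQL with pandas. The URL, connection string, and table name are placeholders, not values taken from this repository.

```python
# Sketch: PDF -> DataFrame -> PostgreSQL. The pdf_url, connection
# string, and table name below are placeholders.
import pandas as pd
import tabula
from sqlalchemy import create_engine

pdf_url = "https://example.com/card_details.pdf"  # placeholder

# tabula-py needs a local Java installation; with multiple_tables=True
# it returns one DataFrame per table found across the requested pages.
frames = tabula.read_pdf(pdf_url, pages="all", multiple_tables=True)
df = pd.concat(frames, ignore_index=True).drop_duplicates()

engine = create_engine("postgresql://user:password@localhost:5432/sales_data")
df.to_sql("dim_card_details", engine, if_exists="replace", index=False)
```

---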
## **Secrets Management Using YAML**
- To securely manage sensitive information such as API keys, S3 bucket credentials, and database connection details, this project uses **YAML files**.
- YAML files store secrets in a structured and readable format, making them ideal for configuration while maintaining security practices.
### **Use Cases for YAML in This Project**
1. **API Connection Keys**:
- Store API keys and tokens required for accessing external services (e.g., store data APIs).
- Example:
```yaml
api_keys:
  store_api_key: "YOUR_STORE_API_KEY"
```
2. **S3 Bucket Credentials**:
- Store AWS credentials (e.g., access key, secret key, and bucket name) to access data in S3.
- Example:
```yaml
aws:
  access_key: "YOUR_ACCESS_KEY"
  secret_key: "YOUR_SECRET_KEY"
  bucket_name: "YOUR_BUCKET_NAME"
```
3. **Database Connection Details**:
- Store connection strings and authentication details for the PostgreSQL database.
- Example:
```yaml
database:
  host: "YOUR_DATABASE_HOST"
  user: "YOUR_DATABASE_USER"
  password: "YOUR_DATABASE_PASSWORD"
  db_name: "YOUR_DATABASE_NAME"
```

### **Benefits of Using YAML for Secrets Management**
- **Separation of Code and Secrets**:
- Keeps sensitive information separate from the application logic, improving security and maintainability.
- **Environment-Specific Configurations**:
- Makes it easy to switch configurations between development, staging, and production environments.
- **Human-Readable Format**:
- YAML's syntax is easy to read and edit, making it accessible for both developers and operations teams.

### **Security Note**
- Always add your YAML files containing secrets to `.gitignore` to prevent them from being accidentally pushed to version control systems like GitHub.
- Use tools like **AWS Secrets Manager** or **Vault** for enhanced secrets management in production environments.
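Tying the examples above together, a loader along these lines could read the secrets file and open a database connection. The file name `db_creds.yaml` and the key layout mirror the examples above and are assumptions; the port is assumed to be PostgreSQL's default.

```python
# Sketch: load secrets from YAML and build a SQLAlchemy engine.
# The file name and key layout follow the examples above and are
# assumptions, not this repository's exact configuration.
import yaml
from sqlalchemy import create_engine


def load_secrets(path: str = "db_creds.yaml") -> dict:
    # safe_load refuses arbitrary YAML tags, unlike yaml.load.
    with open(path) as f:
        return yaml.safe_load(f)


secrets = load_secrets()
db = secrets["database"]
engine = create_engine(
    # Assumes the default PostgreSQL port, 5432.
    f"postgresql://{db['user']}:{db['password']}@{db['host']}:5432/{db['db_name']}"
)
```

---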
## **Software Requirements**
- **VSCode**: Integrated Development Environment (IDE) for writing and testing code.
- **Conda**: Package manager for managing dependencies and creating isolated environments.
- **pgAdmin4** or **SQLTools**: Database interface for PostgreSQL.

---
## **Outcome**
By completing this project, I will have:
- Built a robust ETL pipeline for retail sales data.
- Centralised the data into a single database for analysis.
- Designed a scalable **star-schema** model for efficient querying.
- Gained practical experience in data engineering and analytics, bridging raw data to actionable insights.
- Implemented secure **secrets management** for sensitive configurations using YAML.
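As an illustration of the analysis stage the star schema enables, the query below joins a hypothetical fact table to date and product dimensions to total sales by year. The table and column names (`orders_table`, `dim_date_times`, `dim_products`, and their fields) are invented for this sketch, not taken from the repository.

```python
# Sketch: querying the star schema with pandas. All table and column
# names below are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

query = """
    SELECT d.year,
           SUM(o.product_quantity * p.product_price) AS total_sales
    FROM orders_table AS o
    JOIN dim_date_times AS d ON o.date_uuid = d.date_uuid
    JOIN dim_products   AS p ON o.product_code = p.product_code
    GROUP BY d.year
    ORDER BY total_sales DESC;
"""

sales_by_year = pd.read_sql(query, engine)
print(sales_by_year.head())
```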