https://github.com/jose-zothner-meyer/retail-data-centralisation
  
  
In an effort to become more data-driven, your organisation would like to make its sales data accessible from one centralised location. Your first goal will be to produce a system that stores the current company data in a database so that it's accessed from one centralised location and acts as a single source of truth for sales data.
- Host: GitHub
- URL: https://github.com/jose-zothner-meyer/retail-data-centralisation
- Owner: jose-zothner-meyer
- License: mit
- Created: 2024-11-12T11:49:37.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-01-13T15:54:16.000Z (9 months ago)
- Last Synced: 2025-01-29T01:45:48.635Z (9 months ago)
- Language: Jupyter Notebook
- Size: 5.17 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
 
# Multinational Retail Data Centralisation Project
## **Context**
In today's data-driven landscape, the organisation aims to consolidate its sales data into a **single centralised database** to serve as the **single source of truth** for all sales-related analysis. This initiative will improve data accessibility, consistency, and reliability across the organisation, enabling better decision-making and insights.
This project involves creating a system that:
- Centralises retail sales data from multiple sources.
- Cleanses and structures the data for efficient storage and retrieval.
- Implements a **star-schema** database model to enable advanced querying and analysis.
---
## **Aim**
The goal is to explore the end-to-end process of sourcing, cleaning, and centralising retail sales data into a database. By completing this project, I will gain hands-on experience in building a data pipeline that can extract, transform, and load (ETL) data from diverse sources into a structured database.
### **Objectives**
1. Extract sales data from multiple sources:
   - **AWS RDS**, **AWS S3**, and **PDF files**.
2. Clean and transform the data to ensure accuracy and consistency.
3. Push the cleansed data into a database as structured tables.
4. Design and implement a **star-schema** database model for querying.
5. Query the database to perform **sales analysis** and extract actionable insights.
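
The objectives above amount to an extract, clean, and load loop. The following is a minimal sketch of that loop using only the Python standard library. The class names, the sample CSV, the `dim_users` table, and the in-memory SQLite database are all illustrative assumptions; the project itself uses pandas and PostgreSQL, and this is not the repository's actual code.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical sample standing in for data pulled from RDS, S3, or a PDF.
RAW_CSV = """user_id,country_code,join_date
1,GB,2021-03-04
2,NULL,2020-11-30
3,DE,not-a-date
4,US,2019-07-15
"""


class DataExtractor:
    """Reads raw rows from a source (here: CSV text)."""

    def read_csv_text(self, text):
        return list(csv.DictReader(io.StringIO(text)))


class DataCleaner:
    """Drops rows with "NULL" placeholders or unparseable dates."""

    def clean_users(self, rows):
        cleaned = []
        for row in rows:
            if "NULL" in row.values():
                continue
            try:
                datetime.strptime(row["join_date"], "%Y-%m-%d")
            except ValueError:
                continue
            cleaned.append(
                {
                    "user_id": int(row["user_id"]),
                    "country_code": row["country_code"],
                    "join_date": row["join_date"],
                }
            )
        return cleaned


class DatabaseConnector:
    """Loads cleaned rows into a table."""

    def __init__(self, connection):
        self.connection = connection

    def upload(self, rows, table_name):
        self.connection.execute(
            f"CREATE TABLE {table_name} "
            "(user_id INTEGER PRIMARY KEY, country_code TEXT, join_date TEXT)"
        )
        self.connection.executemany(
            f"INSERT INTO {table_name} "
            "VALUES (:user_id, :country_code, :join_date)",
            rows,
        )
        self.connection.commit()


def run_pipeline(connection):
    """Extract, clean, and load the sample; return the number of rows kept."""
    rows = DataExtractor().read_csv_text(RAW_CSV)
    cleaned = DataCleaner().clean_users(rows)
    DatabaseConnector(connection).upload(cleaned, "dim_users")
    return len(cleaned)
```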
---
## **Skills and Concepts**
### **Key Prerequisites**
- **Object-Oriented Programming (OOP)**:
  - Designing classes, methods, and functions for modular and reusable code.
- **Pandas**:
  - Reading, cleaning, and transforming data from various sources.
  - Pushing data into a database.
- **AWS**:
  - Using `boto3` for programmatic interaction with AWS services.
  - Configuring AWS resources via the command-line interface (CLI).
- **SQL**:
  - Creating database models (e.g., **star-schema**).
  - Writing SQL queries for data extraction and analysis.
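
To make the star-schema idea concrete, here is a small sketch: one fact table of orders that references two dimension tables, followed by an analytical query that joins them. It runs on an in-memory SQLite database so that it is self-contained. The table names, column names, and sample rows are hypothetical and are not taken from the project's PostgreSQL schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_store (
    store_code   TEXT PRIMARY KEY,
    country_code TEXT NOT NULL
);
CREATE TABLE dim_product (
    product_code TEXT PRIMARY KEY,
    unit_price   REAL NOT NULL
);
-- The fact table holds one row per order line and points at the dimensions.
CREATE TABLE fact_orders (
    order_id     INTEGER PRIMARY KEY,
    store_code   TEXT NOT NULL REFERENCES dim_store(store_code),
    product_code TEXT NOT NULL REFERENCES dim_product(product_code),
    quantity     INTEGER NOT NULL
);
"""


def build_star_schema(conn):
    """Create the tables and load a few hypothetical sample rows."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    with conn:  # commits the inserts on success
        conn.executemany(
            "INSERT INTO dim_store VALUES (?, ?)",
            [("GB-01", "GB"), ("DE-01", "DE")],
        )
        conn.executemany(
            "INSERT INTO dim_product VALUES (?, ?)",
            [("A1", 2.0), ("B2", 5.0)],
        )
        conn.executemany(
            "INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
            [(1, "GB-01", "A1", 3), (2, "GB-01", "B2", 1), (3, "DE-01", "A1", 2)],
        )


def sales_by_country(conn):
    """Total sales value per country, highest first."""
    query = """
        SELECT s.country_code, SUM(o.quantity * p.unit_price) AS total_sales
        FROM fact_orders AS o
        JOIN dim_store AS s ON s.store_code = o.store_code
        JOIN dim_product AS p ON p.product_code = o.product_code
        GROUP BY s.country_code
        ORDER BY total_sales DESC
    """
    return conn.execute(query).fetchall()
```

With the sample rows above, `sales_by_country` returns `[("GB", 11.0), ("DE", 4.0)]`.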
### **Other Prerequisites**
- **APIs**:
  - Extracting store data programmatically.
- **Tabula**:
  - Extracting structured data from PDF files (requires Java installation and configuration).
- **Data Formats**:
  - Working with YAML, JSON, and CSV file types.
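
As an illustration of extracting store data programmatically, the sketch below builds an authenticated request and decodes a JSON response using only the standard library. The base URL and the `x-api-key` header name are assumptions made for the example; the real endpoint and authentication scheme depend on the API being used.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the real store API URL.
BASE_URL = "https://api.example.com/stores"


def build_store_request(store_number, api_key):
    """Build an authenticated GET request for one store's details."""
    return urllib.request.Request(
        f"{BASE_URL}/{store_number}",
        headers={"x-api-key": api_key},  # assumed header name
        method="GET",
    )


def fetch_store(store_number, api_key, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    request = build_store_request(store_number, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```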
---
## **Secrets Management Using YAML**
- To keep sensitive information such as API keys, S3 bucket credentials, and database connection details out of the source code, this project stores them in **YAML files**.
- YAML stores these values in a structured, readable format, which suits configuration. YAML files are plain text, so the security comes from keeping them out of version control, not from the format itself.
  
### **Use Cases for YAML in This Project**
1. **API Connection Keys**:
   - Store API keys and tokens required for accessing external services (e.g., store data APIs).
   - Example:
     ```yaml
     api_keys:
       store_api_key: "YOUR_STORE_API_KEY"
     ```
2. **S3 Bucket Credentials**:
   - Store the AWS credentials (access key and secret key) and the bucket name needed to access data in S3.
   - Example:
     ```yaml
     aws:
       access_key: "YOUR_ACCESS_KEY"
       secret_key: "YOUR_SECRET_KEY"
       bucket_name: "YOUR_BUCKET_NAME"
     ```
3. **Database Connection Details**:
   - Store the host, user, password, and database name needed to connect to the PostgreSQL database.
   - Example:
     ```yaml
     database:
       host: "YOUR_DATABASE_HOST"
       user: "YOUR_DATABASE_USER"
       password: "YOUR_DATABASE_PASSWORD"
       db_name: "YOUR_DATABASE_NAME"
     ```
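
One way to consume these files is sketched below, assuming the third-party PyYAML package is installed and that the file follows the `database` layout shown above. The function names and the default port are illustrative choices, not the project's actual code.

```python
from urllib.parse import quote


def load_credentials(path):
    """Parse a YAML credentials file into a dictionary (requires PyYAML)."""
    import yaml  # third-party: pip install pyyaml

    with open(path, "r", encoding="utf-8") as handle:
        # safe_load only builds plain Python types (dict, list, str, ...).
        return yaml.safe_load(handle)


def build_postgres_url(database, port=5432):
    """Build a postgresql:// URL from the `database` section of the file."""
    # Percent-encode user and password so characters like "@" stay unambiguous.
    user = quote(database["user"], safe="")
    password = quote(database["password"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{database['host']}:{port}/{database['db_name']}"
    )
```

For example, `build_postgres_url(load_credentials("db_creds.yaml")["database"])` would produce a URL that a PostgreSQL client library can accept, where `db_creds.yaml` is a hypothetical file name.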
### **Benefits of Using YAML for Secrets Management**
- **Separation of Code and Secrets**:
  - Keeps sensitive information separate from the application logic, improving security and maintainability.
- **Environment-Specific Configurations**:
  - Makes it easy to switch configurations between development, staging, and production environments.
- **Human-Readable Format**:
  - YAML's syntax is easy to read and edit, making it accessible for both developers and operations teams.
### **Security Note**
- Always add your YAML files containing secrets to `.gitignore` to prevent them from being accidentally pushed to version control systems like GitHub.
- Use tools like **AWS Secrets Manager** or **Vault** for enhanced secrets management in production environments.
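
For the production case, a secret can be read from AWS Secrets Manager with `boto3` and no local file. The sketch below assumes the secret is stored as a JSON string; the function name and the optional `client` parameter, which makes the function testable without AWS access, are illustrative.

```python
import json


def fetch_secret(secret_id, client=None):
    """Return a JSON secret from AWS Secrets Manager as a dictionary."""
    if client is None:
        import boto3  # third-party: pip install boto3

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as a JSON string, not binary data.
    return json.loads(response["SecretString"])
```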
---
## **Software Requirements**
- **VS Code**: Code editor for writing and testing code.
- **Conda**: Package and environment manager for installing dependencies and creating isolated environments.
- **pgAdmin4** or **SQLTools**: Database interface for PostgreSQL.
---
## **Outcome**
By completing this project, I will have:
- Built a robust ETL pipeline for retail sales data.
- Centralised the data into a single database for analysis.
- Designed a scalable **star-schema** model for efficient querying.
- Gained practical experience in data engineering and analytics, bridging raw data to actionable insights.
- Kept sensitive configuration out of the codebase by storing secrets in git-ignored YAML files.