Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sarathchandranpm/data-modelling-creating-a-star-schema
Build a star schema by splitting the original dataset into facts and dimension tables
https://github.com/sarathchandranpm/data-modelling-creating-a-star-schema
Last synced: about 2 months ago
JSON representation
Build a star schema by splitting the original dataset into facts and dimension tables
- Host: GitHub
- URL: https://github.com/sarathchandranpm/data-modelling-creating-a-star-schema
- Owner: SarathchandranPM
- Created: 2024-12-21T15:09:17.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-21T15:42:50.000Z (about 2 months ago)
- Last Synced: 2024-12-21T16:28:34.576Z (about 2 months ago)
- Size: 30.7 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Sales Data Model Project in Power BI - Part 1
## Overview
This part of the project demonstrates the implementation of a dimensional data model (primarily star schema with snowflake elements) in Power BI using a sales dataset. The model transforms a flat CSV file into an optimized dimensional structure, improving data analysis capabilities and reporting performance.## Data Source
The original dataset was provided as an Excel CSV file containing sales transactions with the following fields:
* ProductID, Date, CustomerID, CampaignID, Units, Product, Category, Segment
* ManufacturerID, Manufacturer, Unit Cost, Unit Price
* Customer details: ZipCode, Email, Name
* Geographic information: City, State, Region, District, Country## Data Model Architecture
![image](https://github.com/user-attachments/assets/54a66e65-240d-4a05-923b-e332c2ddb0b4)
### Fact Table
The central fact table **Sales** contains the core transaction metrics and foreign keys:
* Measures: Units, Unit Cost, Unit Price
* Foreign Keys: ProductID, CustomerID, Date
* Additional Fields: Date, Manufacturer### Dimension Tables
The model includes four dimension tables:#### 1. Dim Products
Contains product-related attributes:
* ProductID (Primary Key)
* Product
* Category
* Segment#### 2. Dim Customers
Stores customer information:
* CustomerID (Primary Key)
* First Name
* Last Name
* ZipCode (Foreign Key to Dim Geography)#### 3. Dim Geography
Contains location hierarchies:
* ZipCode (Primary Key)
* City
* State
* Region
* District
* Country*Note: This dimension introduces a snowflake element as it connects to the fact table through Dim Customers rather than directly.*
#### 4. Dim Date
A dedicated date dimension marked as the official date table:
* Date
* Day of Week
* Month
* Month Number
* Quarter
* Week Number
* Year## Model Characteristics
### Star Schema Implementation
The model primarily follows a star schema design with these key characteristics:
* Centralized fact table (Sales)
* Denormalized dimension tables
* Single join path between fact and most dimensions
* Simplified query paths for better performance### Snowflake Element
The geographic dimension (Dim Geography) introduces a snowflake element by:
* Connecting to the fact table through Dim Customers
* Using ZipCode as the linking field between Dim Customers and Dim Geography## Development Process
The project followed a structured approach to data modeling:1. **Conceptual Modeling**
* Identified core business entities
* Defined relationships between entities
* Determined fact and dimension candidates2. **Logical Modeling**
* Designed the star schema structure
* Defined primary and foreign keys
* Established table relationships3. **Physical Modeling**
* Implemented the model in Power BI
* Created appropriate data types
* Established relationships with correct cardinality
* Optimized for performance## Learning Outcomes
This project provided valuable insights into:
* Star schema design principles and implementation
* Data normalization importance and techniques
* Dimensional modeling best practices
* Power BI data modeling capabilities
* The balance between performance and model complexity# Sales and Budget Data Model Enhancement - Part 2
## Overview:
This phase of the project extends the existing sales data model by incorporating budget data, demonstrating advanced data modeling techniques to handle many-to-many relationships and complex dimensional hierarchies.## New Data Structure
### New Fact Table
**FactBudget** was introduced with the following fields:
* Value (measure)
* Date (foreign key)
* Scenario
* Original Category and Segment fields (later replaced with CatSeg Key)## Data Modeling Challenges and Solutions
### Handling Many-to-Many Relationships
* Identified a potential many-to-many relationship between FactBudget and product dimension tables through Category
* Implemented a bridge table solution to resolve the many-to-many relationship
* Created a new dimension table (Dim CatSeg) to establish proper relationships### Creation of Bridge Dimension
**Dim CatSeg** was created through the following steps:
* Extracted Category and Segment combinations from Dim Category
* Removed duplicates to ensure distinct Category-Segment pairs
* Created a surrogate key (CatSeg Key) using the index column technique
* Established as the central dimension for category-related attributes### Data Transformation Process
1. **Merge Operations**
* Merged Dim CatSeg with FactBudget to incorporate CatSeg Key
* Performed similar merge operation with Dim Product
* Based merge operations on Category and Segment columns2. **Data Cleanup**
* Removed redundant Category and Segment columns from FactBudget
* Cleaned up Dim Product by removing duplicate category information
* Maintained data integrity while reducing model complexity## Model Optimization
### Relationship Management
* Established proper relationships between:
* FactBudget and Dim Date through Date column
* FactBudget and Dim CatSeg through CatSeg Key
* Dim Product and Dim CatSeg through CatSeg Key
## Data Model Architecture
**Final Data Model:**
![image](https://github.com/user-attachments/assets/7d8e4451-9fc1-4686-94c8-4de4c2ab08f7)
**Sales table and related dimension tables:**
![image](https://github.com/user-attachments/assets/5a0eb62f-3f1b-4b74-8d38-7d1fdfe4fc07)
**FactBudget table and related dimension tables:**
![image](https://github.com/user-attachments/assets/5d8c4a3d-dcc5-43ed-b2c4-995d556be7dd)### Performance Considerations
* Implemented surrogate keys for efficient joins
* Removed redundant columns to reduce model size
* Optimized relationship cardinality for better query performance## Technical Skills Demonstrated
### Advanced Data Modeling
* Resolution of many-to-many relationships
* Creation and implementation of bridge tables
* Surrogate key generation and management### Data Transformation
* Complex merge operations
* Duplicate removal and data deduplication
* Column management and cleanup### Dimensional Modeling
* Bridge dimension design and implementation
* Proper key management
* Relationship optimization### Power BI Features Utilized
* Merge queries functionality
* Index column creation
* Relationship management
* Data model optimization techniques## Business Intelligence Best Practices
* Maintained dimensional modeling standards
* Implemented proper key management
* Ensured data integrity across the model
* Created efficient paths for data analysis## Results
The enhanced data model now supports:
* Integrated analysis of sales and budget data
* Proper handling of category-segment hierarchies
* Efficient querying across multiple fact tables
* Streamlined reporting capabilitiesThis phase of the project demonstrates advanced data modeling techniques and showcases the ability to handle complex business requirements while maintaining model efficiency and analytical capabilities.
## Tools Used
* Power BI Desktop
* Excel (source data)