https://github.com/jpotter80/notebook-examples
This repository demonstrates a systematic approach to cleaning and standardizing e-commerce product data using DuckDB. The notebook serves as a detailed walkthrough of our data cleaning methodology, showcasing how we handle common data quality challenges in e-commerce datasets.
- Host: GitHub
- URL: https://github.com/jpotter80/notebook-examples
- Owner: jpotter80
- Created: 2025-02-13T16:35:11.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-02-13T17:10:01.000Z (12 months ago)
- Last Synced: 2025-02-13T18:23:30.581Z (12 months ago)
- Topics: data-analysis, data-cleaning, jupyter-notebook
- Language: Jupyter Notebook
- Homepage:
- Size: 1.54 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# E-commerce Data Cleaning with DuckDB
## Data Quality Assessment and Cleaning Demonstration
This repository demonstrates a systematic approach to cleaning and standardizing e-commerce product data using DuckDB. The notebook serves as a detailed walkthrough of our data cleaning methodology, showcasing how we handle common data quality challenges in e-commerce datasets.
## Overview
This demonstration tackles a common scenario in e-commerce: consolidating product data from multiple sources while ensuring data quality and consistency. We use DuckDB, a high-performance analytical database, to process data directly from various file formats without intermediate transformations; a minimal loading sketch follows the file list below.
### Source Files
The demonstration uses synthetic data that represents common e-commerce data scenarios:
1. `main_catalog.csv`: Primary product catalog containing basic product information
   - SKUs, product names, categories, base prices, inventory levels
   - Common issues: inconsistent SKU formats, missing data
2. `inventory_update.xlsx`: Recent inventory updates in Excel format
   - SKUs, current inventory levels, last update timestamps
   - Common issues: conflicts with main catalog, different date formats
3. `price_list.json`: Latest pricing information in JSON format
   - SKUs, current prices
   - Common issues: inconsistent price formats (with/without currency symbols)
4. `category_mapping.parquet`: Category standardization mapping in Parquet format
   - SKUs, standardized categories and subcategories
   - Common issues: inconsistent capitalization, missing mappings
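
To make this concrete, here is a minimal loading sketch (run via DuckDB's SQL interface, e.g. `duckdb.sql(...)` in the notebook). The file names come from the list above; the staging table names are assumptions, and only the Excel source needs the `spatial` extension listed under Requirements.

```sql
-- CSV, JSON, and Parquet files can be queried in place; no import step needed
CREATE TABLE main_catalog     AS SELECT * FROM read_csv_auto('main_catalog.csv');
CREATE TABLE price_list       AS SELECT * FROM read_json_auto('price_list.json');
CREATE TABLE category_mapping AS SELECT * FROM read_parquet('category_mapping.parquet');

-- Excel requires the spatial extension's st_read (see Requirements)
INSTALL spatial;
LOAD spatial;
CREATE TABLE inventory_update AS SELECT * FROM st_read('inventory_update.xlsx');
```

Staging each source as a named table keeps every later cleaning step expressible as plain SQL.
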
## Data Cleaning Process
### 1. SKU Standardization
- Collects SKUs from all data sources
- Implements consistent formatting rules
- Maintains mapping between original and standardized SKUs
- Handles various format inconsistencies (see the sketch below)
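
A minimal sketch of this step, assuming each source exposes a `sku` column and a simple formatting rule (uppercase, separators stripped); the notebook's actual rules may differ.

```sql
-- Collect SKUs from every source, then keep the original alongside
-- a standardized form so downstream joins stay reversible
CREATE TABLE sku_map AS
SELECT
    sku AS original_sku,
    -- hypothetical rule: uppercase, drop anything that is not a letter or digit
    upper(regexp_replace(sku, '[^A-Za-z0-9]', '', 'g')) AS standard_sku
FROM (
    SELECT sku FROM main_catalog
    UNION SELECT sku FROM inventory_update
    UNION SELECT sku FROM price_list
    UNION SELECT sku FROM category_mapping
) AS all_skus;
```
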
### 2. Price Normalization
- Standardizes price formats across sources
- Removes currency symbols
- Converts to consistent decimal format
- Resolves conflicts between catalog and price list (sketched below)
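
Sketched below under two assumptions: prices arrive as text that may carry `$` signs or thousands separators, and the price list wins over the catalog when both have a value. The column names `price` and `base_price` are illustrative.

```sql
-- Strip currency symbols and commas, cast to a fixed-point type,
-- and prefer the newer price list over the catalog base price
SELECT
    s.standard_sku,
    coalesce(
        try_cast(regexp_replace(p.price,      '[$,]', '', 'g') AS DECIMAL(10, 2)),
        try_cast(regexp_replace(c.base_price, '[$,]', '', 'g') AS DECIMAL(10, 2))
    ) AS current_price
FROM sku_map s
LEFT JOIN price_list   p ON p.sku = s.original_sku
LEFT JOIN main_catalog c ON c.sku = s.original_sku;
```
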
### 3. Inventory Reconciliation
- Combines inventory data from multiple sources
- Implements "most recent update wins" logic
- Ensures non-negative inventory values
- Tracks inventory update timestamps (see the example below)
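
A sketch of the "most recent update wins" rule. The column names `inventory_level` and `updated_at` are illustrative, and the catalog, which carries no timestamp, is treated as the oldest source.

```sql
-- Stack both inventory sources, keep the newest row per SKU,
-- and clamp any negative quantity to zero
WITH stacked AS (
    SELECT sku, inventory_level, updated_at FROM inventory_update
    UNION ALL
    SELECT sku, inventory_level, NULL AS updated_at FROM main_catalog
), ranked AS (
    SELECT *, row_number() OVER (
        PARTITION BY sku ORDER BY updated_at DESC NULLS LAST
    ) AS rn
    FROM stacked
)
SELECT sku,
       greatest(inventory_level, 0) AS inventory_level,  -- non-negative guarantee
       updated_at
FROM ranked
WHERE rn = 1;
```
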
### 4. Category Standardization
- Implements consistent capitalization rules
- Resolves category hierarchies
- Handles missing subcategories
- Maintains source category mappings (see the sketch below)
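
A sketch of the capitalization and fallback logic, assuming `category`/`subcategory` columns in the mapping file and a raw `category` in the catalog; the title-case rule is written out with `upper`/`lower`/`substr`.

```sql
-- Title-case the mapped category, fall back to the raw catalog value,
-- and flag SKUs that have no subcategory mapping at all
SELECT
    s.standard_sku,
    upper(substr(coalesce(m.category, c.category), 1, 1)) ||
    lower(substr(coalesce(m.category, c.category), 2))    AS category,
    coalesce(m.subcategory, 'Uncategorized')               AS subcategory
FROM sku_map s
LEFT JOIN category_mapping m ON m.sku = s.original_sku
LEFT JOIN main_catalog     c ON c.sku = s.original_sku;
```
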
## Technical Implementation
### Key Technologies
- **DuckDB**: In-process analytical SQL database
- **Python**: Primary programming language
- **Jupyter Notebook**: Interactive development and documentation
- **SQL**: Data transformation and cleaning logic
### Notable Features
1. **Direct File Processing**
   - Reads CSV, Excel, JSON, and Parquet without intermediate steps
   - Reduces memory overhead and simplifies the pipeline
2. **SQL-First Approach**
   - Leverages SQL for complex data transformations
   - Maintains clarity and performance
3. **Quality Control**
   - Built-in verification steps
   - Detailed statistics at each stage
   - Data quality flags in the final output
4. **Multiple Export Formats**
   - Parquet for full precision
   - CSV for broad compatibility
   - Excel for business users (see the `COPY` sketch below)
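
All three exports reduce to `COPY` statements. Routing the `.xlsx` output through the spatial extension's GDAL writer is an assumption based on the requirements listed below.

```sql
-- Assume the cleaned result lives in a table named cleaned_combined_products
COPY cleaned_combined_products TO 'cleaned_combined_products.parquet' (FORMAT PARQUET);
COPY cleaned_combined_products TO 'cleaned_combined_products.csv' (HEADER, DELIMITER ',');

-- Excel output via the spatial extension's GDAL driver
LOAD spatial;
COPY cleaned_combined_products TO 'cleaned_combined_products.xlsx'
    WITH (FORMAT GDAL, DRIVER 'xlsx');
```
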
## Results and Verification
The notebook includes comprehensive verification steps:
1. **Record Completeness Analysis**
   - Tracks missing data across fields
   - Calculates completion percentages
   - Identifies data quality issues
2. **Statistical Verification**
   - Price ranges and averages
   - Inventory levels and statistics
   - Category and subcategory counts
3. **Export Validation**
   - Row count verification
   - Value range checks
   - Format-specific validations (sketched below)
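
The completeness and statistical checks reduce to aggregate queries; here is a sketch over hypothetical column names in the final table:

```sql
-- Per-field completion percentages plus basic price statistics
SELECT
    count(*)                                           AS total_rows,
    round(100.0 * count(current_price) / count(*), 1) AS price_complete_pct,
    round(100.0 * count(subcategory)   / count(*), 1) AS subcat_complete_pct,
    min(current_price)                                 AS min_price,
    max(current_price)                                 AS max_price,
    round(avg(current_price), 2)                       AS avg_price
FROM cleaned_combined_products;
```
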
## Output Files
The process generates three versions of the cleaned dataset:
1. `cleaned_combined_products.parquet`
   - Full precision
   - Efficient storage and querying
   - Ideal for further processing
2. `cleaned_combined_products.csv`
   - Universal compatibility
   - Full-precision values
   - Human-readable format
3. `cleaned_combined_products.xlsx`
   - Business-user friendly
   - Formatted for easy viewing
   - Suitable for direct use
## Usage
The notebook is designed to be both educational and practical:
1. **Educational Use**
   - Step-by-step explanations
   - Clear documentation
   - Verification at each stage
2. **Production Template**
   - Modular design
   - Configurable paths
   - Reusable functions
3. **Client Demonstration**
   - Shows methodology
   - Demonstrates capabilities
   - Highlights quality controls
## Requirements
- Python 3.12+
- DuckDB
  - `spatial` extension (required for `.xlsx` files)
- Jupyter Notebook
- Additional packages:
  - pandas
  - numpy
## License
MIT License