{"id":25535565,"url":"https://github.com/jpotter80/notebook-examples","last_synced_at":"2025-06-12T22:39:27.797Z","repository":{"id":277391364,"uuid":"932268121","full_name":"jpotter80/notebook-examples","owner":"jpotter80","description":"This repository demonstrates a systematic approach to cleaning and standardizing e-commerce product data using DuckDB. The notebook serves as a detailed walkthrough of our data cleaning methodology, showcasing how we handle common data quality challenges in e-commerce datasets.","archived":false,"fork":false,"pushed_at":"2025-02-13T17:10:01.000Z","size":1610,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-13T18:23:30.581Z","etag":null,"topics":["data-analysis","data-cleaning","jupyter-notebook"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jpotter80.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-13T16:35:11.000Z","updated_at":"2025-02-13T17:10:04.000Z","dependencies_parsed_at":"2025-02-13T18:23:38.647Z","dependency_job_id":null,"html_url":"https://github.com/jpotter80/notebook-examples","commit_stats":null,"previous_names":["jpotter80/notebook-examples"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jpotter80%2Fnotebook-examples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jpotter80%2Fnotebook-examples/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jpotter80%2Fnotebook-examples/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jpotter80%2Fnotebook-examples/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jpotter80","download_url":"https://codeload.github.com/jpotter80/notebook-examples/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239774484,"owners_count":19694751,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-cleaning","jupyter-notebook"],"created_at":"2025-02-20T04:21:58.418Z","updated_at":"2025-02-20T04:21:59.204Z","avatar_url":"https://github.com/jpotter80.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# E-commerce Data Cleaning with DuckDB\n## Data Quality Assessment and Cleaning Demonstration\n\nThis repository demonstrates a systematic approach to cleaning and standardizing e-commerce product data using DuckDB. The notebook serves as a detailed walkthrough of our data cleaning methodology, showcasing how we handle common data quality challenges in e-commerce datasets.\n\n## Overview\n\nThis demonstration tackles a common scenario in e-commerce: consolidating product data from multiple sources while ensuring data quality and consistency. 
We use DuckDB, a high-performance analytical database, to process data directly from various file formats without intermediate transformations.\n\n### Source Files\n\nThe demonstration uses synthetic data that represents common e-commerce data scenarios:\n\n1. `main_catalog.csv`: Primary product catalog containing basic product information\n   - SKUs, product names, categories, base prices, inventory levels\n   - Common issues: inconsistent SKU formats, missing data\n\n2. `inventory_update.xlsx`: Recent inventory updates in Excel format\n   - SKUs, current inventory levels, last update timestamps\n   - Common issues: conflicts with main catalog, different date formats\n\n3. `price_list.json`: Latest pricing information in JSON format\n   - SKUs, current prices\n   - Common issues: inconsistent price formats (with/without currency symbols)\n\n4. `category_mapping.parquet`: Category standardization mapping in Parquet format\n   - SKUs, standardized categories and subcategories\n   - Common issues: inconsistent capitalization, missing mappings\n\n## Data Cleaning Process\n\n### 1. SKU Standardization\n- Collects SKUs from all data sources\n- Implements consistent formatting rules\n- Maintains mapping between original and standardized SKUs\n- Handles various format inconsistencies\n\n### 2. Price Normalization\n- Standardizes price formats across sources\n- Removes currency symbols\n- Converts to consistent decimal format\n- Resolves conflicts between catalog and price list\n\n### 3. Inventory Reconciliation\n- Combines inventory data from multiple sources\n- Implements \"most recent update wins\" logic\n- Ensures non-negative inventory values\n- Tracks inventory update timestamps\n\n### 4. 
Category Standardization\n- Implements consistent capitalization rules\n- Resolves category hierarchies\n- Handles missing subcategories\n- Maintains source category mappings\n\n## Technical Implementation\n\n### Key Technologies\n- **DuckDB**: In-process analytical SQL database\n- **Python**: Primary programming language\n- **Jupyter Notebook**: Interactive development and documentation\n- **SQL**: Data transformation and cleaning logic\n\n### Notable Features\n1. **Direct File Processing**\n   - Reads CSV, Excel, JSON, and Parquet without intermediate steps\n   - Reduces memory overhead and simplifies pipeline\n\n2. **SQL-First Approach**\n   - Leverages SQL for complex data transformations\n   - Maintains clarity and performance\n\n3. **Quality Control**\n   - Built-in verification steps\n   - Detailed statistics at each stage\n   - Data quality flags in final output\n\n4. **Multiple Export Formats**\n   - Parquet for full precision\n   - CSV for broad compatibility\n   - Excel for business users\n\n## Results and Verification\n\nThe notebook includes comprehensive verification steps:\n\n1. **Record Completeness Analysis**\n   - Tracks missing data across fields\n   - Calculates completion percentages\n   - Identifies data quality issues\n\n2. **Statistical Verification**\n   - Price ranges and averages\n   - Inventory levels and statistics\n   - Category and subcategory counts\n\n3. **Export Validation**\n   - Row count verification\n   - Value range checks\n   - Format-specific validations\n\n## Output Files\n\nThe process generates three versions of the cleaned dataset:\n\n1. `cleaned_combined_products.parquet`\n   - Full precision\n   - Efficient storage and querying\n   - Ideal for further processing\n\n2. `cleaned_combined_products.csv`\n   - Universal compatibility\n   - Full precision values\n   - Human-readable format\n\n3. 
`cleaned_combined_products.xlsx`\n   - Business-user friendly\n   - Formatted for easy viewing\n   - Suitable for direct use\n\n## Usage\n\nThe notebook is designed to be both educational and practical:\n\n1. **Educational Use**\n   - Step-by-step explanations\n   - Clear documentation\n   - Verification at each stage\n\n2. **Production Template**\n   - Modular design\n   - Configurable paths\n   - Reusable functions\n\n3. **Client Demonstration**\n   - Shows methodology\n   - Demonstrates capabilities\n   - Highlights quality controls\n\n## Requirements\n\n- Python 3.12+\n- DuckDB\n  - `spatial` extension (for `.xlsx` support)\n- Jupyter Notebook\n- Additional packages:\n  - pandas\n  - numpy\n\n## License\n\nThe MIT License\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjpotter80%2Fnotebook-examples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjpotter80%2Fnotebook-examples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjpotter80%2Fnotebook-examples/lists"}