https://github.com/omnipotence-eth/manufacturing-quality-analytics
SQL + Python pipeline for semiconductor NCR analysis — supplier performance, defect Pareto, yield trends
- Host: GitHub
- URL: https://github.com/omnipotence-eth/manufacturing-quality-analytics
- Owner: omnipotence-eth
- License: mit
- Created: 2026-04-07T23:20:03.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-09T11:08:12.000Z (3 months ago)
- Last Synced: 2026-04-09T12:16:01.492Z (3 months ago)
- Topics: analytics, data-analysis, etl, manufacturing, matplotlib, pandas, postgresql, python, quality, sql
- Language: Jupyter Notebook
- Size: 1.06 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
README
# Manufacturing Quality Analytics
**SQL + Python pipeline for semiconductor NCR analysis — supplier performance, defect Pareto, yield trends**
[Architecture](#architecture) · [Key Findings](#key-findings) · [SQL Highlights](#sql-highlights) · [Visualizations](#visualizations) · [Quick Start](#quick-start) · [Contributing](CONTRIBUTING.md)
---
## What is This?
A production-grade data analytics pipeline that ingests 2,500 simulated Non-Conformance Reports (NCRs) from a semiconductor manufacturing operation, loads them into PostgreSQL, and answers 10 business questions with SQL — then renders 7 professional visualizations in a Jupyter notebook.
The dataset is synthetically generated from real manufacturing quality domain knowledge (GM Quality + Shield AI supplier quality experience): realistic supplier defect curves, lot sizes drawn from log-normal distributions, a corrective action improvement trajectory baked into the worst-performing supplier, and shift-level quality variation matching industry norms.
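As a rough illustration of how such a generator can work, the sketch below draws a log-normal lot size, rejects each unit independently at a supplier's defect rate, and lets that rate decay after a corrective action. It uses only the standard library. `make_lot`, `rate_after_cap`, and every parameter value (`mu`, `sigma`, the 10% monthly improvement) are hypothetical and are not taken from `generate_synthetic.py`.

```python
import random


def rate_after_cap(base_rate: float, months_since_cap: int,
                   monthly_improvement: float = 0.10) -> float:
    """Defect rate after a corrective action, decaying geometrically per month."""
    return base_rate * (1.0 - monthly_improvement) ** max(0, months_since_cap)


def make_lot(rng: random.Random, defect_rate: float,
             mu: float = 6.0, sigma: float = 0.8) -> dict:
    """Simulate one received lot; mu and sigma are illustrative values only."""
    # Log-normal lot size, floored at one unit.
    quantity_received = max(1, round(rng.lognormvariate(mu, sigma)))
    # Each unit is rejected independently with probability defect_rate.
    quantity_rejected = sum(
        rng.random() < defect_rate for _ in range(quantity_received)
    )
    return {
        "quantity_received": quantity_received,
        "quantity_rejected": quantity_rejected,
        "defect_rate": quantity_rejected / quantity_received,
    }


rng = random.Random(42)  # fixed seed so the sample is reproducible
# 20 lots per month for six months after the corrective action starts.
lots = [
    make_lot(rng, rate_after_cap(0.069, months_since_cap=month))
    for month in range(6)
    for _ in range(20)
]
```

Seeding the generator keeps the dataset stable between runs, which makes downstream query results and tests repeatable.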
> **What this demonstrates**: SQL fluency (CTEs, window functions, Pareto, self-joins), a clean Python data pipeline, manufacturing domain expertise, and production engineering habits — the exact combination DA/DE interviews test.
---
## Why
Manufacturing quality data lives in spreadsheets and disconnected databases. This project demonstrates how to build a proper analytics pipeline — PostgreSQL for structured storage, Python for ETL and statistical analysis, and SQL queries that answer real questions about defect rates, supplier performance, and process capability. Built from experience in automotive and aerospace quality engineering, it models the kind of NCR (non-conformance report) analysis that quality teams actually need but rarely have automated.
---
## Architecture
```mermaid
graph LR
A["generate_synthetic.py\n2,500 NCRs"] --> B["data/raw/\nquality_records.csv"]
B --> C["load_data.py\nSQLAlchemy · psycopg2"]
C --> D[("PostgreSQL 18\nmanufacturing_qa")]
D --> E["sql/queries.sql\n10 business queries"]
D --> F["analysis.ipynb\nSQLAlchemy connection"]
E -.->|reference| F
F --> G["7 Visualizations\nMatplotlib · Seaborn"]
G --> H["visuals/*.png"]
```
### Data model
| Column | Type | Description |
|--------|------|-------------|
| `ncr_number` | `varchar` | Unique NCR ID — `NCR-202301-0001` format |
| `supplier_name` / `supplier_tier` | `varchar` / `int` | Supplier identity and qualification tier (1–3) |
| `production_line` / `shift` | `varchar` | Where and when the defect was found |
| `defect_code` / `defect_type` | `varchar` | 15-category defect taxonomy (D001–D015) |
| `quantity_received` / `quantity_rejected` | `int` | Lot size and rejection volume |
| `defect_rate` | `float` | Rejection rate for this NCR event |
| `opened_date` / `closed_date` | `date` | NCR lifecycle timestamps |
| `days_to_close` | `int` | Disposition cycle time |
| `disposition` | `varchar` | Use As Is / Rework / Return to Supplier / Scrap |
| `first_pass` | `bool` | Whether the lot passed first inspection |
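The invariants implied by this table can be written down as a small record type. This is a sketch under assumptions, not the project's schema code: `QualityRecord` is a hypothetical name, it covers only a subset of the columns, it assumes `defect_rate` is `quantity_rejected / quantity_received`, and it assumes `closed_date` may be missing for an open NCR.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QualityRecord:
    """Subset of the quality_records columns, with basic domain checks."""

    ncr_number: str
    supplier_name: str
    quantity_received: int
    quantity_rejected: int
    opened_date: date
    closed_date: date | None = None  # assumption: open NCRs have no close date

    def __post_init__(self) -> None:
        if self.quantity_received <= 0:
            raise ValueError("quantity_received must be positive")
        if not 0 <= self.quantity_rejected <= self.quantity_received:
            raise ValueError("quantity_rejected must be between 0 and quantity_received")
        if self.closed_date is not None and self.closed_date < self.opened_date:
            raise ValueError("closed_date cannot precede opened_date")

    @property
    def defect_rate(self) -> float:
        # Assumed definition: share of the lot that was rejected.
        return self.quantity_rejected / self.quantity_received

    @property
    def days_to_close(self) -> int | None:
        if self.closed_date is None:
            return None
        return (self.closed_date - self.opened_date).days


record = QualityRecord(
    ncr_number="NCR-202301-0001",
    supplier_name="FastTrack Supply",
    quantity_received=200,
    quantity_rejected=10,
    opened_date=date(2023, 1, 5),
    closed_date=date(2023, 1, 19),
)
```

Checks like these are the kind of schema, range, and domain invariants a generator test suite would typically assert.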
---
## Key Findings
1. **FastTrack Supply** has the highest defect rate at **6.9%**, about 8.1× that of best-in-class AeroParts Manufacturing (0.85%). A corrective action plan (CAP) implemented in July 2023 drove a measurable improvement through H2 2023.
2. **Dimensional and Surface Finish defects** account for **40% of all rejected units** in the Pareto analysis. These two categories are the strongest candidates for dedicated inspection protocols.
3. **Night shift** runs a defect rate **1.4× that of Day shift** across all production lines. The gap is largest in Electronics Integration (LINE\_B), which points to a possible staffing or training gap on nights.
4. **4 of 7 suppliers** triggered the rolling 30-day repeat-offender threshold (3+ NCRs in 30 days), which indicates systemic lot-level problems rather than random variation and meets the criteria for a mandatory CAP.
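The supplier comparison in finding 1 is a plain ratio of the two quoted rates, which can be checked in a couple of lines. The per-10,000 figures are the same percentages rescaled, not additional data.

```python
fasttrack_pct = 6.9   # FastTrack Supply defect rate, from finding 1
aeroparts_pct = 0.85  # AeroParts Manufacturing defect rate, from finding 1

ratio = fasttrack_pct / aeroparts_pct
print(f"{ratio:.1f}x")  # prints "8.1x"

# The same rates expressed as rejects per 10,000 units received.
per_10k = {
    "FastTrack Supply": round(fasttrack_pct * 100),
    "AeroParts Manufacturing": round(aeroparts_pct * 100),
}
```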
---
## SQL Highlights
### Rolling 30-day repeat offender detection
```sql
-- Suppliers with 3+ NCRs in any rolling 30-day window
WITH windowed AS (
    SELECT
        supplier_name,
        opened_date,
        COUNT(*) OVER (
            PARTITION BY supplier_name
            ORDER BY opened_date
            RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW
        ) AS ncrs_in_30d_window
    FROM quality_records
)
SELECT DISTINCT
    supplier_name,
    MAX(ncrs_in_30d_window) OVER (PARTITION BY supplier_name) AS max_ncrs_in_any_30d
FROM windowed
WHERE ncrs_in_30d_window >= 3
ORDER BY max_ncrs_in_any_30d DESC;
```
```
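For readers less familiar with `RANGE` frames, the same rule can be sketched in plain Python with a sliding window over each supplier's sorted dates. `max_ncrs_in_window` is a hypothetical helper and the sample records are invented for illustration. The window covers the current date plus the 29 days before it, matching the frame in the query.

```python
from collections import defaultdict
from datetime import date


def max_ncrs_in_window(records, window_days: int = 30) -> dict:
    """Largest NCR count per supplier in any window of `window_days` calendar days.

    `records` is an iterable of (supplier_name, opened_date) pairs.
    """
    by_supplier = defaultdict(list)
    for supplier, opened in records:
        by_supplier[supplier].append(opened)

    result = {}
    for supplier, dates in by_supplier.items():
        dates.sort()
        left = 0
        best = 0
        for right, current in enumerate(dates):
            # Drop dates older than (current - 29 days) from the window.
            while (current - dates[left]).days > window_days - 1:
                left += 1
            best = max(best, right - left + 1)
        result[supplier] = best
    return result


# Invented sample data: A has three NCRs inside 30 days, B never has more than one.
sample = [
    ("Supplier A", date(2023, 1, 1)),
    ("Supplier A", date(2023, 1, 15)),
    ("Supplier A", date(2023, 1, 30)),
    ("Supplier B", date(2023, 1, 1)),
    ("Supplier B", date(2023, 2, 15)),
    ("Supplier B", date(2023, 3, 30)),
]
counts = max_ncrs_in_window(sample)
repeat_offenders = {s: n for s, n in counts.items() if n >= 3}
```

The two-pointer scan visits each date once per supplier after sorting, so it stays cheap even when a supplier has many NCRs.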
### Composite supplier scorecard (PERCENT_RANK + CTE)
```sql
-- Weighted composite: defect 50%, response time 30%, FPY 20%
-- Excerpt: assumes a preceding `WITH supplier_stats AS (...)` CTE with one row per supplier.
-- Every score runs 0-100 with higher = better, so defect rate and close time are
-- ranked ascending (lowest value -> PERCENT_RANK 0 -> score 100).
scored AS (
    SELECT *,
        ROUND((100.0 * (1 - PERCENT_RANK() OVER (ORDER BY defect_rate_pct)))::numeric, 1) AS defect_score,
        ROUND((100.0 * (1 - PERCENT_RANK() OVER (ORDER BY avg_close_days)))::numeric, 1) AS response_score,
        ROUND((PERCENT_RANK() OVER (ORDER BY fpy_pct) * 100)::numeric, 1) AS fpy_score
    FROM supplier_stats
)
SELECT *,
    ROUND(0.50 * defect_score + 0.30 * response_score + 0.20 * fpy_score, 1) AS composite_score,
    RANK() OVER (ORDER BY (...) DESC) AS overall_rank  -- (...) is elided in this excerpt
FROM scored;
```
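To make the scoring direction explicit, here is a small Python sketch of the same weighting. It assumes a higher score is better on every component, so the supplier with the lowest defect rate and fastest close time scores 100 on those components. `percent_rank` is a hypothetical helper that mimics SQL `PERCENT_RANK()` over ascending order, and the three suppliers and their numbers are invented.

```python
def percent_rank(values: list) -> list:
    """SQL-style PERCENT_RANK, ascending: (rank - 1) / (n - 1); ties share a rank."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    ordered = sorted(values)
    # index() returns the first position of a value, i.e. its rank minus one.
    return [ordered.index(v) / (n - 1) for v in values]


# Invented per-supplier stats: (defect_rate_pct, avg_close_days, fpy_pct).
supplier_stats = {
    "Supplier A": (0.9, 5.0, 99.0),
    "Supplier B": (3.0, 9.0, 95.0),
    "Supplier C": (6.5, 14.0, 90.0),
}

names = list(supplier_stats)
defect_pr = percent_rank([supplier_stats[name][0] for name in names])
close_pr = percent_rank([supplier_stats[name][1] for name in names])
fpy_pr = percent_rank([supplier_stats[name][2] for name in names])

composite = {}
for i, name in enumerate(names):
    defect_score = 100.0 * (1 - defect_pr[i])    # low defect rate -> high score
    response_score = 100.0 * (1 - close_pr[i])   # fast close -> high score
    fpy_score = 100.0 * fpy_pr[i]                # high first pass yield -> high score
    composite[name] = round(
        0.50 * defect_score + 0.30 * response_score + 0.20 * fpy_score, 1
    )

# Best supplier first, as RANK() OVER (ORDER BY composite DESC) would order them.
ranking = sorted(names, key=lambda name: composite[name], reverse=True)
```

Because the scores are rank-based, they describe each supplier's position relative to the others, not the absolute size of the gap between them.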
### Pareto with running cumulative total
```sql
ROUND(
    100.0 * SUM(total_rejected) OVER (
        ORDER BY total_rejected DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) / SUM(total_rejected) OVER (),
    2
) AS cumulative_pct
```
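The running-total expression above amounts to sorting categories by rejected units and accumulating their share. A minimal Python sketch follows. `pareto` is a hypothetical helper and the unit counts are invented so that the top two categories add up to 40%. Category names other than Dimensional and Surface Finish are placeholders.

```python
from itertools import accumulate


def pareto(rejected_by_defect: dict) -> list:
    """Return (defect_type, total_rejected, cumulative_pct) rows, largest first."""
    ordered = sorted(rejected_by_defect.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(qty for _, qty in ordered)
    if total == 0:
        return []
    running = accumulate(qty for _, qty in ordered)
    return [
        (name, qty, round(100.0 * cum / total, 2))
        for (name, qty), cum in zip(ordered, running)
    ]


# Invented counts for illustration only.
rows = pareto({
    "Surface Finish": 180,
    "Dimensional": 220,
    "Contamination": 160,
    "Solder Defect": 140,
    "Labeling": 120,
    "Packaging": 100,
    "Other": 80,
})
for name, qty, cumulative_pct in rows:
    print(f"{name:<15} {qty:>4} {cumulative_pct:>6.2f}")
```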
**All 10 queries:** defect rate by supplier, Pareto of defect types, monthly trend with rolling average, yield by production line, average time to disposition, repeat offender window function, defect rate by shift, co-occurring defect self-join, first pass yield by month, composite supplier scorecard.
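Of the queries listed, the monthly trend with a rolling average is easy to mirror outside the database. The sketch below computes a trailing mean, comparable to `AVG(...) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`. The 3-month window and the monthly counts are assumptions, since the window length used in `sql/queries.sql` is not stated here.

```python
def rolling_mean(values: list, window: int = 3) -> list:
    """Trailing mean over up to `window` values, including the current one."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for i in range(len(values)):
        # Early rows use however many values exist so far, as a ROWS frame does.
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


monthly_ncrs = [10, 20, 30, 40]  # invented monthly NCR counts
smoothed = rolling_mean(monthly_ncrs)
```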
---
## Visualizations
The notebook exports seven charts to `visuals/` (the images are not reproduced here):
- Defect Rate by Supplier
- Pareto of Defect Types
- Monthly NCR Trend with Rolling Average
- Defect Rate Heatmap: Production Line × Shift
- Supplier Quality Scorecard
- First Pass Yield by Month
- Shift Quality Comparison
---
## Quick Start
### Prerequisites
- Python 3.11+
- PostgreSQL 18 (local or Docker)
- conda or pip
### Install
```bash
git clone https://github.com/omnipotence-eth/manufacturing-quality-analytics.git
cd manufacturing-quality-analytics
pip install -r requirements.txt
```
### Configure
```bash
cp .env.example .env
# Edit .env:
# DATABASE_URL=postgresql://user:password@localhost:5432/manufacturing_qa
```
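On the Python side, the connection string is read from the environment. The sketch below shows one way to validate it before connecting, using only the standard library. `database_settings` is a hypothetical helper, not a function from this repository. In the real pipeline, python-dotenv's `load_dotenv()` would typically populate `os.environ` from `.env` first.

```python
import os
from urllib.parse import urlsplit


def database_settings(env=os.environ) -> dict:
    """Parse DATABASE_URL into its parts, failing early with a clear message."""
    url = env.get("DATABASE_URL")
    if not url:
        raise RuntimeError(
            "DATABASE_URL is not set; copy .env.example to .env and fill it in"
        )
    parts = urlsplit(url)
    if not parts.scheme.startswith("postgresql"):
        raise RuntimeError(f"expected a postgresql:// URL, got scheme {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # PostgreSQL default port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
        # The password is deliberately left out so it never lands in logs.
    }


settings = database_settings(
    {"DATABASE_URL": "postgresql://user:password@localhost:5432/manufacturing_qa"}
)
```

Failing at startup with a readable message is usually friendlier than letting the database driver raise a connection error later in the load step.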
### Run
```bash
# 1. Generate synthetic dataset (2,500 NCRs → data/raw/quality_records.csv)
python src/generate_synthetic.py
# 2. Create database and load data
# createdb manufacturing_qa (if not already created)
python src/load_data.py
# 3. Open the analysis notebook
jupyter lab notebooks/analysis.ipynb
# Run all cells — charts export automatically to visuals/
# 4. Run standalone SQL queries (export DATABASE_URL in your shell first; psql does not read .env)
psql "$DATABASE_URL" -f sql/queries.sql
```
### Tests
```bash
pytest -q
```
---
## Tech Stack
| Layer | Technology | Notes |
|-------|-----------|-------|
| **Data generation** | Python, NumPy | Log-normal lot sizes, realistic supplier defect curves, FastTrack CAP improvement trajectory |
| **Data pipeline** | Pandas, SQLAlchemy 2.x | Type enforcement, null guards, batch insert, schema confirmation |
| **Database** | PostgreSQL 18, psycopg2 | `quality_records` table — 20 columns, 2,500 rows |
| **SQL** | PostgreSQL SQL | CTEs, window functions (`RANGE BETWEEN`, `PERCENT_RANK`, `RANK`), self-joins, `PERCENTILE_CONT` |
| **Analysis** | Jupyter Lab, Pandas | All queries run via SQLAlchemy connection — no CSV re-reads |
| **Visualization** | Matplotlib, Seaborn | `seaborn-v0_8-whitegrid` style, consistent palette, exported PNG at 150 DPI |
| **Config** | python-dotenv | `DATABASE_URL` from `.env` — never hardcoded |
| **Code quality** | Ruff, mypy | Line length 100, `from __future__ import annotations`, typed public signatures |
| **Testing** | pytest | 20 unit tests for synthetic data generator — schema, ranges, domain invariants |
| **CI** | GitHub Actions | lint (ruff check + format) → test (pytest) on every push and PR |
---
## Documentation
| Document | Contents |
|----------|---------|
| [CONTRIBUTING.md](CONTRIBUTING.md) | Branch strategy, ship workflow, audit checklist, commit standards, PR checklist |
| [CHANGELOG.md](CHANGELOG.md) | Version history |
| [SECURITY.md](SECURITY.md) | Security model and vulnerability reporting |
| [sql/queries.sql](sql/queries.sql) | All 10 annotated standalone SQL queries |
---
## License
MIT — see [LICENSE](LICENSE).