{"id":50365668,"url":"https://github.com/jiachengwang-punch/predictive-analytics-skill","last_synced_at":"2026-05-30T04:02:01.348Z","repository":{"id":360410506,"uuid":"1249789350","full_name":"jiachengwang-punch/predictive-analytics-skill","owner":"jiachengwang-punch","description":"A reusable, multi-model, language-adaptive methodology for end-to-end machine learning analysis of tabular data.","archived":false,"fork":false,"pushed_at":"2026-05-26T08:57:31.000Z","size":35,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T10:26:48.768Z","etag":null,"topics":["claude-skill","codex-skill","data-analysis","data-science","deepseek","feature-engineering","lightgbm","llm","machine-learning","methodology","prompt-engineering","tabular-data"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jiachengwang-punch.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-26T03:22:40.000Z","updated_at":"2026-05-26T09:57:18.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jiachengwang-punch/predictive-analytics-skill","commit_stats":null,"previous_names":["jiachengwang-punch/predictive-analytics-skill"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/jiachengwang-punch/predictive-analytics-skill","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiachengwang-punch%2Fpredictive-analytics-skill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiachengwang-punch%2Fpredictive-analytics-skill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiachengwang-punch%2Fpredictive-analytics-skill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiachengwang-punch%2Fpredictive-analytics-skill/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jiachengwang-punch","download_url":"https://codeload.github.com/jiachengwang-punch/predictive-analytics-skill/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jiachengwang-punch%2Fpredictive-analytics-skill/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33679306,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["claude-skill","codex-skill","data-analysis","data-science","deepseek","feature-engineering","lightgbm","llm","machine-learning","methodology","prompt-engineering","tabular-data"],"created_at":"2026-05-30T04:01:58.454Z","updated_at":"2026-05-30T04:02:01.340Z","avatar_url":"https://github.com/jiachengwang-punch.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Predictive Analytics Skill\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![Release](https://img.shields.io/github/v/release/jiachengwang-punch/predictive-analytics-skill)](https://github.com/jiachengwang-punch/predictive-analytics-skill/releases)\n![Claude Skill](https://img.shields.io/badge/Claude-Skill-8A2BE2)\n![Multi-model](https://img.shields.io/badge/LLM-multi--model-blue)\n![Language-adaptive](https://img.shields.io/badge/output-language--adaptive-green)\n\nA complete, reusable, **multi-model** and **language-adaptive** methodology for end-to-end machine learning analysis of structured (tabular) data — from raw data to a validated, interpretable predictive model and an honest, decision-oriented report.\n\nThe methodology is **domain-agnostic**: it works for customer churn, sales/demand forecasting, risk/default prediction, site selection, sensor/monitoring data, and any other \"I have a dataset and want to predict or understand Y\" task. A city-noise dataset is used as the running example throughout the guides.\n\n## Why this exists\n\nMost data-analysis tutorials stop at \"run a model and report the score.\" This skill encodes the *analysis spirit* that separates rigorous work from score-chasing: detect data leakage before anything else, always compare against a baseline, cross-validate, diagnose where the model fails, and explain the result. It packages that discipline into a form any LLM can follow.\n\n## Dual-format architecture\n\nTwo entry points, one shared core — so the same methodology runs on any model:\n\n```\n                 Two entry points — same methodology\n        ┌────────────────────────┬────────────────────────┐\n        │  SKILL.md              │  METHODOLOGY.md          │\n        │  (YAML header)         │  (plain prompt)          │\n        │  → for Claude          │  → for any other LLM     │\n        └───────────┬────────────┴───────────┬─────────────┘\n                    │         shares          │\n                    ▼                         ▼\n              ┌──────────────────────────────────────┐\n              │  references/ — platform-neutral core  │\n              │  7 deep-dive guides (one per stage)   │\n              └───────────────┬──────────────────────┘\n                    │                         │\n                    ▼                         ▼\n            ┌───────────────┐       ┌──────────────────────────┐\n            │  Claude        │       │  ChatGPT · Gemini         │\n            │  install as    │       │  DeepSeek · Qwen · Kimi   │\n            │  Skill or paste │       │  any LLM via API          │\n            └───────────────┘       └──────────────────────────┘\n```\n\n- **`SKILL.md`** — Claude entry point. Has a YAML header so it installs directly as a Claude Skill.\n- **`METHODOLOGY.md`** — plain-prompt entry point. Paste it into any other LLM (ChatGPT, Gemini, DeepSeek, Qwen, Kimi, or any model via API) as a system prompt or conversation opener.\n- **`references/`** — the shared, platform-neutral core: seven deep-dive guides, one per stage. Both entry points point to these. The methodology and its rigor are identical across models; only the loading mechanism differs.\n\n## Language support\n\nBoth entry points carry a highest-priority instruction: **detect the language of the user's request and produce the entire analysis in that language** — section titles, explanations, result interpretations, chart labels, and the final report. Ask in English, get English; ask in Chinese, get Chinese (including matplotlib chart labels).\n\n## The 7-stage workflow\n\n| Stage | Guide | What it covers |\n|-------|-------|----------------|\n| 1. Data import \u0026 cleaning | `references/01_data_import_cleaning.md` | Load data, understand fields, quality checks, **data-leakage detection**, time/type handling |\n| 2. Exploratory data analysis | `references/02_eda.md` | Distributions, group comparisons, correlations, time/space/category patterns, chart-font setup |\n| 3. Feature engineering | `references/03_feature_engineering.md` | Remove leakage/ID features, derive features, encode categoricals, feature–target analysis |\n| 4. Modeling | `references/04_modeling.md` | Baseline → ensemble progression, classification \u0026 regression, class-imbalance handling |\n| 5. Evaluation \u0026 diagnosis | `references/05_evaluation_diagnosis.md` | Right metrics, K-fold CV, grid search, **residual diagnosis**, data-driven thresholds |\n| 6. Clustering | `references/06_clustering.md` | Profile aggregation, KMeans, **PCA for honest high-dim visualization** |\n| 7. Interpretability \u0026 output | `references/07_interpretability_output.md` | Feature importance, SHAP, honest model math, report composition |\n\n## Agent ensemble\n\nBeyond the linear 7-stage pipeline, the skill ships a **4-agent ensemble** that adds independent, often adversarial perspectives so an analysis gets thought through from more angles. Each agent has a portable role-prompt in `references/agents/` (the single source of truth — paste into any LLM) and, for Claude Code, a thin plugin wrapper in `agents/` that points to it.\n\n| Agent | When | What it does |\n|-------|------|--------------|\n| **methodology-architect** | Before modeling | Analysis blueprint: task type, method tier by sample/feature constraints, split \u0026 validation plan, leakage-risk list |\n| **model-diagnostician** | After a good score | Adversarially hunts failure: residual structure, subgroup performance, calibration, leakage suspicion |\n| **tool-scout** | When selection is uncertain | **Searches the web** for a better-fitting library/model — LightGBM is the default, not automatically best; flags clearly when offline |\n| **professor-reviewer** | Before delivery | Professor-perspective review + grade; sends methodology flaws back with a remediation checklist (does not auto-rerun) |\n\nA rigorous loop: architect (plan) → stages 1–7 → diagnostician (find flaws) → revise → professor-reviewer (final gate) → iterate; call tool-scout whenever model choice is in doubt. In Claude Code, install the repo as a plugin to use the agents natively; in any other LLM, paste the matching `references/agents/*.md` as a role prompt.\n\n\u003e **tool-scout in practice — when to reach for CatBoost.** LightGBM is the workhorse default, but on **categorical-heavy data** (many high-cardinality categorical columns, few numeric features), **CatBoost** is often the better fit: it handles categoricals natively (no manual one-hot) and uses ordered target encoding that guards against target leakage, frequently edging out LightGBM on such datasets at the cost of slower training. Reach for it for churn / recommendation / transaction-style tables dominated by categorical fields; stick with LightGBM when data is large and mostly numeric or training speed matters. When in doubt, dispatch tool-scout to verify the current best fit for your specific data.\n\n## Installation \u0026 usage\n\nRepository: \u003chttps://github.com/jiachengwang-punch/predictive-analytics-skill\u003e\n\n```bash\ngit clone https://github.com/jiachengwang-punch/predictive-analytics-skill.git\n```\n\n### For Claude\n\n**Option A — install as a Skill (recommended):** package the folder into a `.skill` file and install it via Claude's skill settings. Once installed, it triggers automatically when you ask to analyze a dataset or build a model.\n\n**Option B — paste in chat:** open `SKILL.md` and paste its contents into the conversation, then describe your task and attach your data.\n\n### For other LLMs (ChatGPT, Gemini, DeepSeek, Qwen, Kimi, etc.)\n\nOpen `METHODOLOGY.md`, copy its entire contents, and paste it into the model as a system prompt or at the start of your conversation. Then describe your data-analysis task and provide your data (or its description). If your environment supports file uploads, you can also attach the relevant `references/` guide for the stage you're working on.\n\n## Repository structure\n\n```\npredictive-analytics-skill/\n├── .claude-plugin/\n│   └── plugin.json       # Plugin manifest (use the repo as a Claude Code plugin)\n├── SKILL.md              # Claude entry point (YAML header)\n├── METHODOLOGY.md        # Plain-prompt entry point (other LLMs)\n├── agents/               # Claude plugin agent wrappers (point to references/agents/)\n│   ├── methodology-architect.md\n│   ├── model-diagnostician.md\n│   ├── tool-scout.md\n│   └── professor-reviewer.md\n├── references/           # Shared platform-neutral core\n│   ├── agents/           # Portable agent role-prompts (single source of truth)\n│   │   ├── methodology-architect.md\n│   │   ├── model-diagnostician.md\n│   │   ├── tool-scout.md\n│   │   └── professor-reviewer.md\n│   ├── 01_data_import_cleaning.md\n│   ├── 02_eda.md\n│   ├── 03_feature_engineering.md\n│   ├── 04_modeling.md\n│   ├── 05_evaluation_diagnosis.md\n│   ├── 06_clustering.md\n│   └── 07_interpretability_output.md\n├── README.md\n├── LICENSE\n├── CONTRIBUTING.md\n├── CHANGELOG.md\n└── .gitignore\n```\n\n## Core principles (the \"analysis spirit\")\n\nThese apply at every stage and are the heart of the methodology:\n\n- **Screen for leakage, proportional to risk.** For every feature, ask: \"Is this truly available at prediction time, or is it a downstream result of the target?\" Leakage inflates scores but destroys real predictive value. The screening question is always worth asking; how hard you investigate scales with data provenance (high-risk: temporal prediction, target-derived features, multi-table joins, full-data encoding — low-risk: clean cross-sectional data with exogenous features).\n- **Honesty over a pretty number.** Report negative results, expose where the model fails, state limitations.\n- **Always compare against a baseline.** A complex model must beat a simple one to justify itself.\n- **Match method to constraint.** Sample size, data type, and interpretability needs drive method choice — not fashion.\n- **Let the data tell you thresholds.** Use data-driven splits over guessed cutoffs — but remember tree models auto-find splits, so manual binning helps linear models, not trees.\n- **Cross-validate and interpret.** Don't trust a single split; don't ship a black box without explanation.\n\n## Tech stack (for the code patterns in the guides)\n\nPython with `pandas`, `scikit-learn`, `lightgbm`, `shap`, `matplotlib`, `seaborn`, `scipy`. The guides contain copy-adaptable code patterns; adapt them to your dataset and language.\n\n## License\n\nReleased under the MIT License. See [LICENSE](LICENSE).\n\n## Contributing\n\nContributions are welcome — see [CONTRIBUTING.md](CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjiachengwang-punch%2Fpredictive-analytics-skill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjiachengwang-punch%2Fpredictive-analytics-skill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjiachengwang-punch%2Fpredictive-analytics-skill/lists"}