{"id":31895752,"url":"https://github.com/pasted/scaml","last_synced_at":"2025-10-13T10:50:56.247Z","repository":{"id":316921739,"uuid":"1065346172","full_name":"pasted/SCAML","owner":"pasted","description":"Single Cell RNA analysis pipeline for Seurat outputs of microglial iPSC, using Scikit Learn Models","archived":false,"fork":false,"pushed_at":"2025-09-27T15:52:17.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-27T16:27:22.771Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pasted.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-27T14:39:11.000Z","updated_at":"2025-09-27T15:52:21.000Z","dependencies_parsed_at":"2025-09-27T16:27:27.342Z","dependency_job_id":null,"html_url":"https://github.com/pasted/SCAML","commit_stats":null,"previous_names":["pasted/scaml"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/pasted/SCAML","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pasted%2FSCAML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pasted%2FSCAML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pasted%2FSCAML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pasted%2FSCAML/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pasted","download_url":"https://codeload.github.com/pasted/SCAML/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pasted%2FSCAML/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279014645,"owners_count":26085556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-13T10:50:54.669Z","updated_at":"2025-10-13T10:50:56.241Z","avatar_url":"https://github.com/pasted.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SCAML\n\nMachine-learning for single-cell RNA-seq microglia.\nTrain scikit-learn models (LR / RF / XGB / MLP) to predict cell types/states directly from expression, benchmarked across harvests (H4 vs H8) and with HEK novelty detection. This extends the Seurat analysis from **[scrna_ipsc](https://github.com/pasted/scrna_ipsc)** and turns label transfer into a full ML benchmark.\n\n---\n\n## What’s new (high-level)\n\n* **Robust splits \u0026 labels**\n\n  * Train/test by `harvest`, `culture`, etc. 
with consistent label encoding.\n  * Works even when train/test see different class sets (reverse split).\n    XGBoost label remapping + probability expansion included.\n\n* **Safer metrics**\n\n  * `classification_report` aligned to the full class list.\n  * Macro OVR ROC-AUC computed robustly; returns `NaN` when test has \u003c2 classes.\n\n* **Saved artifacts for plotting**\n\n  * Per-model: `predictions.csv`, `confusion_matrix.csv`, `metrics.json`, `proba.csv`.\n  * Top-level: `summary_metrics.json`, `classes.txt`, `features.txt`.\n  * New utility: `scaml_plot.py` renders model comparison + confusion matrices.\n\n* **Seurat v5 → AnnData export guidance**\n\n  * Reliable `.h5ad` creation even with Seurat v5 multi-layer assays.\n\n---\n\n## Data expectations (AnnData `.h5ad`)\n\n* `adata.X` — **recommended**: raw counts (SCAML log1p’s internally for HVGs).\n  If you prefer, you can skip HVGs and pass an **embedding** via `--embedding`.\n\n* `adata.obs` (columns expected):\n\n  * `label_ref`: string labels (e.g. Olah 2020 microglia clusters).\n  * `harvest`: e.g. `\"H4\"` / `\"H8\"` (or any split key you choose).\n  * `is_hek` (optional): boolean for novelty detection.\n\n* `adata.obsm` (optional but supported):\n\n  * `X_pca`, `X_umap`, and optionally `X_harmony` (if you ran Harmony in Seurat).\n  * Use `--embedding X_harmony` (or `X_pca`, etc.) to train on embeddings instead of genes.\n\n---\n\n## Export from Seurat v5 → `.h5ad` (robust path)\n\nSeurat v5’s multi-layer assays can break older converters. 
A reliable approach is to write the AnnData directly from R using the **`anndata`** R package:\n\n```r\n# In R\nlibrary(Seurat)\nlibrary(anndata)\n\nobj \u003c- readRDS(\"seurat_with_labels.rds\")\n\n# Map fields SCAML expects\nstopifnot(\"olah_label\" %in% colnames(obj@meta.data))\nobj$label_ref \u003c- as.character(obj$olah_label)\nobj$is_hek    \u003c- if (\"mg_like\" %in% colnames(obj@meta.data)) !obj$mg_like else FALSE\n\n# Choose expression for adata.X (SCT preferred if present; otherwise RNA)\nassay \u003c- if (\"SCT\" %in% names(obj@assays)) \"SCT\" else \"RNA\"\nDefaultAssay(obj) \u003c- assay\n# Ensure 'data' exists; recompute if needed\nobj \u003c- NormalizeData(obj, assay = assay, verbose = FALSE)\n\n# Build AnnData\n# Seurat stores genes x cells; AnnData expects cells x genes, so transpose\nX \u003c- Matrix::t(GetAssayData(obj, assay = assay, layer = \"data\"))  # log-normalized\nobsm \u003c- list()\nif (\"pca\" %in% names(obj@reductions))     obsm[[\"X_pca\"]]     \u003c- as.matrix(Embeddings(obj, \"pca\"))\nif (\"umap\" %in% names(obj@reductions))    obsm[[\"X_umap\"]]    \u003c- as.matrix(Embeddings(obj, \"umap\"))\nif (\"harmony\" %in% names(obj@reductions)) obsm[[\"X_harmony\"]] \u003c- as.matrix(Embeddings(obj, \"harmony\"))\n\nad \u003c- anndata::AnnData(\n  X   = X,\n  obs = obj@meta.data,\n  var = data.frame(gene = rownames(obj), row.names = rownames(obj)),\n  obsm = obsm\n)\n\ndir.create(\"data\", showWarnings = FALSE)\nad$write_h5ad(\"data/ipsc_mg_all.h5ad\", compression = \"gzip\")\n```\n\n**Verify the file** (size \u003e\u003e 1MB and `h5ls` shows groups like `obs`, `var`, `X`, `obsm`).\nIf you prefer HVGs on raw counts, set `adata.layers[\"counts\"]` too; otherwise SCAML will still work (it log1p’s the selected genes).\n\n\u003e ⚠️ SeuratDisk sometimes creates tiny (invalid) `.h5ad` files under Seurat v5. 
If your file is ~800 bytes or `h5ls` shows no keys, use the `anndata` route above.\n\n---\n\n## Setup\n\n```bash\n# Python (aligns with CI)\nconda create -y -n scaml python=3.10\nconda activate scaml\n\n# Install\npip install --upgrade pip\npip install -r SCAML/requirements.txt\n```\n\n---\n\n## Train \u0026 Evaluate\n\n### 1) Baseline cross-harvest split\n\n```bash\npython SCAML/scaml_train.py --adata data/ipsc_mg_all.h5ad \\\n  --label-key label_ref \\\n  --split-key harvest --split-train H4 --split-test H8 \\\n  --features hvg --n-hvg 3000 \\\n  --models lr rf xgb mlp \\\n  --outdir results/baseline_H4train_H8test\n```\n\n### 2) Reverse split (robust to class-set differences)\n\n```bash\npython SCAML/scaml_train.py --adata data/ipsc_mg_all.h5ad \\\n  --label-key label_ref \\\n  --split-key harvest --split-train H8 --split-test H4 \\\n  --features hvg --n-hvg 3000 \\\n  --models lr rf xgb mlp \\\n  --outdir results/baseline_H8train_H4test\n```\n\n### 3) Learning curves (group-aware; avoids leakage)\n\n```bash\npython SCAML/scaml_learning_curve.py \\\n  --adata data/ipsc_mg_all.h5ad \\\n  --label-key label_ref --group-key culture \\\n  --features hvg --n-hvg 3000 \\\n  --model lr \\\n  --outdir results/learning_curves\n```\n\n### 4) Novelty detection (HEK outliers)\n\n```bash\npython SCAML/scaml_novelty.py \\\n  --adata data/ipsc_mg_all.h5ad \\\n  --hek-key is_hek --hek-positive true \\\n  --features hvg --n-hvg 3000 \\\n  --method iforest \\\n  --outdir results/novelty\n```\n\n### 5) Plots\n\n```bash\n# Compare models + per-model confusion matrices\npython SCAML/scaml_plot.py \\\n  --results results/baseline_H4train_H8test \\\n  --outdir  results/baseline_H4train_H8test/plots \\\n  --models  lr rf xgb mlp\n\npython SCAML/scaml_plot.py \\\n  --results results/baseline_H8train_H4test \\\n  --outdir  results/baseline_H8train_H4test/plots \\\n  --models  lr rf xgb mlp\n```\n\n**Tip:** If you exported Harmony to `obsm['X_harmony']`, you can train on 
it:\n\n```bash\npython SCAML/scaml_train.py \\\n  --adata data/ipsc_mg_all.h5ad \\\n  --label-key label_ref \\\n  --split-key harvest --split-train H4 --split-test H8 \\\n  --embedding X_harmony \\\n  --models lr rf xgb mlp \\\n  --outdir results/with_harmony_H4train_H8test\n```\n\n---\n\n## Outputs\n\n**Top-level (per run)**\n\n* `summary_metrics.json` — per-model summary: accuracy, F1 (macro), ROC-AUC (macro OVR; may be `NaN` if test has one class)\n* `classes.txt` — integer ↔︎ label mapping\n* `features.txt` — HVG list or embedding column names\n\n**Per model (e.g., `lr/`, `rf/`, `xgb/`, `mlp/`)**\n\n* `classification_report.txt` — precision/recall/F1 by class\n* `confusion_matrix.csv` — rows = true, cols = predicted\n* `predictions.csv` — cell, true/pred labels + indices\n* `proba.csv` — per-class probabilities (expanded to the full class set)\n* `metrics.json` — accuracy, F1 (macro), ROC-AUC (macro OVR)\n\n**Plots (from `scaml_plot.py`)**\n\n* `plots/model_compare.png` — accuracy \u0026 F1 bars per model\n* `plots/\u003cmodel\u003e_confusion_matrix.png` — heatmaps (row-normalized)\n\n---\n\n## Models\n\n* **LR** — `LogisticRegression` (with `MaxAbsScaler`)\n* **RF** — `RandomForestClassifier`\n* **XGB** — `XGBClassifier` (auto-remaps labels when train/test class sets differ)\n* **MLP** — `MLPClassifier` (dense path with standardization)\n\n---\n\n## Notes \u0026 gotchas\n\n* **HVGs vs counts**: The HVG selector is Seurat-style. For cleanest results, put **raw counts** in `adata.X`; SCAML applies `log1p` on the selected genes. If your `adata.X` is already log data, consider using `--embedding` to avoid HVG warnings.\n\n* **ROC-AUC showing `NaN`**: This is expected if your test split contains only one class. 
Accuracy/F1 are still valid.\n\n* **SeuratDisk tiny `.h5ad`**: If `h5ad` is ~800 bytes or missing groups (`obs`, `var`, etc.), re-export with the R `anndata` approach above.\n\n* **Group-aware splits**: Prefer `--split-key culture` (or donor/patient) for fair generalization tests; avoid random cell splits.\n\n---\n\n## Dev / CI\n\nLocal:\n\n```bash\npip3 install flake8 pytest\nflake8 .\npytest\n```\n\nGitHub Actions (name: **SCAML**) runs:\n\n* Lint: `flake8` (syntax/complexity)\n* Tests: `pytest` on Python 3.10\n\n---\n\n## Acknowledgements\n\n* Built on top of your Seurat v5 pipeline (SCT, optional Harmony, label transfer to Olah 2020).\n* Uses: Scanpy/AnnData, scikit-learn, (optional) XGBoost, and R packages for export.\n\n## Example results\n\n## 🧬 Context\n\nSCAML here is being used as a **machine learning framework for transcriptomic classification** — evaluating how well expression patterns from **induced pluripotent stem cell (iPSC)-derived microglia** at one differentiation stage (harvest) can predict or distinguish those from another.\n\n* **H4 (Harvest 4)** = Early differentiation stage.\n* **H8 (Harvest 8)** = Later, more mature differentiation stage.\n* Each has **three biological replicates**, and models are trained on one harvest and tested on the other to measure **cross-harvest generalization** — i.e., whether transcriptomic signatures of maturation are reproducible and not chip-specific.\n\nAnalysis of SCAML **H4-trained, H8-tested** model results from `summary_metrics.json`:\n\n| Algorithm                    | Accuracy  | F1-macro  | ROC-AUC (macro, OVR) | Interpretation                                                                   |\n| ---------------------------- | --------- | --------- | -------------------- | -------------------------------------------------------------------------------- |\n| **Logistic Regression (lr)** | 0.764     | 0.509     | 0.898                | Strong linear baseline; balanced precision/recall; good 
generalization.          |\n| **Random Forest (rf)**       | 0.731     | 0.363     | 0.900                | Slightly lower accuracy and poor F1 → likely class imbalance issue.              |\n| **XGBoost (xgb)**            | **0.771** | **0.544** | **0.901**            | Best performer overall — highest accuracy, F1, and AUC; robust ensemble learner. |\n| **MLP (mlp)**                | 0.745     | 0.389     | 0.871                | Decent but underperforms on F1 and AUC; possibly undertrained or overfitting.    |\n\n---\n\n## Model Performance (iPSC-microglia scRNA-seq)\n\n**Cross-harvest generalization**\n\n### Train on H4 → Test on H8\n![H4→H8 Metrics Bar](figures/scaml_bar_h4_to_h8.svg)\n![H4→H8 Metrics Radar](figures/scaml_radar_h4_to_h8.svg)\n\n### Train on H8 → Test on H4\n![H8→H4 Metrics Bar](figures/scaml_bar_h8_to_h4.svg)\n![H8→H4 Metrics Radar](figures/scaml_radar_h8_to_h4.svg)\n\n\n---\n\n### 🧠 Interpretation\n\n* **Best Algorithm:** ✅ **XGBoost**\n\n  * Highest **accuracy (77.1%)**, **F1-macro (0.54)**, and **ROC-AUC (0.90)** → the best balance between precision and recall among the four models, although a macro-F1 of 0.54 is still moderate.\n  * Performs slightly better than logistic regression (0.771 vs 0.764 accuracy), which suggests that nonlinear relationships in the data contribute, though the margin is small.\n* **Random Forest** shows good AUC but poor F1 → indicates it's predicting the majority class too often (class imbalance sensitivity).\n* **MLP (neural net)** underperforms, suggesting either limited data, suboptimal architecture, or insufficient epochs.\n\n---\n\n### 📊 Recommended next steps\n\n1. **Confirm class distribution** — the low F1 for RF/MLP hints at imbalance; consider **class weighting** or **SMOTE**.\n2. **Inspect confusion matrices** for XGB vs LR to see which cluster classes are misclassified.\n3. **Cross-validate XGB** on multiple train/test splits (not just H4→H8) to check consistency across harvests/chips.\n4. 
**Model explainability**: use SHAP to interpret XGB feature contributions — useful for biological relevance in transcript curation.\n\n---\n\n**Conclusion:**\n\n\u003e For the SCAML dataset (H4-trained, H8-tested), **XGBoost** is the most effective algorithm — it achieves the best balance of accuracy, class-balanced F1, and overall discrimination (ROC-AUC ≈ 0.90).\n\n\nAnalysis of **SCAML run where models were trained on H8 and tested on H4**:\n\n| Algorithm                    | Accuracy  | F1-macro  | ROC-AUC (macro OVR) | Interpretation                                                                   |\n| ---------------------------- | --------- | --------- | ------------------- | -------------------------------------------------------------------------------- |\n| **Logistic Regression (lr)** | 0.752     | 0.423     | 0.904               | Strong baseline; high AUC; slightly reduced F1 due to imbalance.                 |\n| **Random Forest (rf)**       | 0.744     | 0.350     | 0.899               | Similar to before — AUC fine, but poor recall on minority classes.               |\n| **XGBoost (xgb)**            | **0.779** | **0.571** | **0.905**           | Best across all metrics; consistent top performance between H4↔H8 tests.         |\n| **MLP (mlp)**                | 0.757     | 0.420     | 0.875               | Weaker AUC and F1 — again suggests underfitting or sensitivity to training size. 
|\n\n---\n\n### 🔍 Comparison to H4-trained → H8-tested results\n\n| Direction   | Best Model | Accuracy  | F1-macro  | ROC-AUC   | Notes                                |\n| ----------- | ---------- | --------- | --------- | --------- | ------------------------------------ |\n| **H4 → H8** | XGBoost    | 0.771     | 0.544     | 0.901     | Strong cross-harvest generalization  |\n| **H8 → H4** | XGBoost    | **0.779** | **0.571** | **0.905** | Slightly improved in both F1 and AUC |\n\n✅ **XGBoost again emerges as the top model** — and importantly, it generalizes well **in both directions** (H4→H8 and H8→H4).\nThis symmetry suggests the learned patterns are **stable across harvests/chips**, which supports biological rather than purely technical signal.\n\n---\n\n### 🧠 Interpretation summary\n\n* **Cross-harvest stability:** XGBoost achieves AUC ≈ 0.90 in both directions — excellent discriminative power.\n* **F1-macro ≈ 0.54–0.57**: moderate, and well below accuracy (≈ 0.77–0.78), so minority cluster labels are detected less reliably than majority ones.\n* **MLP underperformance** likely due to limited data or architecture mismatch.\n* **Logistic regression** performs surprisingly well given simplicity → worth keeping as a fast baseline.\n\n---\n\n### 🧠 Biological Interpretation\n\n* **Cross-harvest reproducibility:** High AUC values (~0.90) in both directions confirm that transcriptomic maturation signatures of iPSC-microglia are robust to chip and replicate variation.\n* **F1-macro (~0.54–0.57)** shows that per-class recall is uneven: labels learned on one harvest transfer to the other, but rarer populations are recovered less well.\n* **XGBoost** consistently outperforms simpler (LR) and more complex (MLP) models — suggesting that **nonlinear but structured gene-expression relationships** dominate differentiation dynamics.\n\n---\n\n### 🏁 Conclusion\n\n\u003e Across both training directions (H4 → H8 and H8 → H4), **XGBoost** is the most effective algorithm for classifying iPSC-microglial transcriptomic states across early (H4) and late (H8) harvests.\n\u003e It achieves the highest 
accuracy, balanced F1, and ROC-AUC, demonstrating that the maturation-associated gene-expression signatures it learns are biologically meaningful and reproducible across independent harvests and chips.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpasted%2Fscaml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpasted%2Fscaml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpasted%2Fscaml/lists"}