{"id":50469661,"url":"https://github.com/menchelab/ppi_layng","last_synced_at":"2026-06-01T09:32:40.515Z","repository":{"id":354367592,"uuid":"1165868723","full_name":"menchelab/ppi_layng","owner":"menchelab","description":" pipeline that fuses PPI + Gene Ontology graphs into protein embeddings and 3D layouts (UMAP/PaCMAP with de-compression diagnostics) for large interaction networks.","archived":false,"fork":false,"pushed_at":"2026-04-28T08:11:58.000Z","size":112,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-28T10:12:29.367Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/menchelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-24T16:24:21.000Z","updated_at":"2026-04-28T08:12:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/menchelab/ppi_layng","commit_stats":null,"previous_names":["menchelab/ppi_layng"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/menchelab/ppi_layng","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/menchelab%2Fppi_layng","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/menchelab%2Fppi_layng/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/menchelab%2Fppi_layng/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/menchelab%2Fppi_layng/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/menchelab","download_url":"https://codeload.github.com/menchelab/ppi_layng/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/menchelab%2Fppi_layng/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33769491,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-01T09:32:38.783Z","updated_at":"2026-06-01T09:32:40.495Z","avatar_url":"https://github.com/menchelab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VERY EXPERIMENTAL - PPI 3D layout w/ GO Ontology\n\nPipeline for 3D embeddings of large PPI networks (~20k nodes) by fusing the interaction graph with Gene Ontology: unified graph → node2vec walks → skip-gram embeddings → manifold projection (UMAP/PaCMAP) with optional de-compression (hub scaling, LOF denoise, force-directed refinement).\n\n**Input:** `input/edges.tsv` (columns `source`, `target`; protein IDs).  \n**Data:** `data/go-basic.obo`, `data/goa_human.gaf` (step 1 downloads if missing).  \n**Env:** RAPIDS (cudf, cugraph, cuml), PyTorch+CUDA, `pip install -r requirements.txt`. Optional: `trimap` for TriMAP in step 6.\n\n---\n\n## Process files (run in order)\n\n**`1_download_data.py`**  \nDownloads `go-basic.obo` and `goa_human.gaf.gz` to `data/` (skips if present). Decompresses GAF. Uses `config` URLs.\n\n**`2_build_graph.py`**  \nBuilds a single undirected graph: (1) GO term–term edges from OBO (`is_a`, `part_of` via obonet); (2) protein–GO edges from GAF, restricted to PPI proteins and OBO terms; (3) PPI edges from `edges.tsv`. Nodes get a linear index; node types (protein vs term) stored. Writes `output/edge_list.parquet` (int `src`, `dst`) and `output/node_map.parquet` (`int_id`, `str_id`, `node_type`). Uses cuGraph for construction.\n\n**`3_run_walks.py`**  \nLoads the graph with cugraph, runs node2vec (biased random walks with return/in-out params `p`, `q`). One walk per (node × `WALKS_PER_NODE`), length `WALK_LENGTH`. Outputs `output/walks.parquet` and `.npy` (shape num_walks × walk_length). `q \u003e 1` (config) favours structural equivalence.\n\n**`4_train_embeddings.py`**  \nReads walks, extracts (center, context) pairs with a fixed window; negative sampling. Trains a skip-gram model (PyTorch, GPU) to embed node IDs. Writes `output/embeddings.parquet` and `.npy` (all nodes × `EMBED_DIM`). Only protein rows are used later for layout.\n\n**`5_project_3d.py`**  \nSubsets embeddings to proteins, runs cuML UMAP (3 components, config `n_neighbors`, `min_dist`). Writes `output/layout.tsv`: `node_id`, `x`, `y`, `z`.\n\n**`5_multiproject_3d.py`** (alternative to 5)  \nSame inputs; runs 20 UMAP parameter variants + 3 PaCMAP variants; each gets columns `x_\u003cname\u003e`, `y_\u003cname\u003e`, `z_\u003cname\u003e`. Writes `output/layout_multi.tsv`.\n\n**`6_decompression_layouts.py`**  \nLoads protein embeddings. Applies three preprocesses (raw, zscore, lognorm) to reduce hub dominance. For each: (1) UMAP 3D (if cuML available); (2) PaCMAP 3D with expansion params (n_neighbors=70, MN_ratio=0.6, FP_ratio=2.0). LOF: top 1% “ambiguous” nodes (by LOF in high-D) excluded from projection, then placed back at k-NN centroid in high-D (PaCMAP and UMAP). Optional TriMAP on raw. Force-directed refinement (k-NN repulsion, ~75 iters) on selected PaCMAP and UMAP layouts. Single TSV: `output/layout_decompression.tsv` with one `x_\u003capproach\u003e`, `y_\u003capproach\u003e`, `z_\u003capproach\u003e` per method (e.g. `umap_zscore`, `pacmap_expansion_lof`, `umap_zscore_fd`).\n\n**`7_add_go_terms.py`**  \nFrom `edge_list` + `node_map`, infers protein → GO term IDs from annotation edges. Optionally loads OBO for ID→name; builds comma-separated `go_terms` (IDs) and `go_terms_readable` (names, commas stripped inside names). Merges into the chosen layout TSV (prefer `layout_decompression.tsv` \u003e `layout_multi.tsv` \u003e `layout.tsv`). Usage: `python 7_add_go_terms.py [layout.tsv] [out.tsv]`.\n\n**`8_distribution_charts.py`**  \nReads the final layout TSV (same precedence as step 7), discovers all `x_*`, `y_*`, `z_*` method columns, and plots one row per method with three density histograms (x, y, z). Writes `output/distribution_charts.png`. Usage: `python 8_distribution_charts.py [layout.tsv]`.\n\n**`10_add_generic_umap_layout.py`**  \nBuilds a protein-protein adjacency matrix from `edge_list.parquet`, runs a generic 3D UMAP on that adjacency-feature space, and appends `x_umap_adjacency`, `y_umap_adjacency`, `z_umap_adjacency` to the final TSV for side-by-side comparison. Usage: `python 10_add_generic_umap_layout.py [layout_in.tsv] [layout_out.tsv]`.\n\n**`11_add_generic_pacmap_layout.py`**  \nBuilds a protein-protein adjacency matrix from `edge_list.parquet`, runs a regular 3D PaCMAP baseline (non-expansion) on adjacency-derived features, and appends `x_pacmap_adjacency`, `y_pacmap_adjacency`, `z_pacmap_adjacency` to the final TSV. Usage: `python 11_add_generic_pacmap_layout.py [layout_in.tsv] [layout_out.tsv]`.\n\n**`12.1_umap_weighted_concat.py`**  \nHybrid feature fusion baseline: concatenate standardized model embeddings with standardized adjacency-SVD features using a tunable weight, then run UMAP. Appends `x/y/z_umap_fused_concat_*`. Usage: `python 12.1_umap_weighted_concat.py --weight 0.10`.\n\n**`12.2_umap_late_fusion.py`**  \nLate fusion baseline: run UMAP separately on embeddings and adjacency-SVD, align layouts (Procrustes), then blend in 3D with weight `w`. Appends `x/y/z_umap_fused_lateblend_*`. Usage: `python 12.2_umap_late_fusion.py --weight 0.20`.\n\n**`12.3_umap_distance_fusion.py`**  \nDistance fusion baseline: blend cosine distance matrices from embeddings and adjacency-SVD, then run UMAP with `metric=precomputed`. Appends `x/y/z_umap_fused_distance_*`. Usage: `python 12.3_umap_distance_fusion.py --weight 0.20` (memory-heavy on large N).\n\n**`12.4_umap_graph_diffusion.py`**  \nTopology smoothing baseline: diffuse embeddings over row-normalized PPI adjacency for `steps`, then UMAP. Appends `x/y/z_umap_graph_diffusion_*`. Usage: `python 12.4_umap_graph_diffusion.py --beta 0.80 --steps 2`.\n\n**`12.5_umap_multiview_knn_union.py`**  \nMulti-view graph baseline: build kNN affinity graphs from embeddings and adjacency-SVD, fuse them by weight, reduce via SVD, and run UMAP. Appends `x/y/z_umap_multiview_knn_union_*`. Usage: `python 12.5_umap_multiview_knn_union.py --weight 0.30 --knn 30`.\n\nCoordinate normalization is **config-driven** in `config_tune.py` (`LAYOUT_NORMALIZE_COORDS`, default `False`). Keep it off to preserve native manifold geometry; enable it only for cross-method visual comparability. To check per-method spread and flag collapsed layouts, run `python -m utils.distribution_analysis output/layout_decompression.tsv`.\n\n---\n\n## Integrity helper scripts\n\n- `python -m utils.check_graph_integrity` — validates node/edge type counts, annotation coverage, and degree structure from `edge_list` + `node_map`.\n- `python -m utils.check_walk_integrity` — validates walk padding, node-0 frequency, unique-node coverage, and per-position invalid rates.\n- `python -m utils.check_embedding_integrity` — checks embedding norms, near-constant dimensions, and sampled cosine-similarity spread.\n- `python -m utils.check_layout_integrity [layout.tsv]` — compares per-method spread, center mass, and sampled nearest-neighbor distances.\n- `python -m utils.pipeline_diagnostics [output/diagnostics_report.md]` — runs all major checks and writes one markdown report.\n\n---\n\n## Outputs\n\n| File | Contents |\n|------|----------|\n| `output/layout.tsv` | `node_id`, `x`, `y`, `z` (single UMAP). |\n| `output/layout_multi.tsv` | `node_id` + many `x_*/y_*/z_*` (UMAP/PaCMAP sweeps). |\n| `output/layout_decompression.tsv` | `node_id` + `x_*/y_*/z_*` for umap_*, pacmap_expansion_*, *_fd, optional trimap_3d. |\n| After step 7 | Same layout + `go_terms`, `go_terms_readable`. |\n| `output/distribution_charts.png` | One row per layout method, 3 cols (x, y, z distributions); from step 8. |\n\n**Config:** `config.py` (paths, graph/embed params), `config_tune.py` (tuned training and projection defaults).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmenchelab%2Fppi_layng","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmenchelab%2Fppi_layng","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmenchelab%2Fppi_layng/lists"}