An open API service indexing awesome lists of open source software.

https://github.com/drmsabdullahi/genpath-pph

GenPath-PPH: Python package for integrating gene expression and KEGG pathway networks via Persistent Path Homology to detect disease-relevant pathway dysregulation. Available on PyPI: pip install genpath-pph
https://github.com/drmsabdullahi/genpath-pph

bioinformatics pathway-analysis pathway-enrichment-analysis persistent-homology persistent-path-homology topological-data-analysis

Last synced: 11 days ago
JSON representation

GenPath-PPH: Python package for integrating gene expression and KEGG pathway networks via Persistent Path Homology to detect disease-relevant pathway dysregulation. Available on PyPI: pip install genpath-pph

Awesome Lists containing this project

README

          

# GenPath-PPH: Gene ExpressioN and PATHway network integration using Persistent Path Homology

This repository contains the code and supplementary material for the research article:
Abdullahi et al., "GenPath-PPH: Integrating Gene Expression and Pathway Networks via Persistent Path Homology Enhances Detection of Disease-Relevant Pathways", *Computational and Structural Biotechnology Journal*, 27, 5348-5362, 2025. https://doi.org/10.1016/j.csbj.2025.11.018

---

## Table of Contents

- [Introduction](#introduction)
- [Datasets](#datasets)
- [GenPath-PPH Analysis](#genpath-pph-analysis)
- [Project Dependencies](#project-dependencies)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [License](#license)
- [References](#references)

---

## Introduction

**GenPath-PPH** introduces a novel approach for detecting disease-relevant pathways by integrating gene expression data with biological pathway networks using **persistent path homology (PPH)**. This enables detection of pathways significantly altered in disease conditions, exemplified with **hepatocellular carcinoma (HCC)** from peripheral blood mononuclear cell (PBMC) samples.

---

## Datasets

We used the following datasets in our project.

- **RNA-Seq Gene Expression**: This comprises 26,575 genes from 34 samples (17 healthy, 17 disease). Original data from Han et al.[1], NCBI project PRJNA739257 [here](https://dataview.ncbi.nlm.nih.gov/object/PRJNA739257).

- **KEGG Pathways**: We obtained 251 metabolic and signaling pathways from the KEGG database using the packages; 'AnnotationDbi' (v1.66.0) and 'org.Hs.eg.db' (v3.19.1) in R (v4.4.1).

- **Pathways Networks**: Interaction networks (e.g., activations, inhibitions) parsed from KEGG Markup Language (KGML) files downloaded from the KEGG website (https://www.genome.jp/kegg/pathway.html).

---

## GenPath-PPH Analysis

The core methodology of **GenPath-PPH** consists of the following steps for identifying pathways that are most likely implicated in disease conditions.

### 1. **Data Preparation**
- **Data Peprocessing**: Transcript Per Million (TPM) normalization is applied to the raw gene expression data, followed by log₂ transformation to account for transcript length, sequencing depth, and to stabilize variance.

- **Pathway-Specific Gene Expression**: Gene expression subsets corresponding to each biological pathway are extracted by mapping genes to their respective pathways. This yields pathway-specific datasets representing gene activity patterns within each pathway.

- **Condition Grouping**: Samples are grouped into experimental conditions (e.g., **HCC** vs. **Control**) to examine how gene expression differences influence pathway topology.

### 2. **Persistent Path Homology (PPH) Computation**
For each pathway and condition, PPH is computed to capture (track) topological changes (variations) in the gene expression network:
- **Filtration Scale**: Ranges from 0 to 1, where 0 represents the strongest correlation between genes and 1 represents the weakest.
- **Step Size**: A step of 0.01 yields 101 Betti numbers per pathway and condition across the filtration scale [0,1].

The output comprises two sets of Betti numbers (for **HCC** and **Control**), describing how the network’s topological features evolve with varying correlation strength. This enables a comparative analysis of pathway network alterations between the two conditions.

### 3. **Statistical Significance**

- #### a. **Global Differences**:
Persistence landscapes (PLs) were obtained for each pathway output from the PPH computation. The average landscape was then computed for each group (**HCC** and **Control**). The difference between these average landscapes was quantified using the supremum norm ($\|\cdot\|$∞) and the $L_{1,2}$-norm, and the statistical significance of these differences was evaluated through permutation testing.

- #### b. **Pathway-Level Differences**:
Kolmogorov–Smirnov (KS) tests and Cohen’s $d$ were applied to quantify the statistical significance of the differences observed between the two conditions (**HCC** and **Control**) for each pathway through permutation testing.

The obtained $p$-values were corrected for multiple testing using the Benjamini–Hochberg (BH) method. The adjusted $p$-values or False Discovery Rate (FDR) threshold was set to $< 0.05$ throughout.

### 4. **PPH Implementation Details**

- **Dissimilarity Matrix**: This was constructed using $1 - |\rho|$, where $\rho$ is the Pearson correlation coefficient between genes.

- **Path Complex Construction**: For each pathway and condition, path complexes are built using gene point clouds obtained from the dissimilarity matrix and the underlying pathway network structure.

### 5. **Visualization**

Custom and library-based visualization scripts were implemented to illustrate both network-level and topological features.

- **Custom Visualizations**:
- **Correlation Heatmaps** showing inter-gene relationships within each pathway.
- **Betti Number Evolution Plots** across filtration values.
- **PCA Plots** for sample clustering.
- **UpSet Plots** for pathway overlap and intersection analysis.

- **Library-Based Visualizations (Gudhi)**:
- **Persistence Diagrams** using `plot_persistence_diagram`.
- **Persistence Barcodes** using `plot_persistence_barcode`.

These visualizations provide both qualitative and quantitative insights into the topological evolution and expression variability of pathways under different biological conditions.

---

## Project Dependencies

The following dependencies are required for running the analysis:

### 1. **Python Libraries**
The following Python libraries are required:

- **`persim==0.2.1`**: Used for persistence landscape (PL) related computations.
- **`scipy==1.14.1`**: Used for Kolmogorov-Smirnov (KS) test statistics and effect size computations.
- **`statsmodels==0.13.5`**: Used for applying the Benjamini–Hochberg correction method for multiple testing.
- **`biopython==1.76`**: Used for bioinformatics functionalities.
- **`seaborn==0.11.2`**: Used for advanced statistical visualizations.
- **`matplotlib==3.4.3`**: Used for general plotting.
- **`mpl_toolkits==3.1.1`**: Used for 3D plotting.

### 2. **R Libraries**
The following R packages are required:

- **`annotationDbi==1.66.0`**: Used for database management and gene annotation.
- **`org.Hs.eg.db==3.19.1`**: Homo sapiens genome annotation package.
- **`edgeR==3.42.4`**: Used for data normalization and transformation.

### 3. **GenPath-PPH Core Framework**
This framework builds on a modified version of [PathHom](https://github.com/WeilabMSU/PathHom), where correlation-based dissimilarity and filtration-level edge interaction tracking functions were added to enhance the PPH computation.

### 4. **External Tools Used for Comparison**
The following packages were used only for benchmarking or comparison:
- **`DESeq2 (R package)`**: Used for differential gene expression (DEG) analysis.
- **`gseapy==v1.1.3`**: Used for Gene Set Enrichment Analysis (GSEA).

---

## Core Module and Enhancements

The main computation is implemented in `core.py` through the GenPathHomology class (an extended version of PathHomology). This module performs Persistent Path Homology (PPH) analysis on pathway networks by integrating:
- **Point clouds** derived from gene expression data.
- **Directed graphs (digraphs)** obtained from KEGG pathway networks.

Integration is achieved by masking the point cloud with the corresponding digraph, combining both gene-level and network-level information into a unified representation.

### Enhancements in GenPath-PPH

Compared to the original PathHom implementation, this version introduces several improvements:

- **Flexible distance metrics**: Users can choose among multiple distance options for computing pairwise relationships between genes:
- `euclidean` (used in PathHom)
- `1 - correlation`
- `1 - |correlation|` (default for GenPath-PPH)

- **Filtration-level edge tracking**: In addition to Betti number distributions (series), the function also returns the set of edges active (present) at each filtration level, allowing detailed inspection of the evolving topological structure.

- **Integration-ready output**:
For each pathway and condition, the framework produces:
- Betti number series describing topological changes across filtrations.
- Filtration-level edge sets tracking the evolution of gene interactions.

These outputs can be directly integrated into downstream statistical, biological, or visualization analyses (e.g., group comparisons using persistence landscapes, pathway-level significance testing, tracking gene-gene connectivity dynamics across filtrations to identify stable or condition-specific interactions, quantifying edge persistence or node centrality changes to reveal rewired subnetworks, visualizing edge evolution heatmaps, persistence barcodes, or module reorganization across filtrations).

---

## Installation

### Option 1 — Install from PyPI (recommended)
```bash
pip install genpath-pph
```

### Option 2 — Install from source
```bash
# Clone the repository
git clone https://github.com/DrMSAbdullahi/GenPath-PPH.git
cd GenPath-PPH
```

```bash
# Install the necessary packages
pip install -r requirements.txt
```

---

## Usage

### Quick start (new high-level API)

#### One-liner: single pathway
```python
from genpath_pph import run_pathway
import numpy as np

# X_disease, X_control : np.ndarray (n_genes, n_samples)
# adj_matrix : np.ndarray (n_genes, n_genes)

result = run_pathway(X_disease, X_control, adj_matrix)
print(result)
# PathwayResult(SIGNIFICANT)
# β₀ → KS=0.4231 perm-p=0.002 Cohen-d=0.88 mean-diff=2.31
# β₁ → KS=0.3812 perm-p=0.014 Cohen-d=0.72 mean-diff=0.98
```

#### One-liner: batch over many pathways
```python
from genpath_pph import run_batch
from genpath_pph import extract_adjacency_matrices, extract_pathway_expressions

adj_matrices = extract_adjacency_matrices(select_path_ids, ...)
pathway_exprs = extract_pathway_expressions(select_path_ids, ...)

results_df = run_batch(
pathway_ids = select_path_ids,
adj_matrices = adj_matrices,
pathway_exprs = pathway_exprs,
class_size = 17,
n_permutations = 5000,
)
significant = results_df[results_df["significant"]]
print(f"{len(significant)} significant pathways found")
```

#### Object-oriented interface
```python
from genpath_pph import GenPathAnalysis

model = GenPathAnalysis(n_permutations=1000)
model.fit(X_disease, X_control, adj_matrix)

b0_d, b1_d = model.betti_series("disease")
b0_c, b1_c = model.betti_series("control")
delta0, delta1 = model.delta_betti()
phases = model.phase_summary() # low / mid / high filtration breakdown
result = model.test()
```

### Import the package in Python (low-level API)

```python
from genpath_pph import GenPathHomology, PathwayDataProcessor
import numpy as np

# Suppose you have a toy adjacency matrix for a pathway
adj_matrix = np.array([
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]
])

# Example edges extracted from adjacency matrix
rows, cols = np.nonzero(adj_matrix)
edges = np.column_stack((rows, cols))

# Simulated gene expression for 3 genes, 4 samples
expression_data = np.random.rand(3, 4)

# Initialize PPH object
pph = GenPathHomology()

# Generate allowed paths (dimension 2)
allowed_paths = pph.utils_generate_allowed_paths(edges, max_path=2)

# Compute boundaries
boundary_matrices, ranks, path_idx_flags = pph.utils_unlimited_boundary_operator(allowed_paths, max_path=2)

# Compute Betti numbers for connected digraph
betti_numbers = pph.path_homology_for_connected_digraph(allowed_paths, max_path=2)

# Print the results
print("Allowed paths:", allowed_paths)
print("Boundary matrices:", boundary_matrices)
print("Betti numbers:", betti_numbers)
```

---

### Built-in toy dataset

```python
from genpath_pph import GenPathHomology, load_toy_data

# Load built-in toy expression data and pathway edges
df, edges, filtration = load_toy_data()

# Split into control and disease groups
X_control = df[["C1", "C2", "C3"]].values
X_disease = df[["D1", "D2", "D3"]].values

pph = GenPathHomology()

# Compute Betti-0 for each group
betti_0_control, _ = pph.persistent_path_homology_from_digraph(
X_control, edges, target_dimension=0, filtration=filtration
)
betti_0_disease, _ = pph.persistent_path_homology_from_digraph(
X_disease, edges, target_dimension=0, filtration=filtration
)

print("Control β₀:", [int(x) for x in betti_0_control])
print("Disease β₀:", [int(x) for x in betti_0_disease])
```

### Run some examples

**1. Toy example test** - This is a small simulated gene expression dataset and toy graph used in our manuscript to demonstrate PPH computation:
```bash
python core.py --test
```
Running the above will:
- Generate a small simulated gene expression dataset.
- Compute persistent path homology on a toy graph.
- Print Betti numbers for each dimension across the filtration.

**2. Example for real pathways** - Demonstrates PPH computation for real biological pathways (p53 signaling pathway and ferroptosis):

```bash
python examples/p53_signaling.py
python examples/ferroptosis.py
```

**3. Jupyter Notebook examples** - Detailed analyses on HCC gene expression data, including:
- Comparing PH vs PPH on a toy example.
- Global difference analysis across HCC samples.
- Pathway-level analysis to identify significant pathways.
```bash
jupyter notebook notebooks/toy_example_ph_vs_pph.ipynb
jupyter notebook notebooks/genpath_pph_global.ipynb
jupyter notebook notebooks/genpath_pph_pathway_level.ipynb
```

These examples demonstrate how to compute PPH (Betti number computation), visualize topological features, and interpret identified pathways.

---

## Project Structure

```text
GenPath-PPH/

├── README.md
├── LICENSE.txt
├── requirements.txt # Python dependencies
├── pyproject.toml # PyPI packaging config
├── .github/
│ └── workflows/
│ └── publish.yml # Auto-publish to PyPI
├── genpath_pph/ # Core package
│ ├── __init__.py
│ ├── core.py # Core PPH computations
│ ├── utils.py # Utility functions (allowed paths, boundaries, etc.)
│ ├── run_analysis.py # Optional analysis scripts for custom workflows
│ └── api.py # High-level one-liner API
├── examples/ # Example scripts for toy and real pathways
│ ├── toy_example.py
│ ├── p53_signaling.py
│ └── ferroptosis.py
├── notebooks/ # Jupyter notebooks for detailed analyses
│ ├── toy_example_ph_vs_pph.ipynb # Compare PH vs PPH on toy data
│ ├── genpath_pph_global.ipynb # Global difference analysis on HCC data
│ └── genpath_pph_pathway_level.ipynb # Pathway-level analysis on HCC data
└── data/ # Datasets used/created
```

---

## License

### Code License

The code in this repository is licensed under the **MIT License**. You are free to use, modify, and distribute the code, provided that the following conditions are met:

- The original copyright notice and the MIT License text must be included in all copies or substantial portions of the code.
- The code is provided "as is," without warranty of any kind.

For the full MIT License text, please see the [LICENSE](./LICENSE.txt) file.

### Data License

The data in this repository is licensed under the **Creative Commons Attribution 4.0 International License (CC BY 4.0)**. You are free to:

- **Share**: Copy and redistribute the material in any medium or format.
- **Adapt**: Remix, transform, and build upon the material for any purpose, including commercial purposes.

Under the following conditions:

- **Attribution**: You must provide appropriate credit, give a link to the license, and indicate if changes were made. You may do this in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- **No additional restrictions**: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

The data is provided "as is," without any warranty of fitness for a particular purpose. Users are solely responsible for their use of the data and any consequences thereof.

For the full Creative Commons Attribution 4.0 International License text, please visit the [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).

---

#### References

[1] Han, Z., Feng, W., Hu, R., Ge, Q., Ma, W., Zhang, W., Xu, S., Zhan, B., Zhang, L., Sun, X. et al. (2021). 'RNA-seq profiling reveals PBMC RNA as a potential biomarker for hepatocellular carcinoma.' Sci. Reports, 11, 17797.

[2] Chen, D., Liu, J., Wu, J., Wei, G.-W., Pan, F., Yau, S.-T. (2023). 'Path topology in molecular and materials sciences'. The Journal of Physical Chemistry Letters 14 (4), 954–964.

[3] Chowdhury, S., Mémoli, F. (2018). 'Persistent path homology of directed networks'. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1152–1169.

---

#### Citation

If you use this codebook in your research or publication, please cite:

[Abdullahi, M.S.; Piro, R.M.; Suratanee, A.; Plaimas, K. "GenPath-PPH: Integrating Gene Expression and Pathway Networks via Persistent Path Homology Enhances Detection of Disease-Relevant Pathways". Computational and Structural Biotechnology Journal (Submitted), 27, 5348-5362, 2025. https://doi.org/10.1016/j.csbj.2025.11.018]

For questions or additional permissions, contact [abdullahi.sirajo@udusok.edu.ng].