# GRNFormer - Accurate Gene Regulatory Network Inference Using Graph Transformer
[![DOI](https://zenodo.org/badge/1170957484.svg)](https://doi.org/10.5281/zenodo.18868394)

GRNFormer is an advanced variational graph transformer autoencoder model designed to accurately infer regulatory relationships between transcription factors (TFs) and target genes from single-cell RNA-seq transcriptomics data, while
supporting generalization across species and cell types.

![GRNFormer](./GRNFormer_overview.png?raw=true "The Overview of GRNFormer Pipeline")

## Overview

GRNFormer consists of three main novel designs:

1. **TFWalker**: A de novo transcription-factor-centered subgraph sampling method that extracts the local co-expression neighborhood of a transcription factor (TF) to facilitate GRN inference.

2. **End-to-End Learning**:
   - **GeneTranscoder**: A transformer encoder representation module for encoding single-cell RNA-seq (scRNA-seq) gene expression data across different species and cell types.
   - A graph transformer model with a GRNFormer encoder and a variational GRNFormer decoder, coupled with a GRN inference module, for the reconstruction of GRNs.

3. **Novel Inference Strategy**: Incorporates both node features and edge features to infer GRNs from gene expression data of any length.

### Pipeline

Given a scRNA-seq dataset, a gene co-expression network is first constructed, from which a set of subgraphs is sampled by TFWalker. The subgraphs are processed by GeneTranscoder to generate node and edge embeddings, which are fed to the variational graph transformer autoencoder to learn a GRN representation. The representation is used to infer a gene regulatory sub-network for each subgraph. The sub-networks are aggregated to construct the full GRN.

## Installation

### Prerequisites

- Python 3.11+
- CUDA-capable GPU (recommended for training)
- Conda or Miniconda

### Setup

1. Clone the repository:

```bash
git clone https://github.com/BioinfoMachineLearning/GRNformer.git
cd GRNformer
```
2. Set up the conda environment and install the necessary packages using the setup script:

```bash
bash setup.sh
```

Alternatively, you can manually create the environment:

```bash
conda env create -f environment.yml
conda activate grnformer
```

## Usage

### Quick Start: Inference on Your Data

Run GRNFormer inference on a sample gene expression file:

```bash
python infer_grn.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --output_file /path/to/predicted-edges.csv \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100
```

**Input File Formats:**
- `expression-file.csv`: Gene expression matrix with genes as rows and cells as columns (or vice versa; the script handles both orientations)
- `listoftfs.csv`: List of transcription factor gene names (one per line or comma-separated)
- `output_file`: Path where the predicted GRN edges will be saved (CSV format: source, target, weight/score)

**Optional Parameters:**
- `--coexpression_threshold` (default: 0.1): Threshold for constructing the co-expression network. Lower values result in denser networks, while higher values create sparser networks.
- `--max_subgraph_size` (default: 100): Maximum number of nodes in each TF-centered subgraph sampled by TFWalker. Adjust based on your dataset size and computational resources.
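To illustrate what these two parameters control, the following is a simplified sketch (with a toy synthetic matrix and hypothetical gene names, not the repository's TFWalker implementation): threshold an absolute Pearson co-expression matrix, then cap the size of a TF-centered neighborhood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 6 genes x 50 cells (genes as rows, as in expression-file.csv).
genes = ["TF1", "G1", "G2", "G3", "G4", "G5"]
expr = rng.normal(size=(6, 50))
expr[1] += 0.8 * expr[0]  # make G1 strongly co-expressed with TF1

# Absolute Pearson correlation as a simple co-expression measure.
corr = np.abs(np.corrcoef(expr))
np.fill_diagonal(corr, 0.0)

threshold = 0.1  # --coexpression_threshold: lower => denser network
edges = {(genes[i], genes[j])
         for i in range(len(genes))
         for j in range(i + 1, len(genes))
         if corr[i, j] > threshold}

# TF-centered neighborhood, capped like --max_subgraph_size.
max_subgraph_size = 100
neighborhood = {g for a, b in edges if "TF1" in (a, b) for g in (a, b)}
subgraph_nodes = sorted(neighborhood)[:max_subgraph_size]
print(len(edges), subgraph_nodes)
```

With a higher threshold, fewer pairs survive and each TF-centered subgraph shrinks accordingly.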
### Evaluation with Ground Truth
<details>
  <summary>Standard, custom, and general evaluation</summary>

### Standard Evaluation

Run GRNFormer to evaluate performance when a ground-truth network is available:

```bash
python eval_grn.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv
```

In addition to `predicted-edges.csv` and `predicted-edges-metrics.csv`, the evaluation also writes `<output_file>_covered_edges.csv`, which contains the TF→gene edges covered by the TFWalker input (derived from the subgraph construction). This file can be passed to `scripts/general_grn_evaluation.py` via `--covered_edges` to ensure only covered edges are evaluated and to compute coverage.

**Additional Input:**
- `ground-truth-network.csv`: Ground-truth network edges (CSV format: source, target)

#### Custom Evaluation with Configurable Parameters

For evaluation with a custom co-expression threshold and subgraph size:

```bash
python eval_grn_custom.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv \
    --ckpt_path /path/to/checkpoint.ckpt \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100
```

**Additional Parameters:**
- `--ckpt_path`: Path to the trained model checkpoint file
- `--coexpression_threshold` (default: 0.1): Threshold for co-expression network construction
- `--max_subgraph_size` (default: 100): Maximum subgraph size for TFWalker sampling
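Evaluation against a ground-truth network amounts to checking how highly the true edges rank among the scored predictions. As a toy illustration (hypothetical edge names, not the repository's code), an early-precision check over a `predicted-edges.csv`-style table might look like:

```python
# Hypothetical predicted edges as (source, target, score) rows,
# mirroring the predicted-edges.csv format (source, target, weight/score).
predicted = [("TF1", "G1", 0.92), ("TF1", "G2", 0.35),
             ("TF2", "G1", 0.80), ("TF2", "G3", 0.10)]
# Ground-truth network as (source, target) pairs,
# mirroring ground-truth-network.csv.
truth = {("TF1", "G1"), ("TF2", "G3")}

# Early precision: fraction of the top-k scored edges that are true edges.
k = 2
top_k = sorted(predicted, key=lambda e: e[2], reverse=True)[:k]
precision_at_k = sum((s, t) in truth for s, t, _ in top_k) / k
print(precision_at_k)  # top-2 edges are TF1->G1 (true) and TF2->G1 (false) -> 0.5
```

The evaluation scripts compute fuller metrics (AUROC, AUPR, EPR@K) over the same kind of scored edge list.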
### Perturbation Evaluation

Evaluate model robustness under various perturbation conditions (noise and dropout).

**Single test with a specific perturbation:**

```bash
python eval_grn_perturb.py \
    --single_test \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv \
    --ckpt_path /path/to/checkpoint.ckpt \
    --noise_std 0.1 \
    --dropout_fraction 0.05 \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100
```

**Full perturbation sweep** (tests multiple noise and dropout levels):

```bash
python eval_grn_perturb.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv \
    --ckpt_path /path/to/checkpoint.ckpt \
    --noise_levels 0.0 0.05 0.1 0.15 0.2 \
    --dropout_levels 0.0 0.05 0.1 0.15 \
    --output_dir ./outputs/perturbation_results \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100
```

**Perturbation Parameters:**
- `--noise_std`: Standard deviation of the Gaussian noise added to the expression data (for a single test)
- `--dropout_fraction`: Fraction of genes to randomly drop (for a single test)
- `--noise_levels`: Space-separated list of noise levels for the sweep (e.g., `0.0 0.05 0.1 0.15 0.2`)
- `--dropout_levels`: Space-separated list of dropout fractions for the sweep (e.g., `0.0 0.05 0.1 0.15`)
- `--absolute_noise`: Use absolute noise values instead of scaled ones (by default, noise is scaled relative to the data standard deviation)
- `--output_dir`: Directory in which to save the perturbation sweep results
- `--coexpression_threshold` (default: 0.1): Threshold for co-expression network construction
- `--max_subgraph_size` (default: 100): Maximum subgraph size for TFWalker sampling
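The two perturbation types above can be sketched as follows (assumed semantics based on the parameter descriptions; the script's exact implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(42)
expr = rng.lognormal(size=(100, 30))  # toy matrix: 100 genes x 30 cells

# Gaussian noise scaled relative to the data standard deviation
# (the default behavior; --absolute_noise would skip the scaling).
noise_std = 0.1
noisy = expr + rng.normal(scale=noise_std * expr.std(), size=expr.shape)

# Randomly drop a fraction of genes, as with --dropout_fraction.
dropout_fraction = 0.05
n_drop = int(round(dropout_fraction * expr.shape[0]))
dropped = rng.choice(expr.shape[0], size=n_drop, replace=False)
perturbed = np.delete(noisy, dropped, axis=0)
print(perturbed.shape)  # (95, 30): 5 of 100 genes removed
```

The sweep mode simply repeats inference and evaluation over each (noise level, dropout level) combination.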
### Complete GRN Evaluation (clean negative pool, sampling, full-matrix, EPR)

GRNFormer's complete evaluation proceeds in two stages:

1. **Clean negative pool construction**

   From the expression matrix and the ground-truth network, we construct a **clean negative evaluation pool**. This pool contains all ordered gene–gene pairs `(g1, g2)` with `g1 != g2` in the expression gene set, **excluding**:

   - all known positive TF–target edges from the reference network, and
   - any training negatives you optionally provide.

   This ensures that the negatives used for evaluation do not overlap with known positives or training negatives.

2. **Metric computation**

   Using the clean negative pool, the ground-truth positives, and the full predicted TF–gene adjacency, we compute:

   - sampled AUROC/AUPR (with bootstrapping),
   - full-matrix AUROC/AUPR over the entire clean pool,
   - early precision (EPR@K),
   - coverage of the ground-truth network by the TFWalker subgraphs.

---

#### Step 1: Build the clean negative evaluation pool

Script: `scripts/create_clean_eval_pool.py`

**Purpose**

- Define a clean set of negative TF–gene candidates for evaluation, consistent across methods and runs.

**Arguments**

- `--expression`  
  Path to `ExpressionData.csv`. Genes in the index define the gene universe.

- `--network`  
  Path to the reference regulatory network (`refNetwork.csv`). All TF–target pairs in this file are treated as positives and excluded from the clean pool.

- `--training_negatives` (optional)  
  One or more CSV files with training negatives (e.g., negatives sampled during model training). Any pairs in these files are also excluded from the clean pool.

- `--output`  
  Path to the output CSV, typically named `clean_evaluation_pool_all_pairs.csv`. The file contains all remaining TF–gene candidate pairs and is used as the negative universe for evaluation.
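The pool construction in Step 1 reduces to simple set arithmetic. A toy illustration with hypothetical gene names (not the script itself):

```python
from itertools import permutations

genes = ["TF1", "TF2", "G1", "G2"]          # gene universe from the expression index
positives = {("TF1", "G1"), ("TF2", "G2")}  # reference-network edges
training_negatives = {("TF1", "G2")}        # optional --training_negatives pairs

# All ordered pairs (g1, g2) with g1 != g2, minus known positives
# and training negatives.
clean_pool = set(permutations(genes, 2)) - positives - training_negatives
print(len(clean_pool))  # 4*3 = 12 ordered pairs, minus 3 excluded -> 9
```

Because the exclusions are applied up front, no evaluation negative can leak from the known positives or from pairs the model saw as training negatives.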
**Example**

```bash
python scripts/create_clean_eval_pool.py \
  --expression /path/to/ExpressionData.csv \
  --network /path/to/refNetwork.csv \
  --output /path/to/clean_evaluation_pool_all_pairs.csv
```

#### Step 2: Run the general GRN evaluation

Script: `scripts/general_grn_evaluation.py`

**Purpose**

Evaluate GRNFormer predictions against the ground-truth regulatory network using the clean negative pool and TFWalker coverage.

**Inputs**

- `--positives`  
  Ground-truth regulatory network (e.g., `refNetwork.csv` or `master_test.csv`). If a `label` / `Label` column exists, only `label == 1` rows are used.

- `--clean_negatives`  
  Clean negative pool from Step 1 (e.g., `clean_evaluation_pool_all_pairs.csv`).

- `--predictions`  
  Full TF–gene adjacency with prediction scores (e.g., `predictedNetwork.csv`), as produced by `eval_grn.py`.

- `--expression`  
  Expression matrix (`ExpressionData.csv`, genes in the index). This defines the gene universe and filters positives/negatives/predictions.

- `--tfs`  
  TF list (`TFs.csv`). Positives are restricted to TF→gene edges whose source is in this TF list and in the expression gene set.

- `--covered_edges` (optional but recommended)  
  CSV listing the TF→gene edges covered by the TFWalker subgraphs (e.g., `Gene1,Gene2`, derived from `edge_index_unique`). This encodes which ground-truth TF→gene interactions are reachable in the TF-centered subgraphs and is used to restrict evaluation to covered edges and to compute coverage.
- `--sampled_neg_ratio`  
  Ratio of sampled negatives to positives for the sampled evaluation (default: 1.0).

- `--epr_k`  
  Comma-separated K values for EPR@K (default: K = number of positives).

- `--output_json`  
  Path to save all metrics in JSON format.

**Example**

```bash
python scripts/general_grn_evaluation.py \
  --positives /path/to/refNetwork.csv \
  --clean_negatives /path/to/clean_evaluation_pool_all_pairs.csv \
  --predictions /path/to/predictedNetwork.csv \
  --expression /path/to/ExpressionData.csv \
  --tfs /path/to/TFs.csv \
  --covered_edges /path/to/predictedNetwork_covered_edges.csv \
  --sampled_neg_ratio 1.0 \
  --epr_k 10,50,100 \
  --output_json /path/to/metrics.json
```

**Outputs**

The JSON produced by `--output_json` contains the following key fields:

- **Counts**
  - `total_positives_in_file`  
    Number of TF→gene positives in the ground-truth file after TF/expression filtering.
  - `n_positives_with_predictions`  
    Number of positives actually evaluated (after intersecting with `--covered_edges`, if provided).
  - `positive_coverage`  
    Fraction of ground-truth TF→gene edges covered by the TFWalker subgraphs: `n_positives_with_predictions / total_positives_in_file`.
  - `n_full_negatives`  
    Size of the clean negative pool.
  - `n_sampled_negatives`  
    Number of negatives used in each sampled evaluation run.

- **Sampled metrics (per-run and bootstrapped)**
  - `sampled_auroc`, `sampled_aupr`  
    AUROC and AUPR for a single sampled negative set.
  - `sampled_auroc_mean`, `sampled_auroc_std`  
    Mean and standard deviation of the sampled AUROC over 100 bootstrap repeats.
  - `sampled_aupr_mean`, `sampled_aupr_std`  
    Mean and standard deviation of the sampled AUPR (average precision) over 100 bootstrap repeats.
- **Full-matrix metrics**
  - `full_auroc`, `full_aupr`  
    AUROC and AUPR computed using all positives vs. all negatives in the clean evaluation pool.

- **Early Precision (EPR)**
  - `epr@K`  
    Early precision values at the K values specified via `--epr_k` (plus `K = number of positives` if not already included).

</details>

## Evaluation on Test Datasets
<details>
  <summary>Click to see the details</summary>

### Download BEELINE Datasets

Download the BEELINE scRNA-seq datasets:

```bash
python collect_data.py --data_dir ./Data/scRNA-seq/
```

The downloaded datasets can be found in:
- `Data/scRNA-seq/` - Expression data
- `Data/scRNA-seq-Networks/` - Network data

### Run Evaluation Pipeline

Run the evaluation pipeline on test datasets with all subset creations:

```bash
python evaluation_pipeline.py \
    --dataset_file Data/mESC.csv \
    --output_dir ./outputs/evaluation
```
</details>

## Training from Scratch
<details>
  <summary>Click to see the details</summary>

### 1. Prepare Datasets

Download the BEELINE scRNA-seq datasets:

```bash
python collect_data.py --data_dir ./Data/scRNA-seq/
```

**Note:** Before beginning training, copy all the regulatory networks (`Non-specific-Chip-seq-network.csv`, `STRING-network.csv`, `[cell-type]-Chip-seq-network.csv`) and the `TFs.csv` file to the corresponding cell-type datasets in `./Data/scRNA-seq/[cell-type]/`.
### 2. Combine Networks

For generalization training, GRNFormer combines all the networks for every training dataset:

```bash
python dataset_combiner.py \
    --cell-type-network ./Data/scRNA-seq/hESC/hESC-Chip-seq-network.csv \
    --non-specific-network ./Data/scRNA-seq/hESC/Non-specific-Chip-seq-network.csv \
    --string-network ./Data/scRNA-seq/hESC/STRING-network.csv \
    --output-file ./Data/scRNA-seq/hESC/hESC-combined.csv
```

### 3. Create Dataset Splits

Create the dataset and splits for training, validation, and testing:

```bash
python create_dataset.py \
    --dataset_dir ./Data/scRNA-seq \
    --dataset_name ./Data/train_list.csv
```

### 4. Train the Model

Train the model from scratch using the configuration file:

```bash
python main.py fit --config config/grnformer.yaml
```

You can customize training parameters by editing `config/grnformer.yaml` or by passing command-line arguments.
</details>

## Datasets

### Available Datasets

- **BEELINE**: https://zenodo.org/records/3701939
- **DREAM5**: https://www.synapse.org/Synapse:syn2787209/wiki/70351
- **PBMC3k**: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k
- **Preprocessed PBMC**: Can be accessed from the `scanpy` Python package

## Project Structure

```
GRNformer/
├── src/
│   ├── models/
│   │   └── grnformer/
│   │       ├── model.py          # Main GRNFormer model
│   │       └── network.py        # Network architecture
│   └── datamodules/
│       ├── grn_datamodule.py     # Training data module
│       ├── grn_dataset_inference.py  # Inference dataset
│       └── grn_dataset_test.py   # Test dataset
├── config/
│   └── grnformer.yaml            # Training configuration
├── main.py                       # Training entry point
├── infer_grn.py                  # Inference script
├── eval_grn.py                   # Standard evaluation script
├── eval_grn_custom.py            # Custom evaluation with configurable parameters
├── eval_grn_perturb.py           # Perturbation evaluation script
├── scripts/general_grn_evaluation.py  # General GRN evaluation (sampled/full AUROC/AUPR, EPR, coverage)
├── scripts/create_clean_eval_pool.py  # Clean negative pool construction
├── evaluation_pipeline.py        # Full evaluation pipeline
├── create_dataset.py             # Dataset creation
├── dataset_combiner.py           # Network combination
├── collect_data.py               # Data download
└── environment.yml               # Conda environment
```

## Citation

If you use GRNFormer in your research, please cite:

```bibtex
@article{Hegde2025.01.26.634966,
  author = {Hegde, Akshata and Cheng, Jianlin},
  title = {GRNFormer: Accurate Gene Regulatory Network Inference Using Graph Transformer},
  elocation-id = {2025.01.26.634966},
  year = {2025},
  doi = {10.1101/2025.01.26.634966},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/01/27/2025.01.26.634966},
  eprint = {https://www.biorxiv.org/content/early/2025/01/27/2025.01.26.634966.full.pdf},
  journal = {bioRxiv}
}
```

## License

See the [LICENSE](LICENSE) file for details.

## Contact

For questions or issues, please open an issue on the GitHub repository.