{"id":19154623,"url":"https://github.com/bcgsc/impala","last_synced_at":"2025-08-10T13:12:50.004Z","repository":{"id":65388280,"uuid":"590257376","full_name":"bcgsc/IMPALA","owner":"bcgsc","description":"Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations","archived":false,"fork":false,"pushed_at":"2023-10-31T22:22:54.000Z","size":21016,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-22T21:28:03.772Z","etag":null,"topics":["allele-specific-expression","bioinformatics","cancer","snakemake-workflow"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bcgsc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-18T01:40:52.000Z","updated_at":"2024-11-03T01:26:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"f3d3ed1b-a75d-46fc-b026-8f19b11e24cb","html_url":"https://github.com/bcgsc/IMPALA","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/bcgsc/IMPALA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2FIMPALA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2FIMPALA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2FIMPALA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2FIMPALA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/bcgsc","download_url":"https://codeload.github.com/bcgsc/IMPALA/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcgsc%2FIMPALA/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269729222,"owners_count":24465786,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["allele-specific-expression","bioinformatics","cancer","snakemake-workflow"],"created_at":"2024-11-09T08:27:38.437Z","updated_at":"2025-08-10T13:12:49.986Z","avatar_url":"https://github.com/bcgsc.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"res/impala_logo.png\" width=\"280\" height=\"280\"\u003e\n\n# Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations \n[![DOI](https://zenodo.org/badge/590257376.svg)](https://zenodo.org/badge/latestdoi/590257376)\n[![example workflow](https://github.com/bcgsc/IMPALA/actions/workflows/run_snakemake.yaml/badge.svg)](https://github.com/bcgsc/IMPALA/actions/workflows/run_snakemake.yaml)\n[![Snakemake](https://img.shields.io/badge/snakemake-≥5.6.0-brightgreen.svg?style=flat)](https://snakemake.readthedocs.io)\n\nThis Snakemake workflow calls allele-specific expression genes using short-read RNA-seq. 
Phasing information derived from long-read data by tools such as WhatsHap can be provided to increase the performance of the tool, and to link results to features of interest. Copy number variant data, allelic methylation data and somatic variant data can also be provided to analyze genes with allele-specific expression.\n\n\nTable of Contents\n=================\n\n* **[Overall Workflow](#overall-workflow)**\n* **[Installation](#installation)**\n  * [Dependencies](#dependencies)\n* **[Input Files](#input-files)**\n  * [Optional inputs](#optional-inputs)\n* **[Running Workflow](#running-workflow)**\n  * [Edit config file](#edit-the-config-files)\n  * [Running snakemake workflow](#run-snakemake)\n* **[Output Files](#output-files)**\n  * [Summary Output](#summary-table-description)\n  * [Example Figures](#example-figures)\n* **[Contributors](#contributors)**\n* **[License](#license)**\n\n\n# Overall Workflow\n\u003cimg src=\"res/IMPALA_workflow.jpg\" width=90%\u003e\n\n\u003cbr\u003e\n\n# Installation\nClone the repository with the command below. You can then run IMPALA from within this directory.\n```\ngit clone https://github.com/bcgsc/IMPALA.git\n```\n\n### Dependencies\n\u003e To run this workflow, you must have snakemake (v6.12.3) and singularity (v3.5.2-1.1.el7) installed. You can install snakemake using [this guide](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) and singularity using [this guide](https://docs.sylabs.io/guides/3.5/admin-guide/installation.html). 
The remaining dependencies will be downloaded automatically within the snakemake workflow.\n\n# Input Files\n\n### **Method 1**\u003csup\u003e†\u003c/sup\u003e: RNA reads: \u003cbr /\u003e\n- RNA paired end reads (R1 \u0026 R2 fastq files)\n\n### **Method 2**\u003csup\u003e§\u003c/sup\u003e: RNA alignment: \u003cbr /\u003e\n- RNA alignment (bam file)\n- Expression Matrix \n    - Expression in RPKM/TPM\n    - Gene names must be in HGNC format\n    - Columns are \"Gene\" and the sample names\n\n\n### **Optional Inputs:**\n- Phased VCF\n    - Can be obtained using [WhatsHap](https://github.com/whatshap/whatshap/) with DNA long reads\n    - Significantly improves precision of ASE calling\n    - Adds TFBS mutation and stop gain/loss information \n- Copy Number Variant Data\n    - Can be obtained using [ploidetect](https://github.com/lculibrk/Ploidetect)\n- Allelic Methylation\n    - Can be obtained using [NanoMethPhase](https://github.com/vahidAK/NanoMethPhase)\n- Somatic mutations\n    - Finds somatic mutations in ASE genes and their promoters\n- Tumor Content\n    - Used to calculate the expected major allele frequency \n    - Assumes 1.0 if not specified\n- Tissue type\n    - Includes data for average MAF in normal tissue in the summary table\n    - Obtained from the GTEx database, which ran [phASER](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02122-z) to calculate allelic expression \n\n# Running Workflow\n\n### **Edit the config files**\n\n#### **Example parameters.yaml:** \u003cbr /\u003e\nConfig file that specifies the parameters and paths needed for the workflow. 
The main parameters to include are the genome name, the path to the expression matrix, the major allele frequency threshold and the number of threads, as well as the settings for using a phased vcf and performing cancer analysis.\n```\n# genome_name should match bams\ngenome_name: hg38/hg19/hg38_no_alt_TCGA_HTMCP_HPVs\n\n# RPKM matrix of the samples\nmatrix: /path/to/expression/matrix.tsv\n\n# Major allele frequency threshold for ASE (0.5 - 0.75)\nmaf_threshold: 0.65\n\n# Threads for STAR, RSEM, Strelka and MBASED\nthreads: 72\n\n# Use phased vcf (True or False)\n# Uses pseudo-phasing algorithm if False\nphased: True\n\n# Perform cancer analysis \n# Intersect with optional input\ncancer_analysis: True\n\n\n# Paths for annotation\nannotationPath:\n    snpEff_config:\n        /path/to/snpEff/config\n    snpEff_datadir:\n        /path/to/snpEff/binaries/data\n    snpEff_genomeName:\n        GRCh38.100\n    snpEff_javaHeap:\n        64g\n\n# Paths for references\n# Only needed if RNA reads are provided instead of an RNA bam\nstarReferencePath:\n    /path/to/star/ref\nrsemReferencePath:\n    /path/to/rsem/ref\n```\n#### **Example samples.yaml:** \u003cbr /\u003e\nMain config file to specify input files. For input method 1 using R1 and R2 fastq files, use the `R1` and `R2` tags. For input method 2 using an RNA bam file, use the `rna` tag. 
All other tags are optional.\n\n```\nsamples:\n    # Sample Name must match expression matrix\n    sampleName_1: # Method 1\n        R1:\n            /path/to/RNA/R1.fq\n        R2:\n            /path/to/RNA/R2.fq\n        somatic_snv:\n            /path/to/somatic/snv.vcf\n        somatic_indel:\n            /path/to/somatic/indel.vcf\n        tissueType:\n            Lung\n    sampleName_2: # Method 2\n        rna:\n            /path/to/RNA/alignment.bam\n        phase:\n            /path/to/phase.vcf.gz\n        cnv:\n            /path/to/cnv/data\n        methyl:\n            /path/to/methyl/data.bed\n        tumorContent:\n            0.80\n```\n\n\n#### **Example defaults.yaml:** \u003cbr /\u003e\nConfig file that specifies the paths to the reference genome, annotation bed file and centromere bed file. The annotation and centromere bed files for hg38 are included in the repository.\n\n```\ngenome:\n    hg19:\n        /path/to/hg19/ref.fa\n    hg38:\n        /path/to/hg38/ref.fa\n    hg38_no_alt_TCGA_HTMCP_HPVs:\n        /path/to/hg38_no_alt_TCGA_HTMCP_HPVs/ref.fa\n\nannotation:\n    hg19:\n        /path/to/hg19/annotation.bed\n    hg38:\n        annotation/biomart_ensembl100_GRCh38.sorted.bed\n    hg38_no_alt_TCGA_HTMCP_HPVs:\n        annotation/biomart_ensembl100_GRCh38.sorted.bed\n\ncentromere:\n    hg19:\n        /path/to/hg19/centromere.bed\n    hg38:\n        annotation/hg38_centromere_positions.bed\n    hg38_no_alt_TCGA_HTMCP_HPVs:\n        annotation/hg38_centromere_positions.bed\n```\n\n\n\n### **Run snakemake**\nThis is the command to run the workflow with singularity. The `-c` parameter specifies the maximum number of cores (threads) to use. The `-B` option, passed through `--singularity-args`, specifies the paths to bind into the container. \n\n```\nsnakemake -c 30 --use-singularity --singularity-args \"-B /projects,/home,/gsc\"\n```\n# Output Files\nAll output and intermediary files are found in the `output/{sample}` directory. 
The workflow has four main sections: alignment, variant calling, mbased and cancer analysis. Their outputs can be found in the corresponding directories. The key outputs from the workflow are listed below.\n\n1. MBASED-related outputs (found in `output/{sample}/mbased`)\n    - The tabular results of the output `MBASED_expr_gene_results.txt`\n    - The rds object of the MBASED raw output `MBASEDresults.rds`\n2. Summary table of all outputs\n    - Found in `output/{sample}/summaryTable.tsv`\n    - Data of all phased genes with ASE information along with potential causes based on optional inputs\n3. Figures \n    - Found in `output/{sample}/figures`\n    - Example figures shown below\n\n\n## **Summary Table Description**\n| Column               | Description                                                                            | \n| :---                 |    :----:                                                                              |  \n| gene                 | HGNC gene symbol                                                                       | \n| Expression           | Expression level                                                                       | \n| allele1IsMajor       | T/F if allele 1 is the major allele (allele 1 = HP1)                                   | \n| majorAlleleFrequency | Major allele frequency                                                                 | \n| padj                 | Benjamini-Hochberg adjusted p-value                                                    | \n| aseResults           | ASE result based on MAF threshold (and p-value)                                        | \n| cnv.A\u003csup\u003e1\u003c/sup\u003e               | Copy Number for allele 1                                                               |\n| cnv.B\u003csup\u003e1\u003c/sup\u003e              | Copy Number for allele 2                                                               |\n| expectedMAF\u003csup\u003e1\u003c/sup\u003e         | 
Expected Major Allele Frequency based on CNV                                           |\n| cnv_state\u003csup\u003e1\u003c/sup\u003e           | Allelic CNV state (Loss of Heterozygosity, Allelic balance/imbalance)                  |\n| methyl_state\u003csup\u003e2\u003c/sup\u003e       | Methylation difference in promoter region (Allele 1 - Allele 2) |\n| tf_allele\u003csup\u003e3\u003c/sup\u003e         | Allele where there is gain of transcription factor binding site                        |\n| transcriptionFactor\u003csup\u003e3\u003c/sup\u003e | Transcription factor for the gained TFBS                                               |\n| stop_variant_allele\u003csup\u003e3\u003c/sup\u003e | Allele where stop gain/stop loss variant is found                                      |\n| somaticSNV\u003csup\u003e4\u003c/sup\u003e        | Somatic SNV found in (or around) gene (T/F)                                            |\n| somaticIndel\u003csup\u003e4\u003c/sup\u003e      | Somatic Indel found in (or around) gene (T/F)                                          |\n| normalMAF\u003csup\u003e5\u003c/sup\u003e        | MAF for gene in normal tissue                                                          |\n| cancer_gene          | T/F if gene is a known cancer gene (based on `annotation/cancer_gene.txt`)             |\n| sample               | Sample Name                                                                            |\n\nColumns only included if the corresponding optional input is provided:\n\n\u003csup\u003e1\u003c/sup\u003e Copy number variant\n\u003csup\u003e2\u003c/sup\u003e Allelic methylation \n\u003csup\u003e3\u003c/sup\u003e Phased vcf \n\u003csup\u003e4\u003c/sup\u003e Somatic SNV and Indel\n\u003csup\u003e5\u003c/sup\u003e Tissue type\n\n# Example Figures\n\nSeveral figures are automatically generated based on the optional inputs. They can be found in `output/{sample}/figures`. 
The main figure is `karyogram.pdf`, which shows co-localization of ASE genes with allelic methylation and somatic copy number alterations. Example figures can be found [here](res/exampleFigure.md). \n\n\n# Contributors\nThe contributors of this project are\nGlenn Chang, Vannessa Porter, and Kieran O'Neill.\n\n\u003ca href=\"https://github.com/bcgsc/IMPALA/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=bcgsc/IMPALA\u0026max=1000\" /\u003e\n\u003c/a\u003e\n\n# License\n\n`IMPALA` is licensed under the terms of the [GNU GPL v3](LICENSE).\n\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcgsc%2Fimpala","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbcgsc%2Fimpala","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcgsc%2Fimpala/lists"}