{"id":39778825,"url":"https://github.com/everestial/phase-stitcher","last_synced_at":"2026-01-18T12:01:29.304Z","repository":{"id":57452240,"uuid":"86477010","full_name":"everestial/phase-stitcher","owner":"everestial","description":"a python program to stitch the ReadBack phased haplotypes in F1 hybrids.","archived":false,"fork":false,"pushed_at":"2022-03-23T11:59:30.000Z","size":345,"stargazers_count":6,"open_issues_count":1,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-08-08T17:42:37.914Z","etag":null,"topics":["f1-hybrids","genome-phasing","haplotype-extension","haplotypes","phase-haplotypes","phased-genotypes","phasing","population"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/everestial.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-28T15:32:23.000Z","updated_at":"2023-03-08T14:00:43.000Z","dependencies_parsed_at":"2022-08-30T01:10:33.406Z","dependency_job_id":null,"html_url":"https://github.com/everestial/phase-stitcher","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/everestial/phase-stitcher","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/everestial%2Fphase-stitcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/everestial%2Fphase-stitcher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/everestial%2Fphase-stitcher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/everestial%2Fphase-stitcher/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/everestial","download_url":"https://codeload.github.com/everestial/phase-stitcher/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/everestial%2Fphase-stitcher/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28535271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T10:13:46.436Z","status":"ssl_error","status_checked_at":"2026-01-18T10:13:11.045Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["f1-hybrids","genome-phasing","haplotype-extension","haplotypes","phase-haplotypes","phased-genotypes","phasing","population"],"created_at":"2026-01-18T12:00:42.126Z","updated_at":"2026-01-18T12:01:29.278Z","avatar_url":"https://github.com/everestial.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PhaseStitcher\n\n***A python program to segregate and stitch the ReadBackPhased genotypes in F1 hybrids to prepare\na genome wide haplotype using first order markov chain and transition probabilities.\\\nThis tool can be used as a companion tool along with\n[`phase-Extender`](https://github.com/everestial/phase-Extender) or as a standalone tool.***\n\nDeveloped by [Bishwa K. 
Giri](mailto:kirannbishwa01@gmail.com) in\nthe [Remington Lab](https://biology.uncg.edu/people/david-remington/) at the\nUniversity of North Carolina at Greensboro, Biology department.\n\n- [PhaseStitcher](#phasestitcher)\n  - [Citation](#citation)\n  - [AUTHOR/SUPPORT](#authorsupport)\n  - [Intro to ReadBackPhasing](#intro-to-readbackphasing)\n  - [BACKGROUND](#background)\n  - [Data Requirements](#data-requirements)\n  - [Algorithm](#algorithm)\n  - [Tutorial](#tutorial)\n    - [Prerequisites](#prerequisites)\n    - [Installation from pypi:](#installation-from-pypi)\n    - [Installation  and setup from source (Optional)](#installation--and-setup-from-source-optional)\n  - [Usage](#usage)\n  - [Sample example](#sample-example)\n  - [Output Files](#output-files)\n    - [*f1Sample*_haplotype_long.txt](#f1sample_haplotype_longtxt)\n    - [*f1Sample*_haplotype_wide.txt](#f1sample_haplotype_widetxt)\n    - [*f1Sample*_haplotype_stats.txt](#f1sample_haplotype_statstxt)\n  - [Some Q/A on phase-stitcher](#some-qa-on-phase-stitcher)\n\n## Citation\n\nGiri, B. K., Remington D. L. Haplotype phase extension and preparation of\ndiploid genome using phase-Extender and phase-Stitcher. biorxiv (2018) [not uploaded yet].\n\n## AUTHOR/SUPPORT\n\nBishwa K. 
Giri (bkgiri@uncg.edu; kirannbishwa01@gmail.com) \\\nSupport @ \u003chttps://groups.google.com/d/forum/phase-extender\u003e\n\n## Intro to ReadBackPhasing\n\n**Check these links for details on readbackphasing**\n\n- \u003chttps://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php\u003e\n- \u003chttps://github.com/secastel/phaser/tree/master/phaser\u003e\n\n## BACKGROUND\n\nHaplotype phasing is the second \"go to\" problem in bioinformatics after read alignment.\nHaplotype phasing matters directly for the analysis of ASE (allele specific expression),\nthe preparation of extended haplotypes for the EHH (extended haplotype homozygosity) test,\nand the preparation of a diploid genome, which will soon be a new standard in bioinformatics.\nThe necessity for haplotype phasing (and eventually a diploid genome) increases with the heterozygosity\nof the genome, because higher heterogeneity leads to larger alignment bias and reduces the reliability of\nthe variants that are called using that alignment data (SAM, BAM files).\n\nGene expression quantification using sequence reads has become a dominant and standard part of functional\nanalyses of putative genetic loci.\nQuantification of gene expression as a test of phenotypic differences comes in two flavors:\n\n- DE (differential expression) \u003e gene expression differences quantified between two individuals or\n  groups categorized by population, treatments, space or time.\n- ASE (allele specific expression) \u003e gene expression differences quantified between the two alleles of\n  the same locus in a diploid individual, categorized mainly by haplotypes but may be further\n  grouped by population, treatments, space or time.\n\nQuantification of RNAseq reads using a reference genome based approach mainly involves a haploid genome.\nASE (allele specific expression) quantification using alignment of RNAseq reads 
from F1 hybrids or outbred individuals\non haploid reference genomes is more susceptible to biased ASE observation for the following reasons:\n\n- alignment to a reference genome will likely trigger higher mapping of the reads from the\n  population closer to the reference genome.\n- using an allele masking based approach on a haploid reference to address alignment bias is again more likely to attract\n  paralogous alignment in the masked region, creating further biases.\n\nSeveral new approaches have been created for better estimation of ASE. The most effective approach to\nASE, however, involves **1)** preparation of the phased genome of the hybrid or outbred individual,\n**2)** then preparation of a diploid genome and/or transcriptome and then **3)** competitive alignment of the reads\non the diploid genome using a conserved strategy.\nThe first step, haplotype phasing, has mostly concentrated on fixing the phase state in\nhumans, who have highly homogeneous genomes, in inbred lines of mice and in other model systems that have lots of\nhaplotype reference panels available.\nExisting phasing methods for F1 hybrids rely on a `mom, dad, child` trio, which is not optimal when parental information\nis missing, or in natural hybrids where parental identification is not possible.\nAlso, existing trio methods are mainly genotype based, considering the allele at a single position at a time.\n\nASE (allele specific expression) analyses, which aim to identify cis regulatory factors and mechanisms underlying gene expression\ndifferences, are heading toward more genomic and RNAseq based approaches.\nThe full resolution of ASE therefore relies on the quality of the phased diploid genome.\n\n**`phase-Stitcher`** is designed to utilize RBphase data with population based haplotype phasing\nto solve phase states in F1s. 
The approach is to take RBphased haplotype blocks of a single F1 hybrid individual\nand several haplotypes from the two different parental backgrounds of the F1, then segregate the haplotype of the F1\nby computing the likelihood of each haplotype against the two parental backgrounds. **The advantages of using `phase-Stitcher` are\nexplained below:**\n\n- With the increase in the size of sequence reads (mainly PE, i.e. paired end reads) we are able to generate larger\n  RBphased fragments. These fragments are again considerably larger when a heterogeneous population is sequenced.\n  F1 hybrids of these heterogeneous populations have even larger RBphased fragments.\n  Thus, haplotype phasing using RBphased data with population based likelihood estimates provides a better approach\n  to solving phase state.\n- This tool doesn't require exact `maternal, paternal` genotype data to solve phase state in F1.\n  Rather, phasing can be approached more loosely by supplying genotype data from the `maternal vs. paternal` backgrounds.\n\n## Data Requirements\n\n**phASE-Stitcher** can be used with the multi-sample VCF files produced by the GATK pipeline or other tools that generate\nreadbackphased haplotype blocks in the output VCF.\nA HAPLOTYPE file is created using the RBphased VCF and then piped into **phase-Stitcher**.\nUse [VCF-Simplify](https://github.com/everestial/VCF-simplify) to prepare the HAPLOTYPE file from a multisample VCF.\nSee this example for the data structure of the input haplotype file:\n[sample input haplotype file01](https://github.com/everestial/pHASE-Stitcher/blob/master/example_01/haplotype_file01.txt)\n\n- a tab separated text file with `PI` and `PG_al` values for each sample.\n\n## Algorithm\n\nFor the **mcve** regarding the algorithm see this issue on [**stackoverflow**]() and/or [**my blog**]().\n\n## Tutorial\n\n### Prerequisites\n\n**phASE-Stitcher** is written in python3, so you need to have python3 installed on your system to run this code locally. 
If you don't have python installed, you can install it from [here](https://www.python.org/downloads/). On Linux, you can get the latest python3 with:\n\n`sudo apt-get install python3`\n\n### Installation from pypi:\n\nPhaseStitcher is hosted on pypi, so you can install it using pip:\n```bash\n$ pip install phase-stitcher\n\n# After installation is complete you can run the help function to see its parameters.\n\n$ phase-stitcher -h\n```\nNow you can jump to usage and replace `python3 phase_stitcher.py` with `phase-stitcher`.\n\n### Installation  and setup from source (Optional)\n\n1. Clone this repo.\n\n``` bash\ngit clone https://github.com/everestial/phASE-Stitcher\ncd phASE-Stitcher\n```\n\n2. Make a virtual env for python and install the requirements.\n\n``` bash\npython3 -m venv .env\nsource .env/bin/activate   # for linux\n.env\\Scripts\\activate      # for windows\npip install -r requirements.txt\n```\n\nOr, you can install the latest versions individually:\n\n``` bash\npip install pandas numpy matplotlib\n\n```\n\n3. To run tests locally:\n\n  ``` bash\n    pip install pytest\n    pytest .\n   ```\n\n## Usage\n\n```\n$ phase-stitcher --help\nusage: phase-stitcher [-h] [--nt NT] --input INPUT --pat PAT --mat MAT --f1Sample F1SAMPLE [--outPatMatID OUTPATMATID] [--output OUTPUT]\n                      [--lods LODS] [--culLH CULLH] [--chr CHR] [--hapStats HAPSTATS]\n\noptions:\n  -h, --help            show this help message and exit\n  --nt NT               number of process to run -\u003e The maximum number of processes that can be run at once is the number of different chromosomes\n                        (contigs) in the input haplotype file.\n  --input INPUT         name of the input haplotype file -\u003e This haplotype file should contain unique index represented by 'PI' and phased genotype\n                        represented by 'PG_al' for all the samples.\n  --pat PAT             Paternal sample or comma separated sample names that belong to Paternal background. 
Sample group may also be assigned using\n                        prefix. Options: 'paternal sample name', 'comma separated samples', 'pre:...'. Unique prefix (or comma separated prefixes)\n                        should begin with 'pre:'.\n  --mat MAT             Maternal sample or sample names (comma separated) that belong to maternal background. Sample group can also be assigned\n                        using unique prefix/es. Options: 'maternal sample name', 'comma separated samples', 'pre:...'. Unique prefix (or comma\n                        separated prefixes) should begin with 'pre:'.\n  --f1Sample F1SAMPLE   Name of the F1-hybrid sample. Please type the name of only one F1 sample.\n  --outPatMatID OUTPATMATID\n                        Prefix of the 'Paternal (dad)' and 'Maternal (mom)'genotype in the output file. This should be a maximum of three letter\n                        prefix separated by comma. Default: 'pat,mat'.\n  --output OUTPUT       Name of the output directory. Default: f1SampleName + '_stitched'\n  --lods LODS           log(2) odds cutoff threshold required to assign maternal Vs. paternal haplotype segregation and stitching.\n  --culLH CULLH         Cumulative likelhood estimates -\u003e The likelhoods for haplotype segregation can either be max-sum vs. max-product. Default:\n                        maxPd i.e max-product. Options: 'maxPd' or 'maxSum'.\n  --chr CHR             Restrict haplotype stitching to a specific chromosome.\n  --hapStats HAPSTATS   Computes the descriptive statistics of final haplotype. Default: 'no'.Option: 'yes', 'no' .\n\n```\n\n\u003eNOTE Input haplotype file should contain `PI` and `PG_al` values for each sample.\n\n  Requires a readbackphased `haplotype file` as input and returns segregated and stitched haplotype file in both wide\n  and long format. 
Descriptive statistics of the final haplotype can also be produced if desired.\n\nCheck this detailed [step by step tutorial](https://github.com/everestial/pHASE-Stitcher/wiki) for preparation\nof `input files` and know-how about running `phase-Stitcher`.\n\n## Sample example \n```\n$ phase-stitcher --nt 1 --input tests/inputs/haplotype_file01.txt --mat MA605 --pat Sp21 --f1Sample ms02g --culLH maxSum --lods 3 --hapStats yes\n\n  - using haplotype file \"tests/inputs/haplotype_file01.txt\" \n  - F1-hybrid of interest: \"ms02g\" \n  - using \"1\" processes \n  - using log2 odds cut off of \"3\" \n  - using \"max sum\" to estimate the cumulative maximum likelyhood while segregating the diploid haplotype block into maternal vs. paternal haplotype \n  - statistics of the haplotype before and after extension will be prepared for the sample of interest i.e \"ms02g\" \n#######################################################################\n        Welcome to phase-Stitcher version 1.2       \n  Author: kiran N' bishwa (bkgiri@uncg.edu, kirannbishwa01@gmail.com) \n#######################################################################\n\n\n##########################\n - Worker maximum memory usage: 85008.00 (mb)\n\nCompleted haplotype segregation and stitching for all the chromosomes.\nTime elapsed: '0.341528' sec. 
 \n - Global maximum memory usage: 87792.00 (mb)\nCompleted writing the dataframes .....\nThe End :)\n```\n\nFor exploratory analysis of the statistics generated by this run, see this [notebook](EDA%20on%20hapstats%20data%20generated%20from%20phase%20stitcher.ipynb).\n\n## Output Files\n\n### *f1Sample*_haplotype_long.txt\n\nFinal haplotype for **f1Sample** of interest after phase segregation in **long format**.\n\n- **CHROM** - Contig name (or number).\n- **POS** - Start position of haplotype (1 based).\n- **REF** - Reference allele at that site.\n- **all-alleles** - All the alleles represented by all the samples in the input file at that site.\n- **_f1Sample_:PI** - Unique `PI` index of the haplotype blocks for the sample of interest.\n- **_f1Sample_:PG_al** - Phased GT (genotype) alleles at the genomic position that belong to unique `PI` indexes.\n- **log2Odds** - log2 of the odds computed between the left vs. right haplotype against the observed haplotypes in\npaternal vs. maternal samples.\n- **_pat_ _hap** - Haplotype that belongs to the paternal background based on the **_lods_** cutoff.\n- **_mat_ _hap** - Haplotype that belongs to the maternal background based on the **_lods_** cutoff.\n\n### *f1Sample*_haplotype_wide.txt\n\n- Final haplotype for **f1Sample** of interest after phase segregation in **wide format**.\n- All the headers are the same as in the **_long format_** file except **_POS_Range_**.\n\n### *f1Sample*_haplotype_stats.txt\n\nDescriptive haplotype statistics of the input haplotype file for the sample of interest. These statistics\ncan be used to compute the distribution of several values (lods, number of variants, etc.) 
between phased\nand unphased haplotype blocks and whether they were assigned to the final genome wide haplotype.\n\n- **CHROM** - Contig name (or number).\n- **phasedBlock** - Blocks that were phased to genome wide haplotype based on **_lods_** cutoff.\n- **unphasedBlock** - Blocks that were not phased to genome wide haplotype based on **_lods_** cutoff.\n- **numVarsInPhasedBlock** - Number of variants in each **_phasedBlock_**.\n- **numVarsInUnPhasedBlock** - Number of variants in each **_unphasedBlock_**.\n- **log2oddsInPhasedBlock** - Calculated **log2Odds** in each **_phasedBlock_**.\n- **log2oddsInUnPhasedBlock** - Calculated **log2Odds** in each **_unphasedBlock_**.\n- **totalNumOfBlock** - Total number of RBphased blocks in the given **_f1Sample_**.\n- **totalNumOfVars** - Total number of variants in the given **_f1Sample_**.\n\n**Note:** - The **block index, i.e. PI,** in **_phasedBlock_** and in **_unphasedBlock_**,\nand its associated statistics are in order.\n\n## Some Q/A on phase-stitcher\n\nThe conjoined **Q/A** for **_phase stitcher_** is covered under **Q/A** for\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feverestial%2Fphase-stitcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feverestial%2Fphase-stitcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feverestial%2Fphase-stitcher/lists"}