{"id":13592515,"url":"https://github.com/freeseek/mocha","last_synced_at":"2026-01-31T16:16:30.916Z","repository":{"id":41267700,"uuid":"50928932","full_name":"freeseek/mocha","owner":"freeseek","description":"MOsaic CHromosomal Alterations (MoChA) caller","archived":false,"fork":false,"pushed_at":"2025-08-20T02:45:18.000Z","size":2222,"stargazers_count":87,"open_issues_count":7,"forks_count":23,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-08-20T04:43:48.171Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/freeseek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-02-02T15:10:23.000Z","updated_at":"2025-08-20T02:45:22.000Z","dependencies_parsed_at":"2023-01-30T04:30:55.404Z","dependency_job_id":"addd4cd5-07c1-4109-abc5-23cc064243f7","html_url":"https://github.com/freeseek/mocha","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/freeseek/mocha","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/freeseek%2Fmocha","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/freeseek%2Fmocha/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/freeseek%2Fmocha/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/freeseek%2Fmocha/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/freeseek","download_url":"https://codeload.github.com/freeseek/mo
cha/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/freeseek%2Fmocha/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28947573,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-31T14:26:55.697Z","status":"ssl_error","status_checked_at":"2026-01-31T14:26:52.545Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:01:10.248Z","updated_at":"2026-01-31T16:16:30.909Z","avatar_url":"https://github.com/freeseek.png","language":"C","funding_links":[],"categories":["Genetics","CNV"],"sub_categories":["Accelerometer"],"readme":"![](mocha_logo.png)\n\nA BCFtools extension to call mosaic chromosomal alterations starting from phased VCF files with either B Allele Frequency (BAF) and Log R Ratio (LRR) or allelic depth (AD). If you use this tool in your publication, please cite the following papers from [2018](http://doi.org/10.1038/s41586-018-0321-x) and [2020](http://doi.org/10.1038/s41586-020-2430-6)\n```\nLoh P., Genovese G., McCarroll S., Price A. et al. Insights about clonal expansions from 8,342 mosaic\nchromosomal alterations. Nature 559, 350–355 (2018). 
[PMID: 29995854] [DOI: 10.1038/s41586-018-0321-x]\n\nLoh P., Genovese G., McCarroll S., Monogenic and polygenic inheritance become\ninstruments for clonal selection (2020). [PMID: 32581363] [DOI: 10.1038/s41586-020-2430-6]\n```\nand this website. For any feedback or questions, contact the [author](mailto:giulio.genovese@gmail.com)\n\nWARNING: MoChA will not yield useful results for VCFs from whole exome sequencing data as MoChA does not model the reference allele bias in these assays. Furthermore, whole exome sequencing does not include enough heterozygous sites to allow for detection of mosaic chromosomal alterations at low cell fractions. Similarly, low coverage whole genome sequencing will not provide a sufficient sampling of molecules to detect mosaic chromosomal alterations at low cell fractions, even in the unlikely ideal scenario of most heterozygous sites correctly genotyped and phased\n\n\u003c!--ts--\u003e\n   * [Usage](#usage)\n   * [Installation](#installation)\n   * [Download resources for GRCh37](#download-resources-for-grch37)\n   * [Download resources for GRCh38](#download-resources-for-grch38)\n   * [Prepare data](#prepare-data)\n   * [Phase genotypes](#phase-genotypes)\n   * [Call chromosomal alterations](#call-chromosomal-alterations)\n   * [Filter callset](#filter-callset)\n   * [Generate mosaic phenotypes](#generate-mosaic-phenotypes)\n   * [Compute allelic shift](#compute-allelic-shift)\n   * [Plot results](#plot-results)\n   * [HMM parameters](#hmm-parameters)\n   * [Acknowledgements](#acknowledgements)\n\u003c!--te--\u003e\n\nUsage\n=====\n\nA set of [WDL](http://github.com/freeseek/mochawdl) pipelines are available to run the entire MoChA pipeline from raw intensity files to final calls and imputed VCFs\n\n```\nUsage:   bcftools +mocha [OPTIONS] \u003cin.vcf.gz\u003e\n\nRequired options:\n    -g, --genome \u003cassembly\u003e[?]      predefined genome reference rules, 'list' to print available settings, append '?' 
for details\n    -G, --genome-file \u003cfile\u003e        genome reference rules, space/tab-delimited CHROM:FROM-TO,TYPE\n\nGeneral Options:\n    -v, --variants [^]\u003cfile\u003e        tabix-indexed [compressed] VCF/BCF file containing variants\n    -f, --apply-filters \u003clist\u003e      require at least one of the listed FILTER strings (e.g. \"PASS,.\")\n                                    to include (or exclude with \"^\" prefix) in the analysis\n    -e, --exclude \u003cexpr\u003e            exclude sites for which the expression is true\n    -i, --include \u003cexpr\u003e            select sites for which the expression is true\n    -r, --regions \u003cregion\u003e          restrict to comma-separated list of regions\n    -R, --regions-file \u003cfile\u003e       restrict to regions listed in a file\n        --regions-overlap 0|1|2     Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]\n    -t, --targets [^]\u003cregion\u003e       restrict to comma-separated list of regions. Exclude regions with \"^\" prefix\n    -T, --targets-file [^]\u003cfile\u003e    restrict to regions listed in a file. 
Exclude regions with \"^\" prefix\n        --targets-overlap 0|1|2     Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]\n    -s, --samples [^]\u003clist\u003e         comma separated list of samples to include (or exclude with \"^\" prefix)\n    -S, --samples-file [^]\u003cfile\u003e    file of samples to include (or exclude with \"^\" prefix)\n        --force-samples             only warn about unknown subset samples\n        --input-stats \u003cfile\u003e        input samples genome-wide statistics file\n        --only-stats                compute genome-wide statistics without detecting mosaic chromosomal alterations\n    -p  --cnp \u003cfile\u003e                list of regions to genotype in BED format (4th column DEL, DUP, or CNV)\n        --fra \u003cfile\u003e                list of commonly deleted regions in BED format (4th column phred-scaled likelihood)\n        --mhc \u003cregion\u003e              MHC region to exclude from analysis (will be retained in the output)\n        --kir \u003cregion\u003e              KIR region to exclude from analysis (will be retained in the output)\n        --threads \u003cint\u003e             number of extra output compression threads [0]\n\nOutput Options:\n    -o, --output \u003cfile\u003e             write output to a file [no output]\n    -O, --output-type u|b|v|z[0-9]  u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]\n        --no-version                do not append version and command line to the header\n    -a  --no-annotations            omit Ldev and Bdev FORMAT from output VCF (requires --output)\n        --no-log                    suppress progress report on standard error\n    -l  --log \u003cfile\u003e                write log to file [standard error]\n    -c, --calls \u003cfile\u003e              write chromosomal alterations calls table to a file [standard output]\n    -z  --stats \u003cfile\u003e              write samples genome-wide statistics 
table to a file [no output]\n    -u, --ucsc-bed \u003cfile\u003e           write UCSC bed track to a file [no output]\n    -W, --write-index[=FMT]         Automatically index the output files [off]\n\nHMM Options:\n        --bdev-LRR-BAF \u003clist\u003e       comma separated list of inverse BAF deviations for LRR+BAF model [-2.0,-4.0,-6.0,10.0,6.0,4.0]\n        --bdev-BAF-phase \u003clist\u003e     comma separated list of inverse BAF deviations for BAF+phase model\n                                    [6.0,8.0,10.0,15.0,20.0,30.0,50.0,80.0,130.0,210.0,340.0,550.0]\n        --min-dist \u003cint\u003e            minimum base pair distance between consecutive sites for WGS data [400]\n        --adjust-BAF-LRR \u003cint\u003e      minimum number of genotypes for a cluster to median adjust BAF and LRR (-1 for no adjustment) [5]\n        --regress-BAF-LRR \u003cint\u003e     minimum number of genotypes for a cluster to regress BAF against LRR (-1 for no regression) [15]\n        --LRR-GC-order \u003cint\u003e        order of polynomial to regress LRR against local GC content (-1 for no regression) [2]\n        --xy-major-pl               major transition phred-scaled likelihood [65.0]\n        --xy-minor-pl               minor transition phred-scaled likelihood [35.0]\n        --auto-tel-pl               autosomal telomeres phred-scaled likelihood [20.0]\n        --chrX-tel-pl               chromosome X telomeres phred-scaled likelihood [8.0]\n        --chrY-tel-pl               chromosome Y telomeres phred-scaled likelihood [6.0]\n        --error-pl                  uniform error phred-scaled likelihood [15.0]\n        --flip-pl                   phase flip phred-scaled likelihood [20.0]\n        --short-arm-chrs \u003clist\u003e     list of chromosomes with short arms [13,14,15,21,22,chr13,chr14,chr15,chr21,chr22]\n        --use-short-arms            use variants in short arms [FALSE]\n        --use-centromeres           use variants in centromeres [FALSE]\n        
--use-males-xtr             use variants in XTR region for males [FALSE]\n        --use-males-par2            use variants in PAR2 region for males [FALSE]\n        --use-no-rules-chrs         use chromosomes without centromere rules  [FALSE]\n        --LRR-weight \u003cfloat\u003e        relative contribution from LRR for LRR+BAF  model [0.2]\n        --LRR-hap2dip \u003cfloat\u003e       difference between LRR for haploid and diploid [0.45]\n        --LRR-cutoff \u003cfloat\u003e        cutoff between LRR for haploid and diploid used to infer gender [estimated from X nonPAR]\n\nExamples:\n    bcftools +mocha -g GRCh37 -v ^exclude.bcf -p cnps.bed -c calls.tsv -z stats.tsv input.bcf\n    bcftools +mocha -g GRCh38 -o output.bcf -Ob -c calls.tsv -z stats.tsv --LRR-weight 0.5 input.bcf\n```\n\nInstallation\n============\n\nInstall basic tools (Debian/Ubuntu specific if you have admin privileges, see here for FreeBSD)\n```\nsudo apt install wget unzip git g++ zlib1g-dev samtools bedtools bcftools\n```\n\nOptionally, you can install these libraries to activate further HTSlib features\n```\nsudo apt install libbz2-dev libssl-dev liblzma-dev libgsl0-dev\n```\n\nPreparation steps\n```\nmkdir -p $HOME/bin $HOME/GRCh3{7,8} \u0026\u0026 cd /tmp\n```\n\nWe recommend compiling the source code but, wherever this is not possible, Linux x86_64 pre-compiled binaries are available for download [here](http://software.broadinstitute.org/software/mocha). 
However, notice that you will require BCFtools version 1.20 or newer\n\nDownload latest version of [HTSlib](http://github.com/samtools/htslib) and [BCFtools](http://github.com/samtools/bcftools) (if not downloaded already)\n```\nwget http://github.com/samtools/bcftools/releases/download/1.20/bcftools-1.20.tar.bz2\ntar xjvf bcftools-1.20.tar.bz2\n```\n\nDownload and compile plugins code (make sure you are using gcc version 5 or newer)\n```\ncd bcftools-1.20/\n/bin/rm -f plugins/{{mocha,beta_binom,genome_rules}.h,{mocha,mochatools,extendFMT}.c}\nwget -P plugins http://raw.githubusercontent.com/freeseek/mocha/master/{{mocha,beta_binom,genome_rules}.h,{mocha,mochatools,extendFMT}.c}\nmake\n/bin/cp bcftools plugins/{fill-tags,fixploidy,mocha,mochatools,extendFMT}.so $HOME/bin/\n```\n\nMake sure the directory with the plugins is available to BCFtools\n```\nexport PATH=\"$HOME/bin:$PATH\"\nexport BCFTOOLS_PLUGINS=\"$HOME/bin\"\n```\n\nInstall IMPUTE5 from [here](http://www.dropbox.com/sh/mwnceyhir8yze2j/AADbzP6QuAFPrj0Z9_I1RSmla?dl=0) and Beagle5 (optional for array data)\n```\nwget -O impute5_v1.2.0.zip http://www.dropbox.com/sh/mwnceyhir8yze2j/AABKBCgZsQqz8TlZGo7yXwx6a/impute5_v1.2.0.zip?dl=0\nunzip -ojd $HOME/bin impute5_v1.2.0.zip impute5_v1.2.0/{impute5_v1.2.0,xcftools}_static\nchmod a+x $HOME/bin/{impute5_v1.2.0,xcftools}_static\nln -s impute5_v1.2.0_static $HOME/bin/impute5\nsudo apt install beagle\n```\n\nDownload resources for GRCh37\n=============================\n\nYou can find the required GRCh37 resources [here](http://software.broadinstitute.org/software/mocha) or you can generate them as follows\n\nHuman genome reference\n```\nwget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | \\\n  gzip -d \u003e $HOME/GRCh37/human_g1k_v37.fasta\nsamtools faidx $HOME/GRCh37/human_g1k_v37.fasta\n```\n\nGenetic map\n```\nwget -P $HOME/GRCh37 
http://data.broadinstitute.org/alkesgroup/Eagle/downloads/tables/genetic_map_hg19_withX.txt.gz\n```\n\n1000 Genomes project low coverage phase 3\n```\ncd $HOME/GRCh37\nwget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr{{1..22}.phase3_shapeit2_mvncall_integrated_v5b,X.phase3_shapeit2_mvncall_integrated_v1c,Y.phase3_integrated_v2b}.20130502.genotypes.vcf.gz{,.tbi}\nfor chr in {1..22} X Y; do\n  bcftools view --no-version -Ou -c 2 ALL.chr${chr}.phase3*integrated_v[125][bc].20130502.genotypes.vcf.gz | \\\n  bcftools annotate --no-version -Ou -x ID,QUAL,FILTER,^INFO/AC,^INFO/AN,INFO/END,^FMT/GT | \\\n  bcftools norm --no-version -Ou -m -any | \\\n  bcftools norm --no-version -Ou -d none -f $HOME/GRCh37/human_g1k_v37.fasta | \\\n  bcftools sort -o ALL.chr${chr}.phase3_integrated.20130502.genotypes.bcf -Ob -T ./bcftools. --write-index\ndone\n```\n\nSites only VCF\n```\nbcftools concat --no-version -Ou ALL.chr{{1..22},X}.phase3_integrated.20130502.genotypes.bcf | \\\n  bcftools view --no-version -G -Ob -o ALL.phase3_integrated.20130502.sites.bcf --write-index\n```\n\nList of common germline duplications and deletions\n```\nwget -P $HOME/GRCh37 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz{,.tbi}\nbcftools query -i 'AC\u003e1 \u0026\u0026 END-POS+1\u003e10000 \u0026\u0026 SVTYPE!=\"INDEL\" \u0026\u0026 (SVTYPE==\"CNV\" || SVTYPE==\"DEL\" || SVTYPE==\"DUP\")' \\\n  -f \"%CHROM\\t%POS0\\t%END\\t%SVTYPE\\n\" $HOME/GRCh37/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz \u003e $HOME/GRCh37/cnps.bed\n```\n\nMinimal divergence intervals from segmental duplications (make sure your bedtools version is 2.27 or newer)\n```\nwget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/genomicSuperDups.txt.gz | gzip -d |\n  awk '!($2==\"chrX\" \u0026\u0026 $8==\"chrY\" || $2==\"chrY\" \u0026\u0026 $8==\"chrX\") {print $2\"\\t\"$3\"\\t\"$4\"\\t\"$30}' \u003e genomicSuperDups.bed\n\nawk '{print 
$1,$2; print $1,$3}' genomicSuperDups.bed | \\\n  sort -k1,1 -k2,2n | uniq | \\\n  awk 'chrom==$1 {print chrom\"\\t\"pos\"\\t\"$2} {chrom=$1; pos=$2}' | \\\n  bedtools intersect -a genomicSuperDups.bed -b - | \\\n  bedtools sort | \\\n  bedtools groupby -c 4 -o min | \\\n  awk 'BEGIN {i=0; s[0]=\"+\"; s[1]=\"-\"} {if ($4!=x) i=(i+1)%2; x=$4; print $0\"\\t0\\t\"s[i]}' | \\\n  bedtools merge -s -c 4 -o distinct | \\\n  sed 's/^chr//' | grep -v gl | bgzip \u003e $HOME/GRCh37/segdups.bed.gz \u0026\u0026 \\\n  tabix -f -p bed $HOME/GRCh37/segdups.bed.gz\n```\n\n1000 Genomes project low coverage phase 3 imputation panel for IMPUTE5\n```\ncd $HOME/GRCh37\npfx=\"ALL.chr\"\nsfx=\".phase3_integrated.20130502.genotypes\"\nfor chr in {{1..22},X}; do xcftools view --input $pfx$chr$sfx.bcf --region $chr --maf .03125 --output $pfx$chr$sfx.xcf.bcf --format sh; done\nfor chr in {1..22}; do bcftools view --no-version $pfx$chr$sfx.bcf | bref3 \u003e $pfx$chr$sfx.bref3; done\nchr=X; bcftools +fixploidy --no-version $pfx$chr$sfx.bcf | \\\n  sed 's/0\\/0/0|0/g;s/1\\/1/1|1/g' | bref3 \u003e $pfx$chr$sfx.bref3\n```\n\nDownload cytoband file\n```\nwget -P $HOME/GRCh37 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz\n```\n\nSetup variables\n```\nref=\"$HOME/GRCh37/human_g1k_v37.fasta\"\nmhc_reg=\"6:27486711-33448264\"\nkir_reg=\"19:54574747-55504099\"\nmap=\"$HOME/GRCh37/genetic_map_hg19_withX.txt.gz\"\npanel_pfx=\"$HOME/GRCh37/ALL.chr\"\npanel_sfx=\".phase3_integrated.20130502.genotypes\"\nassembly=\"GRCh37\"\ncnp=\"$HOME/GRCh37/cnps.bed\"\ndup=\"$HOME/GRCh37/segdups.bed.gz\"\ncyto=\"$HOME/GRCh37/cytoBand.txt.gz\"\n```\n\nDownload resources for GRCh38\n=============================\n\nYou can find the required GRCh38 resources [here](http://software.broadinstitute.org/software/mocha) or you can generate them as follows\n\nHuman genome reference (following the suggestion from [Heng Li](http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use))\n```\nwget 
-O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \\\n  gzip -d \u003e $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna\nsamtools faidx $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna\n```\n\nGenetic map\n```\nwget -P $HOME/GRCh38 http://data.broadinstitute.org/alkesgroup/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz\n```\n\n1000 Genomes project high coverage\n```\ncd $HOME/GRCh38\nwget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr{{1..22}.filtered.SNV_INDEL_SV_phased_panel,X.filtered.SNV_INDEL_SV_phased_panel.v2}.vcf.gz\nfor chr in {1..22} X; do\n  if [ $chr == \"X\" ]; then sfx=\".v2\"; else sfx=\"\"; fi\n  bcftools view --no-version -Ou -c 2 1kGP_high_coverage_Illumina.chr$chr.filtered.SNV_INDEL_SV_phased_panel$sfx.vcf.gz | \\\n  bcftools annotate --no-version -Ou -x ID,QUAL,FILTER,^INFO/AC,^INFO/AN,^INFO/END,^FMT/GT | \\\n  bcftools sort -o 1kGP_high_coverage_Illumina.chr$chr.bcf -Ob -T ./bcftools. 
--write-index\ndone\n```\n\nSites only VCF\n```\nbcftools concat --no-version -Ou 1kGP_high_coverage_Illumina.chr{{1..22},X}.bcf | \\\n  bcftools view --no-version -G -Ob -o 1kGP_high_coverage_Illumina.sites.bcf --write-index\n```\n\nList of common germline duplications and deletions\n```\nfor chr in {1..22} X; do\n  if [ $chr == \"X\" ]; then sfx=\".v2\"; else sfx=\"\"; fi\n  bcftools query -i 'AC\u003e1 \u0026\u0026 END-POS+1\u003e10000 \u0026\u0026 (SVTYPE==\"CNV\" || SVTYPE==\"DEL\" || SVTYPE==\"DUP\")' \\\n  -f \"%CHROM\\t%POS0\\t%END\\t%SVTYPE\\n\" $HOME/GRCh38/1kGP_high_coverage_Illumina.chr$chr.filtered.SNV_INDEL_SV_phased_panel$sfx.vcf.gz\ndone \u003e $HOME/GRCh38/cnps.bed\n```\n\nList of commonly deleted regions\n```\nwget https://www.medrxiv.org/content/medrxiv/early/2025/07/30/2025.07.30.25332451/DC2/embed/media-2.xlsx\nxlsx2csv -d tab -s3 media-2.xlsx | tail -n+3 | head -n4 | awk -F\"\\t\" -v OFS=\"\\t\" '{print $1,$2*1e6,$3*1e6,80}' \u003e fras.bed\n```\n\nMinimal divergence intervals from segmental duplications (make sure your bedtools version is 2.27 or newer)\n```\nwget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz | gzip -d |\n  awk '!($2==\"chrX\" \u0026\u0026 $8==\"chrY\" || $2==\"chrY\" \u0026\u0026 $8==\"chrX\") {print $2\"\\t\"$3\"\\t\"$4\"\\t\"$30}' \u003e genomicSuperDups.bed\n\nawk '{print $1,$2; print $1,$3}' genomicSuperDups.bed | \\\n  sort -k1,1 -k2,2n | uniq | \\\n  awk 'chrom==$1 {print chrom\"\\t\"pos\"\\t\"$2} {chrom=$1; pos=$2}' | \\\n  bedtools intersect -a genomicSuperDups.bed -b - | \\\n  bedtools sort | \\\n  bedtools groupby -c 4 -o min | \\\n  awk 'BEGIN {i=0; s[0]=\"+\"; s[1]=\"-\"} {if ($4!=x) i=(i+1)%2; x=$4; print $0\"\\t0\\t\"s[i]}' | \\\n  bedtools merge -s -c 4 -o distinct | \\\n  grep -v \"GL\\|KI\" | bgzip \u003e $HOME/GRCh38/segdups.bed.gz \u0026\u0026 \\\n  tabix -f -p bed $HOME/GRCh38/segdups.bed.gz\n```\n\n1000 Genomes project high coverage imputation panel for 
IMPUTE5\n```\ncd $HOME/GRCh38\npfx=\"1kGP_high_coverage_Illumina.\"\nsfx=\"\"\nfor chr in chr{{1..22},X}; do xcftools view --input $pfx$chr$sfx.bcf --region $chr --maf .03125 --output $pfx$chr$sfx.xcf.bcf --format sh; done\nfor chr in chr{1..22}; do bcftools view --no-version $pfx$chr$sfx.bcf | bref3 \u003e $pfx$chr$sfx.bref3; done\nchr=chrX; bcftools +fixploidy --no-version $pfx$chr$sfx.bcf | \\\n  sed 's/0\\/0/0|0/g;s/1\\/1/1|1/g' | bref3 \u003e $pfx$chr$sfx.bref3\n```\n\nDownload cytoband file\n```\nwget -P $HOME/GRCh38 http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz\n```\n\nSetup variables\n```\nref=\"$HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna\"\nmhc_reg=\"chr6:27518932-33480487\"\nkir_reg=\"chr19:54071493-54992731\"\nmap=\"$HOME/GRCh38/genetic_map_hg38_withX.txt.gz\"\npanel_pfx=\"$HOME/GRCh38/1kGP_high_coverage_Illumina.chr\"\npanel_sfx=\"\"\nassembly=\"GRCh38\"\ncnp=\"$HOME/GRCh38/cnps.bed\"\ndup=\"$HOME/GRCh38/segdups.bed.gz\"\ncyto=\"$HOME/GRCh38/cytoBand.txt.gz\"\n```\n\nPrepare data\n============\n\nPreparation steps\n```\nvcf=\"...\" # input VCF file with phased GT, LRR, and BAF\npfx=\"...\" # output prefix\nthr=\"...\" # number of threads to use\ncrt=\"...\" # tab delimited file with call rate information (first column sample ID, second column call rate)\nsex=\"...\" # tab delimited file with computed gender information (first column sample ID, second column gender: 1=male; 2=female)\nxcl=\"...\" # VCF file with additional list of variants to exclude (optional)\nped=\"...\" # pedigree file to use if parent child duos are present\ndir=\"...\" # directory where output files will be generated\nmkdir -p $dir\n```\n\nIf you want to process \u003cb\u003egenotype array\u003c/b\u003e data you need a VCF file with ALLELE_A, ALLELE_B, GC, GT, BAF, and LRR information\n```\n##fileformat=VCFv4.2\n##INFO=\u003cID=ALLELE_A,Number=1,Type=Integer,Description=\"A 
allele\"\u003e\n##INFO=\u003cID=ALLELE_B,Number=1,Type=Integer,Description=\"B allele\"\u003e\n##INFO=\u003cID=GC,Number=1,Type=Float,Description=\"GC ratio content around the variant\"\u003e\n##FORMAT=\u003cID=GT,Number=1,Type=String,Description=\"Genotype\"\u003e\n##FORMAT=\u003cID=BAF,Number=1,Type=Float,Description=\"B Allele Frequency\"\u003e\n##FORMAT=\u003cID=LRR,Number=1,Type=Float,Description=\"Log R Ratio\"\u003e\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n1\t752566\trs3094315\tG\tA\t.\t.\tALLELE_A=1;ALLELE_B=0;GC=0.3675\tGT:BAF:LRR\t1|1:0.0111:-0.0798\n1\t776546\trs12124819\tA\tG\t.\t.\tALLELE_A=0;ALLELE_B=1;GC=0.435\tGT:BAF:LRR\t0|1:0.5441:0.4959\n1\t798959\trs11240777\tG\tA\t.\t.\tALLELE_A=1;ALLELE_B=0;GC=0.4075\tGT:BAF:LRR\t0|0:0.9712:0.2276\n1\t932457\trs1891910\tG\tA\t.\t.\tALLELE_A=1;ALLELE_B=0;GC=0.6425\tGT:BAF:LRR\t1|0:0.5460:-0.1653\n```\nMake sure that BAF refers to the allele frequency of the reference allele if ALLELE_B=0 and of the alternate allele if ALLELE_B=1\n\nIf you do not already have a VCF file, but you have Illumina or Affymetrix genotype array data, you can use the [gtc2vcf](http://github.com/freeseek/gtc2vcf) tools to convert the data to VCF and you can use the mochatools plugin to fill the ALLELE_A/ALLELE_B/GC info fields. We discourage the use of Illumina [GTCtoVCF](http://github.com/Illumina/GTCtoVCF) or [Array Analysis CLI](http://support.illumina.com/array/array_software/ima-array-analysis-cli.html) tools to generate a compliant VCF. 
Alternatively, you can use your own scripts\n\nCreate a minimal binary VCF\n```\nbcftools annotate --no-version -o $dir/$pfx.unphased.bcf -Ob --write-index $vcf \\\n  -x ID,QUAL,^INFO/ALLELE_A,^INFO/ALLELE_B,^INFO/AC,^INFO/GC,^FMT/GT,^FMT/BAF,^FMT/LRR\n```\n\nIf you want to process \u003cb\u003ewhole-genome sequence\u003c/b\u003e data you need a VCF file with GC, GT and AD information\n```\n##fileformat=VCFv4.2\n##INFO=\u003cID=AC,Number=A,Type=Integer,Description=\"ALT allele count\"\u003e\n##INFO=\u003cID=GC,Number=1,Type=Float,Description=\"GC ratio content around the variant\"\u003e\n##FORMAT=\u003cID=GT,Number=1,Type=String,Description=\"Genotype\"\u003e\n##FORMAT=\u003cID=AD,Number=R,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\"\u003e\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n1\t752566\trs3094315\tG\tA\t.\t.\tAC=2;GC=0.3675\tGT:AD\t1|1:0,31\n1\t776546\trs12124819\tA\tG\t.\t.\tAC=1;GC=0.435\tGT:AD\t0|1:21,23\n1\t798959\trs11240777\tG\tA\t.\t.\tAC=0;GC=0.4075\tGT:AD\t0|0:31,0\n1\t932457\trs1891910\tG\tA\t.\t.\tAC=1;GC=0.6425\tGT:AD\t1|0:18,14\n```\nMake sure that AD is a \"Number=R\" format field (this was introduced in version [4.2](http://samtools.github.io/hts-specs/VCFv4.2.pdf) of the VCF specification) or multi-allelic variants will not [split properly](http://github.com/samtools/bcftools/issues/360). If your VCF does not include the GC field, this can be added with the command\n```\nbcftools +mochatools --no-version -Ob $vcf -- -t GC -f $ref\n```\n\nCreate a minimal binary VCF (notice that you will need BCFtools version 1.11 or newer, which implements the [--keep-sum](http://github.com/samtools/bcftools/issues/360) option)\n```\nbcftools view --no-version -h $vcf | sed 's/^\\(##FORMAT=\u003cID=AD,Number=\\)\\./\\1R/' | \\\n  bcftools reheader -h /dev/stdin $vcf | \\\n  bcftools filter --no-version -Ou -e \"FMT/DP\u003c10 | FMT/GQ\u003c20\" --set-GT . 
| \\\n  bcftools annotate --no-version -Ou -x ID,QUAL,^INFO/GC,^FMT/GT,^FMT/AD | \\\n  bcftools norm --no-version -Ou -m -any --keep-sum AD | \\\n  bcftools norm --no-version -o $dir/$pfx.unphased.bcf -Ob -f $ref --write-index\n```\nThis will set to missing all genotypes that have low coverage or low genotyping quality, as these can cause issues\n\nGenerate a list of variants that will be excluded from modeling by both Eagle and MoChA (notice that you will need BCFtools version 1.11 or newer, which implements the F_MISSING option, or else you should drop that filter)\n```\nawk -F\"\\t\" '$2\u003c.97 {print $1}' $crt \u003e samples_xcl_list.txt; \\\necho '##INFO=\u003cID=JK,Number=1,Type=Float,Description=\"Jukes Cantor\"\u003e' | \\\n  bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $dir/$pfx.unphased.bcf | \\\n  bcftools view --no-version -Ou -S ^samples_xcl_list.txt | \\\n  bcftools +fill-tags --no-version -Ou -t ^Y,MT,chrY,chrM -- -t ExcHet,F_MISSING | \\\n  bcftools view --no-version -Ou -G | \\\n  bcftools annotate --no-version -o $dir/$pfx.xcl.bcf -Ob --write-index \\\n    -i 'FILTER!=\".\" \u0026\u0026 FILTER!=\"PASS\" || INFO/JK\u003c.02 || INFO/ExcHet\u003c1e-6 || INFO/F_MISSING\u003e1-.97' \\\n    -x ^INFO/JK,^INFO/ExcHet,^INFO/F_MISSING\n/bin/rm samples_xcl_list.txt\n```\nThis command will create a list of variants falling within segmental duplications with low divergence (\u003c2%), variants with high levels of missingness (\u003e3%), and variants with excess heterozygosity (p\u003c1e-6). If you are using WGS data and you don't have a file with sex information, you can skip the quality control line using this information. 
When later running MoChA, sex will be imputed and a sex file can be computed from MoChA's output\n\nIf a file with additional variants to be excluded is available, further merge it with the generated list\n```\n/bin/mv $dir/$pfx.xcl.bcf $dir/$pfx.xcl.tmp.bcf \u0026\u0026 \\\n/bin/mv $dir/$pfx.xcl.bcf.csi $dir/$pfx.xcl.tmp.bcf.csi \u0026\u0026 \\\nbcftools merge --no-version -o $dir/$pfx.xcl.bcf -Ob -m none --write-index $dir/$pfx.xcl.tmp.bcf $xcl\n```\n\nPhase genotypes\n===============\n\nExtract genotypes and split by autosomes and chromosome X\n```\nbcftools isec --no-version -Ou --complement --exclude \"N_ALT\u003e1\" --write 1 $dir/$pfx.unphased.bcf $dir/$pfx.xcl.bcf | \\\n  bcftools view --no-version -Ou --min-ac 0 --exclude-uncalled | \\\n  bcftools annotate --no-version -Ou --remove ID,QUAL,^INFO/AC,^FMT/GT | \\\n  bcftools +scatter --no-version -Ob --output $dir --scatter $(echo chr{{1..22},X} | tr ' ' ',') --prefix $pfx.\n```\nIf you are using GRCh37 rather than GRCh38, use `--scatter $(echo {{1..22},X} | tr ' ' ',') --prefix $pfx.chr` instead\n\nPhase VCF file by chromosome with SHAPEIT5\n```\nfor chr in {1..22} X; do\n  bcftools index --force $dir/$pfx.chr$chr.bcf\n  zcat $map | sed 's/^23/X/' | awk -v chr=$chr '$1==chr {print $2,$3,$4}' \u003e $dir/genetic_map.chr$chr.txt\n  phase_common \\\n    --thread $thr \\\n    --input $dir/$pfx.chr$chr.bcf \\\n    --reference $panel_pfx${chr}$panel_sfx.bcf \\\n    --map $dir/genetic_map.chr$chr.txt \\\n    --region chr$chr \\\n    --output $dir/$pfx.chr$chr.pgt.bcf\ndone\n```\nIf you are using GRCh37 rather than GRCh38, use `--region $chr` instead. If pedigree information with duos or trios is available, you can improve the phased haplotypes with option `--pedigree`. 
If you are phasing genotypes from WGS data, you will also have to use the [phase_rare](http://odelaneau.github.io/shapeit5/docs/documentation/phase_rare) command\n\nEagle's [memory requirements](http://data.broadinstitute.org/alkesgroup/Eagle/#x1-100003.2) will depend on the number of samples in the target (Nt) and in the reference panel (Nr=2504), and the number of variants (M) in the largest contig, and will amount to 1.5(Nt+Nr)M bytes. The [running time](http://data.broadinstitute.org/alkesgroup/Eagle/#x1-110003.3) will be \\~1 minute of CPU time per genome for [reference-based phasing](http://www.nature.com/articles/ng.3679#Sec18) with a small target and reference panel (see Supplementary Tables 2,3) and \\~5 minutes of CPU time per genome for [non-reference-based phasing](http://www.nature.com/articles/ng.3679#Sec18) with a large cohort (see Supplementary Tables 7,8). Also, by default, if the option --pbwtIters is not used, Eagle will perform one phasing iteration if Nt\u003cNr/2=1252, two if 1252=Nr/2\u003cNt\u003c2Nr=5008, and three if 5008=2Nr\u003cNt; in the second and third iterations, both target and reference panel haplotypes will be used as references for phasing (see [here](http://www.nature.com/articles/ng.3679#Sec10))\n\nNotice that you can also use alternative phasing methods that might be more effective, such as using [HRC](http://www.haplotype-reference-consortium.org/) (use the Sanger Imputation Service, as the Michigan Imputation Server does not work with binary VCFs, does not work with VCFs with multiple chromosomes, does not work with chromosome X, and has no option for phasing without imputation). This might provide better phasing and therefore better ability to detect large events at lower cell fractions. 
Notice also that phasing can be performed across overlapping windows rather than entire chromosomes to achieve better parallelization. The phased windows can then be ligated together using either `bcftools concat --ligate` or `ligate` from SHAPEIT5 (but make sure you avoid version 5.1.1 or older, as the corresponding ligation step is broken and was only [fixed](http://github.com/odelaneau/shapeit5/commit/f942990d9906edb0f8a5bf10d34db3af7c707a02) in later versions)\n\nConcatenate phased output into a single VCF file\n```\nbcftools concat --no-version -o $dir/$pfx.pgt.bcf -Ob --write-index $dir/$pfx.chr{{1..22},X}.pgt.bcf\n```\nNotice that if the phasing was performed in overlapping windows rather than chromosomes, the overlapping windows should be concatenated using the `--ligate` option in bcftools concat\n\nImport phased genotypes into the original VCF without changing missing genotypes\n```\nbcftools annotate --no-version -o $dir/$pfx.bcf -Ob --annotations $dir/$pfx.pgt.bcf --columns -FMT/GT --write-index $dir/$pfx.unphased.bcf\n```\n\nImpute variants using impute5 (optional for array data)\n```\nfor chr in {1..22} X; do\n  zcat $map | sed 's/^23/X/' | awk -v chr=$chr '$1==chr {print chr,\".\",$4,$2}' \u003e $dir/genetic_map.chr$chr.txt\n  impute5 \\\n    --h $panel_pfx$chr$panel_sfx.bcf \\\n    --m $dir/genetic_map.chr$chr.txt \\\n    --g $dir/$pfx.pgt.bcf \\\n    --r chr$chr \\\n    --buffer-region chr$chr \\\n    --o $dir/$pfx.chr$chr.imp.bcf \\\n    --l $dir/$pfx.chr$chr.log \\\n    --threads $thr\ndone\n```\nIf you are using GRCh37 rather than GRCh38, use `--r $chr` instead\n\nConcatenate imputed genotypes into a single VCF file (optional for array data)\n```\nbcftools concat --no-version -o $dir/$pfx.imp.bcf -Ob --write-index $dir/$pfx.chr{{1..22},X}.imp.bcf\n```\n\nRemove unphased VCF and single chromosome files (optional)\n```\n/bin/rm $dir/{genetic_map.chr{{1..22},X}.txt,$pfx.{unphased.bcf{,.csi},chr{{1..22},X}.{bcf{,.csi},{pgt,imp}.bcf,log}}}\n```\n\nCall chromosomal 
alterations\n============================\n\nPreparation steps\n```\npfx=\"...\" # output prefix\ntsv=\"...\" # file with sample statistics (sample_id, computed_gender, call_rate)\nlst=\"...\" # file with list of samples to analyze for asymmetries (e.g. samples with 1p CN-LOH)\ncnp=\"...\" # file with list of regions to genotype in BED format\nmhc_reg=\"...\" # MHC region to skip\nkir_reg=\"...\" # KIR region to skip\n```\n\nCall mosaic chromosomal alterations with MoChA\n```\nbcftools +mocha \\\n  --genome $assembly \\\n  --input-stats $tsv \\\n  --no-version \\\n  --output $dir/$pfx.as.bcf \\\n  --output-type b \\\n  --variants ^$dir/$pfx.xcl.bcf \\\n  --calls $dir/$pfx.calls.tsv \\\n  --stats $dir/$pfx.stats.tsv \\\n  --ucsc-bed $dir/$pfx.ucsc.bed \\\n  --write-index \\\n  --cnp $cnp \\\n  --mhc $mhc_reg \\\n  --kir $kir_reg \\\n  $dir/$pfx.bcf\n```\nNotice that MoChA will read the computed gender and call rate from the file passed with the `--input-stats` option if provided; otherwise these will be estimated from the VCF. MoChA requires a balanced ratio of males and females to correctly infer gender, so if this is not the case, we advise providing the gender with the `--input-stats` option. This option requires a file with columns `sample_id` and `computed_gender`, with the latter being encoded as `M` for male, `F` for female, and `U` for unknown (optionally `K` for Klinefelter). For array data these statistics are usually available from the output of Illumina\\'s GenCall or Affymetrix\\'s Axiom genotyping algorithms. 
MoChA should not be run on single chromosome VCFs as median statistics across the autosomes are used to calibrate the likelihoods\n\nThe genome statistics file contains information for each sample analyzed in the VCF and it includes the following columns\n```\n            sample_id - sample ID\n      computed_gender - estimated sample gender from X nonPAR region (not heterozygous sites count)\n            call_rate - estimated genotype calling rate\n           XXX_median - median LRR or sequencing coverage across autosomes\n               XXX_sd - standard deviation for LRR or sequencing coverage\n             XXX_auto - auto correlation across consecutive sites for LRR or sequencing coverage (after GC correction)\n          baf_sd/_cor - BAF standard deviation or beta-binomial overdispersion for read counts\n             baf_conc - BAF phase concordance across phased heterozygous sites (see Vattathil et al. 2012)\n             baf_auto - phased BAF auto correlation across consecutive phased heterozygous sites\n              n_sites - number of sites across the genome for model based on LRR and BAF\n               n_hets - number of heterozygous sites across the genome for model based on BAF and genotype phase\n      x_nonpar_n_hets - number of heterozygous sites in the X nonPAR region\n          par1_n_hets - number of heterozygous sites in the PAR1 region\n           xtr_n_hets - number of heterozygous sites in the XTR region\n          par2_n_hets - number of heterozygous sites in the PAR2 region\nx_nonpar_baf_sd/_corr - BAF standard deviation or beta-binomial overdispersion for read counts in the X nonPAR region\n  x_nonpar_XXX_median - median LRR or sequencing coverage over the X nonPAR region\n  y_nonpar_XXX_median - median LRR or sequencing coverage over the Y nonPAR region\n        mt_XXX_median - median LRR or sequencing coverage over the mitochondrial genome\n       lrr_gc_rel_ess - LRR or sequencing coverage explained sum of squares fraction using local 
GC content\n             lrr_gc_X - coefficient X for polynomial in GC content fitting LRR estimates\n```\n\nThe mosaic calls file contains information about each mosaic and germline chromosomal alteration called and it includes the following columns\n```\n      sample_id - sample ID\ncomputed_gender - inferred sample gender\n          chrom - chromosome\n     beg_XXXXXX - beginning base pair position for the call (according to XXXXXX genome reference)\n     end_XXXXXX - end base pair position for the call (according to XXXXXX genome reference)\n         length - base pair length of the call\n          p_arm - whether or not the call extends to the small arm (Y/N) and whether it reaches the telomere (T) or just the centromere (C)\n          q_arm - whether or not the call extends to the long arm (Y/N) and whether it reaches the telomere (T) or just the centromere (C)\n        n_sites - number of sites used for the call\n         n_hets - number of heterozygous sites used for the call\n       n50_hets - N50 value for consecutive heterozygous sites distances\n           bdev - BAF deviation estimate from 0.5\n        bdev_se - standard deviation estimate for BAF deviation\n        rel_cov - relative coverage estimate from LRR or sequencing coverage\n     rel_cov_se - standard deviation estimate for relative coverage\n    lod_lrr_baf - LOD score for model based on LRR and BAF\n  lod_baf_phase - LOD score for model based on BAF and genotype phase\n        n_flips - number of phase flips for calls based on BAF and genotype phase model (-1 if LRR and BAF model used)\n       baf_conc - BAF phase concordance across phased heterozygous sites underlying the call (see Vattathil et al. 
2012)\n   lod_baf_conc - LOD score for model based on BAF phase concordance (genome-wide corrected)\n           type - Type of call based on LRR / relative coverage\n             cf - estimated cell fraction based on bdev and type, or rel_cov and type if either bdev or bdev_se are missing\n```\nNotice that the cell fraction is computed as either `2 bdev` for CN-LOH events or using the formulas `| 1/cn - 1/2 | = bdev` with `cn` the copy number and `cf = | 2 - cn |` for gains and losses. If the type of event cannot be determined, the cell fraction will be estimated as `4 bdev` if `bdev \u003c 0.05`; otherwise it will not be estimated. The `rel_cov` statistic is estimated as `2 x 2 ^ (LRR / LRR-hap2dip)` with `LRR-hap2dip = 0.45` by default. Notice also that the two most common events, mosaic loss of Y (mLOY or LOY) and mosaic loss of X (mLOX or LOX), are always called as X chromosome events, as mLOY is identified through an imbalance in the PAR1 region that is by default mapped to the X chromosome. Furthermore, LRR measurements on the X chromosome are typically very noisy and therefore inference of event type, whether gain, loss, or CN-LOH, for X chromosome events at low cell fraction should not be relied upon. For events inferred to be mLOY and mLOX events, we advise estimating the cell fraction using the formula for losses: `cf = 4 bdev / (1 + 2 bdev)`\n\nThe output VCF will contain the following extra FORMAT fields\n```\nLdev - LRR deviation estimate\nBdev - BAF deviation estimate\n  AS - Allelic shift for heterozygous calls: 1/-1 if the alternate allele is over/under represented\n```\n\nFor array data, MoChA's memory requirements will depend on the number of samples (N) and the number of variants (M) in the largest contig and will amount to at most 18NM bytes. For example, if you are running 4,000 samples and chromosome 1 has \\~80K variants, you will need approximately 2-3GB to run MoChA. 
It will take \\~1/3 second of CPU time per genome to process samples genotyped on the Illumina GSA DNA microarray. For whole genome sequence data, MoChA's memory requirements will depend on the number of samples (N), the `--min-dist` parameter (D, 400 by default) and the length of the longest contig (L) and will amount to at most 18NL/D, but could be significantly less, depending on how many variants you have in the VCF. If you are running 1,000 samples with default parameter `--min-dist 400` and chromosome 1 is \\~250Mbp long, you might need up to 5-6GB to run MoChA. For whole genome sequence data there is no need to batch too many samples together, as batching will not affect the calls made by MoChA (it will for array data unless you use options `--adjust-BAF-LRR -1` and `--regress-BAF-LRR -1`). Notice that the CPU requirements for MoChA will be negligible compared to the CPU requirements for phasing with Eagle\n\nFilter callset\n==============\n\nDepending on your application, you might want to filter the calls from MoChA. 
For example, the following code\n```\nawk -F \"\\t\" 'NR==FNR \u0026\u0026 FNR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR==FNR \u0026\u0026 FNR\u003e1 \u0026\u0026 ($(f[\"call_rate\"])\u003c.97 || $(f[\"baf_auto\"])\u003e.03) {xcl[$(f[\"sample_id\"])]++}\n  NR\u003eFNR \u0026\u0026 FNR==1 {for (i=1; i\u003c=NF; i++) g[$i] = i; print}\n  NR\u003eFNR \u0026\u0026 FNR\u003e1 {gender=$(g[\"computed_gender\"]); len=$(g[\"length\"]); bdev=$(g[\"bdev\"]);\n  rel_cov=$(g[\"rel_cov\"]); lod_baf_phase=$(g[\"lod_baf_phase\"]); lod_baf_conc=$(g[\"lod_baf_conc\"])}\n  NR\u003eFNR \u0026\u0026 FNR\u003e1 \u0026\u0026 !($(g[\"sample_id\"]) in xcl) \u0026\u0026 $(g[\"type\"])!~\"^CNP\" \u0026\u0026\n    ( $(g[\"chrom\"])~\"X\" \u0026\u0026 gender==\"M\" || bdev\u003c0.1 || $(g[\"n50_hets\"])\u003c2e5 || lod_baf_conc!=\"nan\" \u0026\u0026 lod_baf_conc\u003e10.0 ) \u0026\u0026\n    ( $(g[\"bdev_se\"])!=\"nan\" || lod_baf_phase!=\"nan\" \u0026\u0026 lod_baf_phase\u003e10.0 ) \u0026\u0026\n    ( rel_cov\u003c2.1 || bdev\u003c0.05 || len\u003e5e5 \u0026\u0026 bdev\u003c0.1 \u0026\u0026 rel_cov\u003c2.5 || len\u003e5e6 \u0026\u0026 bdev\u003c0.15 )' \\\n  $pfx.stats.tsv $pfx.calls.tsv \u003e $pfx.calls.filtered.tsv\nawk 'NR==FNR {x[$1\"_\"$3\"_\"$4\"_\"$5]++} NR\u003eFNR \u0026\u0026 ($0~\"^track\" || $4\"_\"$1\"_\"$2\"_\"$3 in x)' \\\n  $pfx.calls.filtered.tsv $pfx.ucsc.bed \u003e $pfx.ucsc.filtered.bed\n```\nwill generate a new table after removing samples with `call_rate` lower than 0.97 or `baf_auto` greater than 0.03, removing calls made by the LRR and BAF model unless their `lod_baf_phase` score for the model based on BAF and genotype phase is greater than 10, removing calls flagged as germline copy number polymorphisms (CNPs), and removing calls that are likely germline duplications, similarly to how it was done in the [UK biobank](http://doi.org/10.1038/s41586-018-0321-x). 
Notice that different filtering thresholds are used for calls smaller than 500kbp and smaller than 5Mbp, reflecting different priors that these could be germline events. Most calls on chromosome X in male samples likely represent mosaic loss-of-Y events, as only the PAR1 region is analyzed in male samples\n\nGenerate mosaic phenotypes\n==========================\n\nFor additional downstream analyses, we can generate phenotypes to analyze. Generate a list of samples to exclude from association analyses\n```\nawk -F \"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 ($(f[\"call_rate\"])\u003c.97 || $(f[\"baf_auto\"])\u003e.03) {print $(f[\"sample_id\"])}' $pfx.stats.tsv \u003e $pfx.remove.lines\n```\n\nGenerate list of samples with mosaic loss of chromosome Y (mLOY or LOY)\n```\nawk -F \"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i} NR\u003e1 \u0026\u0026 $(f[\"computed_gender\"])==\"M\" \u0026\u0026 $(f[\"chrom\"])~\"X\" \u0026\u0026\n  $(f[\"length\"])\u003e2e6 \u0026\u0026 $(f[\"rel_cov\"])\u003c2.5 {print $(f[\"sample_id\"])}' $pfx.calls.tsv \u003e $pfx.Y_loss.lines\n```\nRequiring `rel_cov\u003c2.5` should filter out XXY and XYY samples. This should generate an mLOY set similarly to how it was done in the [UK biobank](http://doi.org/10.1038/s41598-020-59963-8). 
Notice that this inference strategy is based on BAF imbalances over the PAR1 region, which allows detection of loss-of-Y at much lower cell fractions than by using LRR statistics over the Y nonPAR region\n\nGenerate list of samples with mosaic loss of chromosome X (mLOX or LOX)\n```\nawk -F \"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 ($(f[\"computed_gender\"])==\"F\" || $(f[\"computed_gender\"])==\"K\") \u0026\u0026 $(f[\"chrom\"])~\"X\" \u0026\u0026\n  $(f[\"length\"])\u003e1e8 \u0026\u0026 $(f[\"rel_cov\"])\u003c2.5 {print $(f[\"sample_id\"])}' $pfx.calls.tsv \u003e $pfx.X_loss.lines\nawk -F \"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 ($(f[\"computed_gender\"])==\"F\" || $(f[\"computed_gender\"])==\"K\") \u0026\u0026 $(f[\"chrom\"])~\"X\" \u0026\u0026\n  $(f[\"length\"])\u003e1e8 \u0026\u0026 $(f[\"bdev\"])\u003e.01 \u0026\u0026 $(f[\"rel_cov\"])\u003c2.5 {print $(f[\"sample_id\"])}' $pfx.calls.tsv \u003e $pfx.X_loss_high.lines\nawk -F \"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 ($(f[\"computed_gender\"])==\"F\" || $(f[\"computed_gender\"])==\"K\") \u0026\u0026 $(f[\"chrom\"])~\"X\" \u0026\u0026\n  $(f[\"length\"])\u003e1e8 \u0026\u0026 $(f[\"bdev\"])\u003c=.01 \u0026\u0026 $(f[\"rel_cov\"])\u003c2.5 {print $(f[\"sample_id\"])}' $pfx.calls.tsv \u003e $pfx.X_loss_low.lines\n```\nRequiring `rel_cov\u003c2.5` should filter out XXY and XXX samples. Notice that we do not require that the event be identified as a loss by MoChA. MoChA determines the event type based on LRR median statistics and, as we have observed that the LRR on chromosome X is quite noisy, determining the type of event based on LRR is not reliable for low cell fraction chromosome X calls. 
Notice that this inference strategy is based on BAF imbalances over whole chromosome X which allows detection of loss-of-X at much lower cell fractions than by using LRR statistics over the X nonPAR region\n\nGenerate list of non-germline events\n```\nawk -F \"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i; print}\n  NR\u003e1 {len=$(f[\"length\"]); bdev=$(f[\"bdev\"]); rel_cov=$(f[\"rel_cov\"])}\n  NR\u003e1 \u0026\u0026 $(f[\"type\"])!~\"^CNP\" \u0026\u0026\n  ( $(f[\"chrom\"])~\"X\" \u0026\u0026 $(f[\"computed_gender\"])==\"M\" || bdev\u003c0.1 || $(f[\"n50_hets\"])\u003c2e5 ) \u0026\u0026\n  ( $(f[\"bdev_se\"])!=\"nan\" || $(f[\"lod_baf_phase\"])!=\"nan\" \u0026\u0026 $(f[\"lod_baf_phase\"]) \u003e 10.0 ) \u0026\u0026\n  ( rel_cov\u003c2.1 || bdev\u003c0.05 || len\u003e5e5 \u0026\u0026 bdev\u003c0.1 \u0026\u0026 rel_cov\u003c2.5 || len\u003e5e6 \u0026\u0026 bdev\u003c0.15 )' \\\n  $pfx.calls.tsv \u003e $pfx.mca.calls.tsv\nawk -F\"\\t\" -v OFS=\"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 $(f[\"chrom\"])!=\"X\" \u0026\u0026 $(f[\"chrom\"])!=\"chrX\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n  x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y*y\u003cx*x) print $(f[\"sample_id\"])}' \\\n  $pfx.mca.calls.tsv \u003e $pfx.cnloh.lines\nawk -F\"\\t\" -v OFS=\"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 $(f[\"chrom\"])!=\"X\" \u0026\u0026 $(f[\"chrom\"])!=\"chrX\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n  x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y\u003ex) print $(f[\"sample_id\"])}' \\\n  $pfx.mca.calls.tsv \u003e $pfx.loss.lines\nawk -F\"\\t\" -v OFS=\"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n  NR\u003e1 \u0026\u0026 $(f[\"chrom\"])!=\"X\" \u0026\u0026 $(f[\"chrom\"])!=\"chrX\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n  x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y\u003c-x) print $(f[\"sample_id\"])}' \\\n  $pfx.mca.calls.tsv \u003e 
$pfx.gain.lines\n```\n\nGenerate list of samples with mosaic autosomal alterations\n```\nawk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR \u0026\u0026 $3!=\"X\" \u0026\u0026 $3!=\"chrX\" {x[$1]++} NR\u003eFNR \u0026\u0026 $1 in x {print $1}' $pfx.mca.calls.tsv $pfx.stats.tsv \u003e $pfx.auto.lines\nfor chr in {1..12} {16..20}; do\n  awk -F\"\\t\" -v OFS=\"\\t\" -v chr=$chr 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n    NR\u003e1 \u0026\u0026 ($(f[\"chrom\"])==chr || $(f[\"chrom\"])==\"chr\"chr) \u0026\u0026 $(f[\"p_arm\"])==\"T\" \u0026\u0026 $(f[\"q_arm\"])!=\"T\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n    x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y*y\u003cx*x) print $(f[\"sample_id\"])}' \\\n    $pfx.mca.calls.tsv \u003e $pfx.${chr}p_cnloh.lines\n  awk -F\"\\t\" -v OFS=\"\\t\" -v chr=$chr 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n    NR\u003e1 \u0026\u0026 ($(f[\"chrom\"])==chr || $(f[\"chrom\"])==\"chr\"chr) \u0026\u0026 $(f[\"p_arm\"])!=\"N\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n    x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y\u003ex) print $(f[\"sample_id\"])}' \\\n    $pfx.mca.calls.tsv \u003e $pfx.${chr}p_loss.lines\n  awk -F\"\\t\" -v OFS=\"\\t\" -v chr=$chr 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n    NR\u003e1 \u0026\u0026 ($(f[\"chrom\"])==chr || $(f[\"chrom\"])==\"chr\"chr) \u0026\u0026 $(f[\"p_arm\"])==\"T\" \u0026\u0026 $(f[\"q_arm\"])==\"T\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n    x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y\u003c-x) print $(f[\"sample_id\"])}' \\\n    $pfx.mca.calls.tsv \u003e $pfx.${chr}_gain.lines\ndone\nfor chr in {1..22}; do\n  awk -F\"\\t\" -v OFS=\"\\t\" -v chr=$chr 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n    NR\u003e1 \u0026\u0026 ($(f[\"chrom\"])==chr || $(f[\"chrom\"])==\"chr\"chr) \u0026\u0026 $(f[\"p_arm\"])!=\"T\" \u0026\u0026 $(f[\"q_arm\"])==\"T\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n    x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y*y\u003cx*x) 
print $(f[\"sample_id\"])}' \\\n    $pfx.mca.calls.tsv \u003e $pfx.${chr}q_cnloh.lines\n  awk -F\"\\t\" -v OFS=\"\\t\" -v chr=$chr 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n    NR\u003e1 \u0026\u0026 ($(f[\"chrom\"])==chr || $(f[\"chrom\"])==\"chr\"chr) \u0026\u0026 $(f[\"q_arm\"])!=\"N\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n    x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y\u003ex) print $(f[\"sample_id\"])}' \\\n    $pfx.mca.calls.tsv \u003e $pfx.${chr}q_loss.lines\ndone\nfor chr in 13 14 15 21 22; do\n  awk -F\"\\t\" -v OFS=\"\\t\" -v chr=$chr 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i}\n    NR\u003e1 \u0026\u0026 ($(f[\"chrom\"])==chr || $(f[\"chrom\"])==\"chr\"chr) \u0026\u0026 $(f[\"p_arm\"])==\"C\" \u0026\u0026 $(f[\"q_arm\"])==\"T\" \u0026\u0026 $(f[\"rel_cov\"])\u003e1 {\n    x=$(f[\"bdev\"]); y=(1/($(f[\"rel_cov\"])-1)-1)/2; if (y\u003c-x) print $(f[\"sample_id\"])}' \\\n    $pfx.mca.calls.tsv \u003e $pfx.${chr}_gain.lines\ndone\n```\n\nGenerate aggregate lists of mCA, in part following the work from [Niroula et al. 
2021](http://doi.org/10.1038/s41591-021-01521-4):\n```\ncat $pfx.{{3,4,6,8,11,17,18}p_loss,{1,6,11,13,14,16,17,22}q_loss,13q_cnloh,{2,3,4,5,12,17,18,19}_gain}.lines | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]++} NR\u003eFNR \u0026\u0026 $1 in x {print $1}' - $pfx.stats.tsv \u003e $pfx.cll.lines\ncat $pfx.{{5,12,20}q_loss,{9p,14q,22q}_cnloh,{1,8}_gain}.lines  | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]++} NR\u003eFNR \u0026\u0026 $1 in x {print $1}' - $pfx.stats.tsv \u003e $pfx.myeloid.lines\ncat $pfx.{{1,8,10,17}p_loss,{1,6,7,10,11,13,14,15,22}q_loss,{1q,7q,9q,12q,13q,16p}_cnloh,{2,3,9,12,15,17,18,19,22}_gain}.lines  | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]++} NR\u003eFNR \u0026\u0026 $1 in x {print $1}' - $pfx.stats.tsv \u003e $pfx.lymphoid.lines\n```\n\nGenerate phenotype table with mosaic loss of chromosome X and Y\n```\n(echo -e \"sample_id\\tY_loss\\tX_loss\\tX_loss_high\\tX_loss_low\"\nawk -F\"\\t\" -v OFS=\"\\t\" 'NR==1 {for (i=1; i\u003c=NF; i++) f[$i] = i} NR\u003e1 {print $(f[\"sample_id\"]),$(f[\"computed_gender\"])}' $pfx.stats.tsv | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]=1} NR\u003eFNR {if ($2==\"M\") phe=0+x[$1]; else phe=\"NA\"; print $0,phe}' $pfx.Y_loss.lines - | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]=1} NR\u003eFNR {if ($2==\"F\" || $2==\"K\") phe=0+x[$1]; else phe=\"NA\"; print $0,phe}' $pfx.X_loss.lines - | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]=1} NR\u003eFNR {if ($2==\"F\" || $2==\"K\") phe=0+x[$1]; else phe=\"NA\"; print $0,phe}' $pfx.X_loss_high.lines - | \\\n  awk -F\"\\t\" -v OFS=\"\\t\" 'NR==FNR {x[$1]=1} NR\u003eFNR {if ($2==\"F\" || $2==\"K\") phe=0+x[$1]; else phe=\"NA\"; print $0,phe}' $pfx.X_loss_low.lines - | \\\n  cut -f1,3-) \u003e $pfx.pheno.tsv\n```\n\nInclude other mosaic chromosomal alteration events in the phenotype table\n```\nfor type in auto cnloh gain loss cll myeloid lymphoid {{{1..12},{16..20}}p,{1..22}q}_{cnloh,loss} {1..22}_gain; do\n  n=$(cat 
$pfx.$type.lines | wc -l);\n  if [ \"$n\" -gt 0 ]; then\n    awk -F\"\\t\" -v OFS=\"\\t\" -v type=$type 'NR==FNR {x[$1]=1}\n      NR\u003eFNR {if (FNR==1) col=type; else col=0+x[$1]; print $0,col}' \\\n      $pfx.$type.lines $pfx.pheno.tsv | sponge $pfx.pheno.tsv\n  fi\ndone\n```\n\nCompute allelic shift\n=====================\n\nSome mosaic chromosomal alterations have been observed to affect preferentially some haplotypes causing biased allelic shifts at several loci (MPL, FH, NBN, JAK2, FRA10B, MRE11, ATM, SH2B3, TINF2, TCL1A, DLK1, TM2D3, CTU2). This type of analysis for DNA microarray data requires information about the mosaic chromosomal alterations to be extended across imputed heterozygous genotypes and a binomial test for biased allelic shift to be performed, as has been done before (see Table 1 of [Loh et al. 2018](http://doi.org/10.1038/s41586-018-0321-x), Extended Data Table 1 of [Loh et al. 2020](http://doi.org/10.1038/s41586-020-2430-6), and Table 1 of [Terao et al. 2020](http://doi.org/10.1038/s41586-020-2426-2))\n\nImport allelic shift information from the MoChA output VCF into a VCF file with imputed genotypes (optional for array data)\n```\nbcftools annotate \\\n  --no-version \\\n  --output $dir/$pfx.imp.as.bcf \\\n  --output-type b \\\n  --columns FMT/AS \\\n  $dir/$pfx.imp.bcf \\\n  --annotations $dir/$pfx.as.bcf \\\n  --write-index\n```\n\nRun asymmetry analyses (subset cohort, run binomial test, discard genotypes)\n```\nbcftools +extendFMT \\\n  --no-version -Ou \\\n  --format AS \\\n  --phase \\\n  --dist 500000 \\\n  --regions $reg \\\n  --samples $lst \\\n  $dir/$pfx.imp.as.bcf | \\\nbcftools +mochatools \\\n  --no-version \\\n  --output $dir/$pfx.bal.bcf \\\n  --output-type b \\\n  -- --summary AS \\\n  --test AS \\\n  --drop-genotypes \\\n  --write-index\n```\n\nObserve results for asymmetry analyses in table format\n```\nfmt=\"%CHROM\\t%POS\\t%ID\\t%AS{0}\\t%AS{1}\\t%binom_AS\\n\"\nbcftools query \\\n  --include \"binom_AS\u003c1e-6\" 
\\\n  --format \"$fmt\" \\\n  $dir/$pfx.as.bcf | \\\n  column -ts $'\\t'\n```\n\nPlot results\n============\n\nInstall basic tools (Debian/Ubuntu specific if you have admin privileges)\n```\nsudo apt install r-cran-optparse r-cran-ggplot2 r-cran-data.table r-cran-reshape2\n```\n\nDownload R scripts\n```\n/bin/rm -f $HOME/bin/{summary,pileup,mocha}_plot.R\nwget -P $HOME/bin http://raw.githubusercontent.com/freeseek/mocha/master/{summary,pileup,mocha}_plot.R\nchmod a+x $HOME/bin/{summary,pileup,mocha}_plot.R\n```\n\nGenerate summary plot\n```\nsummary_plot.R \\\n  --stats $dir/$pfx.stats.tsv \\\n  --calls $dir/$pfx.calls.tsv \\\n  --pdf $dir/$pfx.pdf\n```\n\nGenerate pileup plot\n```\npileup_plot.R \\\n  --cytoband $HOME/GRCh37/cytoBand.txt.gz \\\n  --stats $dir/$pfx.stats.tsv \\\n  --calls $dir/$pfx.calls.filtered.tsv \\\n  --pdf $dir/$pfx.pdf\n```\n\nIf you have a table with columns headed `sample_id` and `age` that encodes the age of each sample, you can plot the prevalence of events by age\n```\nage_plot.R \\\n  --stats $dir/$pfx.stats.tsv \\\n  --calls $dir/$pfx.calls.filtered.tsv \\\n  --age $dir/$pfx.age.tsv \\\n  --pdf $dir/$pfx.pdf\n```\nPrevalences will be plotted separately for telomeric autosomal mCAs, interstitial autosomal mCAs, mLOX, and mLOY\n\nPlot mosaic chromosomal alterations (for array data)\n```\nmocha_plot.R \\\n  --mocha \\\n  --stats $dir/$pfx.stats.tsv \\\n  --vcf $dir/$pfx.as.bcf \\\n  --png MH0145622.png \\\n  --samples MH0145622 \\\n  --regions 11:81098129-115077367 \\\n  --cytoband $HOME/GRCh37/cytoBand.txt.gz\n```\nNotice that by default MoChA will perform internal BAF (for array data) and LRR adjustments. These adjustments will be computed by the plotter unless you use the option `--no-adjust`\n\n![](MH0145622.png)\nMosaic deletion from array data overlapping the ATM gene (GRCh37 coordinates). The deletion signal can be observed across LRR, BAF and phased BAF, although it is clearest with the latter. 
Furthermore, evidence of three phase switch errors can be observed in the shifted phased BAF signal\n\nPlot mosaic chromosomal alterations (for WGS data)\n```\nmocha_plot.R \\\n  --wgs \\\n  --mocha \\\n  --stats $dir/$pfx.stats.tsv \\\n  --vcf $dir/$pfx.as.bcf \\\n  --png CSES15_P26_140611.png \\\n  --samples CSES15_P26_140611 \\\n  --regions 1:202236354-211793505 \\\n  --cytoband $HOME/GRCh37/cytoBand.txt.gz\n```\n\n![](CSES15_P26_140611.png)\nComplex duplication overlapping the MDM4 gene (GRCh37 coordinates). Signal over heterozygous sites colored in blue shows evidence of a triplication event and signal over heterozygous sites colored in red shows evidence of a duplication event. Multiple phase switch errors can be observed in the shifted phased BAF signal\n\nHMM parameters\n==============\n\nMoChA has a complicated list of parameters that it uses to assign likelihoods and transition probabilities:\n- [xy-major-pl] transition phred-scaled likelihood where the non-alternate state is towards the centromere\n- [xy-minor-pl] transition phred-scaled likelihood where the non-alternate state is away from the centromere\n- [auto-tel-pl] autosomal telomeres phred-scaled likelihood used to provide a prior for aneuploidy and CN-LOH events reaching the telomere\n- [chrX-tel-pl] chromosome X telomeres phred-scaled likelihood used to provide a prior for mLOX events\n- [chrY-tel-pl] chromosome Y telomeres phred-scaled likelihood used to provide a prior for mLOY events\n- [error-pl] uniform error phred-scaled likelihood used to maximize the amount of evidence a single site can provide in favor of an event\n- [flip-pl] phase flip phred-scaled likelihood used to allow flips between equivalent alternative states corresponding to different phases\n\nThe internal HMM used by MoChA is completely symmetrical, but it internally uses two separate transition likelihoods to alternate states, one when transitioning towards an alternate state away from the centromere and one towards the 
centromere. This allows a lower prior to be included for events that span the centromere, with the exception of events that span a whole chromosome, such as mLOX and mLOY. Notice that these priors will only make a difference for events at very low cell fraction, so they should not be relevant if a study is only interested in high cell fraction events\n\nMoChA will also attempt to split events that span a whole chromosome by comparing the median LRR values on the small and long arms as these will occasionally represent isochromosome events\n\nThere are a few differences between MoChA and the HMM model used in [Loh et al. 2018](http://doi.org/10.1038/s41586-018-0321-x), [Thompson et al. 2019](http://doi.org/10.1038/s41598-020-59963-8), [Loh et al. 2020](http://doi.org/10.1038/s41586-020-2430-6), and [Terao et al. 2020](http://doi.org/10.1038/s41586-020-2426-2). The most important is that the latter model used a likelihood ratio test statistic with likelihoods deriving from a 3-state forward-backward HMM model. MoChA instead uses a Viterbi HMM model with multiple alternate states and it increases the number of alternate states dynamically when trying to assess multiple calls. While MoChA's HMM transition probabilities are symmetrical with respect to centromeres, the HMM in Loh et al. used non-symmetrical transitions and different likelihoods for other parameters:\n- transition likelihood to the diploid state was 0.0003 (PL\\~35.2) on the autosome and 0.0001 (PL=40.0) on chromosome X\n- transition likelihood to an alternative state was 0.004 (PL\\~24.0) on the autosome and 0.08 (PL\\~11.0) on chromosome X\n- no penalty for telomeres likelihood (PL=0.0) except for acrocentric chromosomes where the penalty was 0.2 (PL\\~7.0)\n- flip error probability was 0.001 (PL=30.0)\nAnother difference is that MoChA performs a BAF correction by regressing BAF values against LRR values (similarly to what was done in [Oosting et al\\. 2007](http://doi.org/10.1101/gr.5686107), [Staaf et al\\. 
2008](http://doi.org/10.1186/1471-2105-9-409), and [Mayrhofer et al\\. 2016](http://doi.org/10.1038/srep36158)). A similar correction was separately performed in [Terao et al. 2020](http://doi.org/10.1038/s41586-020-2426-2)\n\nAcknowledgements\n================\n\nThis work is supported by NIH grant [R01 HG006855](http://grantome.com/grant/NIH/R01-HG006855), NIH grant [R01 MH104964](http://grantome.com/grant/NIH/R01-MH104964), NIH grant [R01 MH123451](http://grantome.com/grant/NIH/R01-MH123451), US Department of Defense Breast Cancer Research Breakthrough Award W81XWH-16-1-0316 (project BC151244), and the Stanley Center for Psychiatric Research. This work would not have been possible without the efforts of Heng Li \u003clh3@sanger.ac.uk\u003e, Petr Danecek \u003cpd3@sanger.ac.uk\u003e, John Marshall \u003cjm18@sanger.ac.uk\u003e, James Bonfield \u003cjkb@sanger.ac.uk\u003e, and Shane McCarthy \u003csm15@sanger.ac.uk\u003e for writing HTSlib and BCFtools, Po-Ru Loh \u003cporuloh@broadinstitute.org\u003e for writing the Eagle phasing software, Olivier Delaneau \u003colivier.delaneau@gmail.com\u003e for writing the SHAPEIT5 phasing software, and Simone Rubinacci \u003crubinacci.simone@gmail.com\u003e for writing the IMPUTE5 imputation software\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffreeseek%2Fmocha","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffreeseek%2Fmocha","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffreeseek%2Fmocha/lists"}