{"id":13752228,"url":"https://github.com/lh3/psmc","last_synced_at":"2025-05-07T08:12:45.263Z","repository":{"id":1337653,"uuid":"1283557","full_name":"lh3/psmc","owner":"lh3","description":"Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model","archived":false,"fork":false,"pushed_at":"2022-11-21T04:39:31.000Z","size":100,"stargazers_count":166,"open_issues_count":53,"forks_count":60,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-05-07T08:12:38.875Z","etag":null,"topics":["bioinformatics","genomics","population-genetics"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-01-23T01:57:31.000Z","updated_at":"2025-04-21T03:19:37.000Z","dependencies_parsed_at":"2023-01-13T11:15:12.323Z","dependency_job_id":null,"html_url":"https://github.com/lh3/psmc","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpsmc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpsmc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpsmc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpsmc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/psmc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252839296,"owners_count
":21812090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genomics","population-genetics"],"created_at":"2024-08-03T09:01:01.845Z","updated_at":"2025-05-07T08:12:45.238Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"This software package infers population size history from a diploid sequence\nusing the Pairwise Sequentially Markovian Coalescent (PSMC) model. The\ndetailed model is described in file `psmc.tex'.\n\nTo compile the binaries, you may run\n\n    make; (cd utils; make)\n\nAfter that, you may try\n\n    utils/fq2psmcfa -q20 diploid.fq.gz \u003e diploid.psmcfa\n    psmc -N25 -t15 -r5 -p \"4+25*2+4+6\" -o diploid.psmc diploid.psmcfa\n    utils/psmc2history.pl diploid.psmc | utils/history2ms.pl \u003e ms-cmd.sh\n    utils/psmc_plot.pl diploid diploid.psmc\n\nwhere `diploid.fq.gz' is typically the whole-genome diploid consensus sequence\nof one human individual, which can be generated by, for example:\n\n    samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \\\n      | vcfutils.pl vcf2fq -d 10 -D 100 | gzip \u003e diploid.fq.gz\n\nHere option -d sets and minimum read depth and -D sets the maximum. It is\nrecommended to set -d to a third of the average depth and -D to twice.  
Program\n`fq2psmcfa' transforms the consensus sequence into a fasta-like format where\nthe i-th character in the output sequence indicates whether there is at least\none heterozygote in the bin [100i, 100i+100).\n\nFor dipcall output, you may use the following to generate psmcfa:\n\n    seqtk mutfa ref.fa \u003c(gzip -dc prefix.dip.vcf.gz|utils/vcf2snp.pl -) \\\n      | seqtk seq -cM prefix.dip.bed -l80 | utils/fq2psmcfa - \u003e prefix.psmcfa\n\nProgram `psmc' infers the population size history. In particular, the `-p'\noption specifies that there are 64 atomic time intervals and 28 (=1+25+1+1)\nfree interval parameters. The first parameter spans the first 4 atomic time\nintervals, each of the next 25 parameters spans 2 intervals, the 27th spans 4\nintervals and the last parameter spans the last 6 time intervals. The `-p' and\n`-t' options are manually chosen such that after 20 rounds of iterations, at\nleast ~10 recombinations are inferred to occur in the intervals each parameter\nspans. Inappropriate settings may lead to overfitting. The command line in the\nexample above has been shown to be suitable for modern humans.\n\nThe `psmc' program infers the scaled mutation rate, the recombination rate and\nthe free population size parameters. All these parameters are scaled to 2N_0. You\nmay run `psmc2history.pl' combined with `history2ms.pl' to generate the ms\ncommand line that simulates the history inferred by PSMC, or visualize the result\nwith `psmc_plot.pl'.\n\nTo perform bootstrapping, one has to run splitfa first to split long chromosome\nsequences into shorter segments. When the `-b' option is applied, psmc will then\nrandomly sample with replacement from these segments.
As an example, the\nfollowing command lines perform 100 rounds of bootstrapping:\n\n    utils/fq2psmcfa -q20 diploid.fq.gz \u003e diploid.psmcfa\n    utils/splitfa diploid.psmcfa \u003e split.psmcfa\n    psmc -N25 -t15 -r5 -p \"4+25*2+4+6\" -o diploid.psmc diploid.psmcfa\n    seq 100 | xargs -i echo psmc -N25 -t15 -r5 -b -p \"4+25*2+4+6\" \\\n        -o round-{}.psmc split.psmcfa | sh\n    cat diploid.psmc round-*.psmc \u003e combined.psmc\n    utils/psmc_plot.pl -pY50000 combined combined.psmc\n\nOne probably wants to modify the \"xargs\" command-line to parallelize PSMC.\n\nIf you have questions about PSMC, please ask at \u003chttp://hengli.uservoice.com/\u003e.\nYou do not need to register unless you also want to modify your own questions.\nYou may also post comments on GitHub (if you have a GitHub account). I want to\nmake the question and the answer public such that others can see them and I do\nnot need to answer the same question multiple times. Thank you for using PSMC.\n\n\nAPPENDIX I: Scaling the PSMC output\n===================================\n\nThe PSMC output is scaled to 2N_0. There are two ways of rescaling the time\nand the population size more meaningfully.\n\nFirstly, if we know the per-site per-generation mutation rate \\mu, we can\ncompute N_0 as:\n\n  N_0 = \\theta_0 / (4\\mu) / s\n\nwhere \\theta_0 is given in the 2nd column of \"TR\" lines, and s is the bin size\nwe use for generating the PSMC input. Knowing N_0, we can scale time to\ngenerations and relative population size to effective size by\n\n  T_k = 2N_0 * t_k\n  N_k = N_0 * \\lambda_k\n\nwhere t_k and \\lambda_k are given in the 3rd and 4th columns of \"RS\" lines,\nrespectively.\n\nA problem with the above strategy is that we do not know a definite value of\n\\mu and in fact it varies with regions and mutation types.
An alternative way\nis to use per-site pairwise sequence divergence to represent time:\n\n  d_k = 2\\mu * T_k = t_k * \\theta_0 / s\n\nand use scaled mutation rate to represent population size:\n\n  \\theta_k = 4N_k * \\mu = \\lambda_k * \\theta_0 / s\n\nwhere, again, t_k and \\lambda_k are given on the \"RS\" lines, \\theta_0 on the\n\"TR\" line and s is the bin size, which defaults to 100 in fq2psmcfa.\n\n\nAPPENDIX II: Correcting for low coverage\n========================================\n\nFor diploid genomes sequenced to low coverage, heterozygotes will be randomly\nlost due to the lack of coverage of both alleles. This has the same effect as a\nsmaller mutation rate and can be corrected. If you know the fraction of hets\nmissed due to low coverage, you can generate the PSMC plot with:\n\n  psmc_plot.pl -M \"sample1=0.1,sample2=0.2\" prefix sample1.psmc sample2.psmc\n\nThis says that sample1 has a 10% false negative rate (FNR) on hets and sample2\nhas 20%. The plotting script does not correct FNR for bootstrapping. If you\nwant to plot the result with your own scripts, you can increase \\theta_0 to\n\\theta_0/(1-FNR).\n\nUnfortunately, I haven't found a reliable way to estimate the background FNR\nrelevant to PSMC. The simple and unscientific approach is to align the PSMC\ncurves by eye. Probably a better solution is to downsample a high-coverage\nsample to a certain coverage and measure the FNR. I have not done this.\n\nNot correcting for low coverage is the most common pitfall when using PSMC.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fpsmc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fpsmc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fpsmc/lists"}