{"id":13752284,"url":"https://github.com/lh3/pangene","last_synced_at":"2025-05-09T00:05:44.363Z","repository":{"id":179572307,"uuid":"652343666","full_name":"lh3/pangene","owner":"lh3","description":"Constructing a pangenome gene graph","archived":false,"fork":false,"pushed_at":"2025-04-02T12:03:54.000Z","size":346,"stargazers_count":186,"open_issues_count":5,"forks_count":12,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-09T00:05:37.298Z","etag":null,"topics":["bioinformatics","pangenome"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-11T21:04:14.000Z","updated_at":"2025-05-07T13:03:23.000Z","dependencies_parsed_at":"2023-10-25T02:26:30.232Z","dependency_job_id":"7a5ebe24-c394-47aa-89f7-4ebea689a809","html_url":"https://github.com/lh3/pangene","commit_stats":null,"previous_names":["lh3/pangene"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpangene","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpangene/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpangene/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fpangene/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/pangene/tar.gz/refs/heads/main","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":253166518,"owners_count":21864476,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","pangenome"],"created_at":"2024-08-03T09:01:03.025Z","updated_at":"2025-05-09T00:05:44.317Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":["A list of software capable of analyzing mainly **eukaryotic** genomes for pangenomics.","Ranked by starred repositories"],"sub_categories":[],"readme":"## \u003ca name=\"started\"\u003e\u003c/a\u003eGetting Started\n```sh\n# Check prebuilt graphs at https://pangene.bioinweb.org\n\n# Install pangene\ngit clone https://github.com/lh3/pangene\ncd pangene \u0026\u0026 make\n\n# Alternatively, download the precompiled binaries for arm-mac and x64-linux\ncurl -L https://github.com/lh3/pangene/releases/download/v1.1/pangene-1.1-bin.tar.bz2|tar jxf -\n\n# The C4 example with provided alignment\n./pangene test/C4/*.paf.gz \u003e C4.gfa        # generate the graph\nk8 pangene.js call C4.gfa \u003e C4.bubble.txt  # identify bubbles\n\n# Deploy the GFA server on the C4 example; require pangene-1.1-bin\ncd pangene-1.1-bin                         # run gfa-server in this directory\nbin_arm64-mac/gfa-server /path/to/C4.gfa   # or use bin_linux-x64 on x64-linux\n# open \"http://127.0.0.1:8000/view?gene=C4A,C4B\" in a web browser\n\n# Deploy the GFA server on the human graph\nbin_arm64-mac/gfa-server -d html data/*.gfa.gz 2\u003e server.log \u0026\n# open \"http://127.0.0.1:8000\" in a web browser\n\n# Align proteins to each genome (general 
use cases; no example data)\nminiprot --outs=0.97 --no-cs -Iut16 genome1.fna proteins.faa \u003e genome1.paf\nminiprot --outs=0.97 --no-cs -Iut16 genome2.fna proteins.faa \u003e genome2.paf\n\n# Construct a pangene graph\npangene genome1.paf genome2.paf \u003e graph.gfa\n\n# Check manpage\nman ./pangene.1\n```\n\n## Table of Contents\n\n- [Getting Started](#started)\n- [Introduction](#intro)\n- [Graph Construction](#build)\n  - [Preparing a protein set](#prep-aa)\n  - [Aligning proteins to genomes](#align-aa)\n  - [Constructing a pangene graph](#build-graph)\n  - [Analyzing a graph](#analyze)\n- [Graph Visualization](#visual)\n- [Citation](#cite)\n- [Limitations](#limit)\n\n## \u003ca name=\"intro\"\u003e\u003c/a\u003eIntroduction\n\nPangene is a command-line tool to construct a pangenome gene graph. In this\ngraph, a node represents a marker gene and an edge between two genes indicates\ntheir genomic adjacency on input genomes. Pangene takes the [miniprot][mp]\nalignments between a protein set and multiple genomes and produces a graph in\nthe GFA format. It attempts to reduce the redundancy in the input proteins and\nfilter spurious alignments while preserving close but non-identical paralogs.\nThe output graph can be visualized in generic GFA viewers such as\n[BandageNG][bandage] or via a [web interface](#visual). Users can explore local\nhuman subgraphs at a [public server][server]. Prebuilt pangene graphs can be\nfound at [DOI:10.5281/zenodo.8118576][zenodo].\n\n\u003c!--\nBacterial pangenome tools such as [panaroo][panaroo] often leverage gene graphs\nto build bacterial pangenomes. Pangene is different in that it uses miniprot to\ninfer gene models. This makes pangene applicable to large Eukaryotic pangenomes\nand robust to imperfect gene annotations.\n--\u003e\n\n## \u003ca name=\"build\"\u003e\u003c/a\u003eGraph Construction\n\nPangene takes a list of protein-to-genome alignments as input. 
To generate\nthese alignments, you need to align the same set of proteins to multiple\ngenomes. Choosing the protein set can be tricky.\n\n### \u003ca name=\"prep-aa\"\u003e\u003c/a\u003ePreparing a protein set\n\nFor constructing a human pangene graph, the simplest choice is to use annotated\ngenes in GRCh38. It is highly recommended to name a protein sequence like\n`RGPD6:ENSP00000512633.1` where `RGPD6` is the gene name and\n`ENSP00000512633.1` is the unique protein identifier. In the output GFA, nodes\nare named after genes, so you would want to use human-readable gene names for\nvisualization later. You may use the following command line to extract protein\nsequences from Ensembl or GenCode annotation:\n```sh\nk8 pangene.js getaa gene-anno.gtf protein-seq.faa \u003e proteins.faa\n```\n\nWith pangene, different isoforms or diverged alleles of the same gene can be\npresent in the protein set, though in practice, we find selecting the canonical\nisoform per gene tends to give a cleaner graph, possibly due to\nannotation errors among rare isoforms. For the GenCode annotation, use `getaa\n-c` to extract canonical isoforms only.\n\nFor a new species without good gene annotation, you may use protein annotations\nfrom a closely related species. You may pool proteins from multiple closely\nrelated species as well. Pangene aims to work with such input, but this use case\nhas not been thoroughly evaluated. Given a bacterial pangenome of the\nsame species, you may predict genes with existing tools, cluster them with\nCD-HIT or MMseqs2 and feed the representative protein in each cluster to\npangene.\n\n### \u003ca name=\"align-aa\"\u003e\u003c/a\u003eAligning proteins to genomes\n\nPangene currently only works with miniprot's PAF output. 
We usually use the\nfollowing command line:\n```sh\nminiprot --outs=0.97 --no-cs -Iut16 genomeX.fna proteins.faa \u003e genomeX.paf\n```\nFor bacterial data, add `-S` to disable splicing.\n\n### \u003ca name=\"build-graph\"\u003e\u003c/a\u003eConstructing a pangene graph\n\nThe following command line constructs a pangene graph:\n```sh\npangene *.paf \u003e graph.gfa\n```\nIf the output graph is cluttered in the Bandage viewer, you may add option\n`-a2` to filter out edges supported by a single genome. By default, pangene\nfilters out genes occurring in fewer than 5% of the genomes after deredundancy.\nIf you want to retain low-frequency genes, add `-p0` to disable the filter.\n\n### \u003ca name=\"analyze\"\u003e\u003c/a\u003eAnalyzing a graph\n\nThe GFA file is the master output. You can extract various information from\nthis file. You may find local gene-level variations with\n```sh\nk8 pangene.js call graph.gfa \u003e bubble.txt\n```\nor get the presence/absence of each gene with\n```sh\nk8 pangene.js gfa2matrix graph.gfa \u003e gene_presence_absence.Rtab\n```\n\n## \u003ca name=\"visual\"\u003e\u003c/a\u003eGraph Visualization\n\nYou can look at the entire graph in the Bandage GFA viewer. If you are\ninterested in a particular gene, it is best to set up gfa-server, which is part\nof [gfatools][gfatools]. [Here][server] is a public server for human genes.\nYou can deploy this server on your machine with\n```sh\ncurl -L https://github.com/lh3/pangene/releases/download/v1.1/pangene-1.1-bin.tar.bz2|tar jxf -\ncd pangene-1.1-bin\nbin_arm64-mac/gfa-server -d html data/*.gfa.gz 2\u003e server.log # for Mac\n```\nThen you can open the link `http://127.0.0.1:8000/` in your browser, type gene\nnames and visualize a local subgraph around the desired genes.\n\nA filled-in coloured block represents a protein that is mapped to the genome by miniprot and scores above the set threshold. 
Empty blocks showing only the coloured frame may indicate stop codons in the reading frame (see Fig 6 in the arXiv preprint). Sensitivity to insertions into the genome is limited, so they may or may not disrupt the visualization. Genomic deletions are more clearly visible.\n\n## \u003ca name=\"cite\"\u003e\u003c/a\u003eCitation\n\nIf you use pangene in your work, please consider citing:\n\n\u003e H Li, M Marin, MR Farhat (2024) Exploring gene content with pangenome gene graphs,\n\u003e arXiv:2402.16185 [link](https://arxiv.org/pdf/2402.16185)\n\n## \u003ca name=\"limit\"\u003e\u003c/a\u003eLimitations\n\n* Pangene only works with [miniprot][mp]'s PAF output.\n\n* In the output graph, arcs on W-lines may be absent from L-lines.\n\n[mp]: https://github.com/lh3/miniprot\n[bandage]: https://github.com/asl/BandageNG\n[gfatools]: https://github.com/lh3/gfatools\n[gfaview]: https://lh3.github.io/gfatools/\n[panaroo]: https://github.com/gtonkinhill/panaroo\n[asub]: https://github.com/lh3/asub\n[zenodo]: https://doi.org/10.5281/zenodo.8118576\n[server]: https://pangene.bioinweb.org\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fpangene","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fpangene","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fpangene/lists"}