{"id":21015695,"url":"https://github.com/gersteinlab/aloft","last_synced_at":"2025-05-15T05:32:29.536Z","repository":{"id":146424318,"uuid":"10828244","full_name":"gersteinlab/aloft","owner":"gersteinlab","description":"ALOFT, the Annotation Of  Loss-of-Function Transcripts, provides extensive functional annotations to loss-of-function variants in the human genome.","archived":false,"fork":false,"pushed_at":"2019-11-04T01:27:16.000Z","size":30411,"stargazers_count":18,"open_issues_count":1,"forks_count":3,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-03-26T12:26:38.601Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://aloft.gersteinlab.org/","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gersteinlab.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2013-06-20T19:56:50.000Z","updated_at":"2021-07-30T19:41:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"311fee87-862f-4762-9673-fe945aa133ad","html_url":"https://github.com/gersteinlab/aloft","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2Faloft","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2Faloft/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2Faloft/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2Faloft/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gersteinlab","download_url":"h
ttps://codeload.github.com/gersteinlab/aloft/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225332381,"owners_count":17457710,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T10:10:59.580Z","updated_at":"2024-11-19T10:11:00.154Z","avatar_url":"https://github.com/gersteinlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"###############################################################################\n \n                 ALoFT Annotation of Loss-of-Function Transcripts\n                               README DOCUMENTATION\n \n               Version 1.0                           Gerstein Lab\n              Released 2013                Molecular Biophysics \u0026 Biochemistry\n                                                    Yale University\nCitation: \nContact: Dr. Suganthi Balasubramanian (suganthibala@gmail.com)\n         Mayur Pawashe (mpawashe@gmail.com)\n         Jeremy Liu (jeremy.liu@yale.edu)\n\n###############################################################################\n\n\nTable of Contents\nA. Preface\nB. Description\nC. System Requirements\nD. Input Files\nE. Output Files\nF. Usage\nG. ALoFT Features in VCF Output\nH. ALoFT Features in Tabbed Delineated Output\nI. Options\n\n\nA. Preface\nInstallation instructions can be found in INSTALL.\nQuick Start instructions can be found in QUICK START.\n\n\nB. 
Description\nALoFT takes as input a VCF (Variant Call Format) file, runs it through the \nVariant Annotation Tool (VAT) programs snpMapper and indelMapper, and runs \nthe sorted VAT output through aloft, which calculates additional \nvariant-specific features that give functional, evolutionary, mismapping, and \nother information. ALoFT calculates these features only for frameshift indels, \nloss-of-function (LoF) SNPs, and SNPs located in splice sites. ALoFT \nis split into separate modules based on features.\n\n\nC. System Requirements\nPython 2.7.x or Python 3.x\nLinux (64-bit) or OSX 10.6 (64-bit) and up\n\nD. Input Files (Required):\n1) VCF file containing unannotated variants, passed in using the --vcf option.\nAlternatively, an annotated VCF file output by a separate manual run of VAT \nshould be passed in with the --vat option. See options section below.\n\n2) Reference files (defaults already in place after installation). \nFor a complete list, see the section below regarding options and default values.\n\n\nE. Output Files:\nAn output directory may be specified with the --output option.\nOtherwise the output files will be written in the ./aloft_output directory.\n\nUp to four files are created in this output directory:\n\n1) A VCF file called vat_output.vcf, which VAT generates if --vcf is supplied.\nThis intermediate file is then run through aloft to produce the following \nthree files. The variants in this file are listed in order, starting with\nthe lowest position on chromosome 1.\n\n2) Output VCF file named \u003cinput_file_name\u003e.aloft.vcf\nContains a culled, formatted list of the information calculated by VAT and \nthe features calculated by ALoFT, in Variant Call Format, for both LoF \nvariants and splice site variants.\n\n3) Tab-delimited file named \u003cinput_file_name\u003e.aloft.lof\nContains the subset of VAT output associated with LoF SNPs and frameshift \nindels. 
All information from the VAT file is included, plus ALL features \ncalculated by aloft for LoF variants.\nVariants in the input file that intersect coding exons are included.\nFor variants with multiple transcripts, the variant data is calculated and \noutput on a separate line for each transcript.\n\n4) Tab-delimited file named \u003cinput_file_name\u003e.aloft.splice\nContains the subset of VAT output associated with splice site SNPs.  \nAll information from the VAT file is included, plus ALL features \ncalculated by aloft for splice site SNPs.  Variants with multiple \ntranscripts are output on multiple rows, one for each combination of \nalternate allele and affected transcript.\nVariants in the input file that intersect splice sites are included.\n\n\nF. Usage\nALoFT can be invoked as follows:\n $ cd aloft\n $ ./aloft --vcf=/path/to/file --data=path/to/data/dir --output=path/to/dir [--option4=arg1]...\n\nExamples:\n $ ./aloft --vcf=vcf_file.vcf\t(unannotated vcf file)\nOR \n $ ./aloft --vat=vat_file.vcf\t(VAT annotated vcf file)\nOR \n $ ./aloft --vcf=vcf_file.vcf --output=/path/to/directory   \n   (unannotated vcf file and custom output destination)\n\n\nG. 
ALoFT Features in VCF Output\nALoFT retains the VCF metaheader and the variant information and details\nfrom the input file.\nALoFT appends a subset of the features listed in Tabbed Delineated Output,\nin VCF form.\n\n1) The following features are listed for all variants (lof and splice):\n- AA, Ancestral, AF, AMR_AF, ASN_AF, AFR_AF, EUR_AF, VT, SNPSOURCE, AC,\nAN, AVGPOST, ERATE, LDAF, RSQ, THETA, VA are annotations from VAT.\nThese should be listed and described in the first portion of the VCF metadata.\n- Ancestral: Determines whether ancestral allele is the same as reference\n[VERIFY THIS]\n- GERPscore: Gives the GERP score associated with the variant position\n- SegDup: Gives the number of segmental duplications associated with the \nvariant position.\n- 1000GPhase1(|_AF|_ASN_AF|_AFR_AF|_EUR_AF): 1000Genomes Phase 1 allele freqs\n\tblank: Yes or No, depending on whether associated 1000 Genomes allele\n\t\tfrequencies exist\n\tAF: overall allele frequencies\n\tASN_AF: Asian subset allele frequencies\n\tAFR_AF: African subset allele frequencies\n\tEUR_AF: European subset allele frequencies\n- ESP6500(|_AAF): ESP6500 allele frequencies\n\tblank: Yes or No, depending on whether associated ESP6500 allele\n\t\tfrequencies exist\n\tAAF: ancestral allele frequencies\n- GERPelement: YES if variant/transcript has associated GERPelement and\nNO otherwise.\n- exoncounts: Gives the number of exons in the transcript and the number\nof exons truncated by the premature stop, inclusive. 
For splice site variants\nthe number of exons truncated is replaced with the string \"NA\".\n\n2) These features are listed for lof variants, in addition to those in 1):\n- nearstart: YES if variant in first coding exon, NO otherwise.\n- nearend: YES if variant in last coding exon, NO otherwise.\n- canonical: YES if the 5' flanking splice site and the 3' flanking splice \nsite are both canonical, NO otherwise.\n- XX/XX: \u003c5' flanking splice site\u003e:\u003c3' flanking splice site\u003e of the exon that\nthe variant intersects, as reported in the reference genome.\n- lofposition: calculated one-indexed coding sequence position at which the\npremature stop occurs in the alternate sequence.\n- nmd: YES if the premature stop leads to nonsense mediated decay, NO if not.\n- lof_anc: YES if ancestral allele alternate sequence leads to premature stop\nand NO otherwise.\n- heavilyduplicated: YES if the number of duplicated regions (that the variant\nis in) is high, NO otherwise\n- disorder_prediction: Gives the percentage of residues that are disordered in\nthe reference sequence and the percentage of residues that are disordered in the\ntruncated alternate sequence (after the premature stop) as \u003c%\u003e:\u003c%\u003e. If the\ntranscript is not associated with disordered regions, \".\" is output.\n- PF: PFAM protein domains. YES if variant intersects region, NO if no \nintersection and NA if no regions exist for the particular transcript.\n- SSF: SSF protein domains. YES if variant intersects region, NO if no \nintersection and NA if no regions exist for the particular transcript.\n- SM: SM protein domains. YES if variant intersects region, NO if no \nintersection and NA if no regions exist for the particular transcript.\n- Tmhmm: Transmembrane helix domains. YES if variant intersects region, NO if\nno intersection and NA if no regions exist for the particular transcript.\n- Sigp: Signal peptide domains. 
YES if variant intersects region, NO if no \nintersection and NA if no regions exist for the particular transcript.\n- PTM: Post translational modifications. YES if variant intersects any \npost translational modification regions, NO if no intersections and NA \nif no regions exist for the particular transcript.\n\n3) These features are listed for splice variants, in addition to those in 2):\n- XX/XX: \u003cacceptor_site\u003e:\u003cdonor_site\u003e splice sites in reference genome. \n- is_canonical: YES if the splice site that the variant intersects is\ncanonical, NO if not\n- other_canonical: YES if the other splice site in the intron (the one not\nintersected by the variant) is canonical, NO otherwise.\n- intron_length: Gives the length of the intron whose splice site the variant\nintersects.\n- small_intron: YES if the length of the intron is less than 15bp, NO otherwise.\n- heavily_duplicated: YES if the variant region is heavily duplicated, NO\notherwise.\n- lof_anc: YES if the ancestral alternate sequence leads to a premature stop\nand NO otherwise.\n- alternate_acceptor_site: YES if there are potential neighboring splice sites\nthat could replace a malfunctioning splice site at the variant location and\nNO otherwise. This is the NAGNAG case.\n\n\nH. ALoFT Features in Tabbed Delineated Output\n1) ALoFT calculates the following features for all variants:\n- VAT Features: includes all features from VAT snpMapper and indelMapper.\nThis includes allele frequencies, variant type, etc. 
This is the \"details\"\ncolumn in the tabbed delineated output and the first part of the details\nsection of each transcript in the vcf output.\n- partial/full: full if all transcripts of the affected gene are affected, \npartial otherwise\n- transcript length: length of affected transcript in nucleotides\n- longest transcript?: YES if transcript is the longest transcript affected by \nvariant, NO otherwise\n- shortest path to recessive gene: length of the shortest path to a \nrecessive gene in the protein interaction network\n- recessive neighbors: gives the gene id of the closest recessive gene\n[THIS NEEDS VERIFICATION]\n- shortest path to dominant gene: length of the shortest path to a \ndominant gene in the protein interaction network\n- dominant neighbors: gives the gene id of the closest dominant gene\n[THIS NEEDS VERIFICATION]\n- GERP score: associated GERP score of the variant position\n- GERP element: associated GERP element of the variant position\n- GERP rejection: associated GERP rejection score of the variant position\n- exon counts: number of exons associated with the variant transcript\n- Segmental duplications: Gives the positions of associated segdups as a \nbracketed list, or a period if none exist.\n- 1000GPhase1(|_AF|_ASN_AF|_AFR_AF|_EUR_AF): 1000Genomes Phase 1 allele freqs\n\tblank: Yes or No, depending on whether associated 1000 Genomes allele\n\t\tfrequencies exist\n\tAF: overall allele frequencies\n\tASN_AF: Asian subset allele frequencies\n\tAFR_AF: African subset allele frequencies\n\tEUR_AF: European subset allele frequencies\n- ESP6500(|_AAF): ESP6500 allele frequencies\n\tblank: Yes or No, depending on whether associated ESP6500 allele\n\t\tfrequencies exist\n\tAAF: ancestral allele frequencies\n- # pseudogenes associated to transcript: \nnumber of pseudogenes or a period if none.\n- # paralogs associated to gene: number of paralogs or a period if none\n- dN/dS (macaque): Evolutionary score in comparison to macaque species\n- dN/dS (mouse): Evolutionary score in comparison to mouse 
species\n\n2) ALoFT calculates the following features for LoF SNPs and indels:\n- is single coding exon?: YES if the variant intersects a transcript with \nonly one coding exon, NO if the variant does not.\n- indel position in CDS: gives the one-indexed position of the indel in \nthe coding sequence\n- stop position in CDS: gives the one-indexed position of the premature stop\nin the coding sequence\n- causes NMD?: YES if the variant causes nonsense mediated decay, predicted \nfrom the proximity of the premature stop to the last coding exon (threshold \nof 50 base pairs by default). NO if the variant does not lead to nonsense \nmediated decay.\n- 5' flanking splice site: Gives the upstream 5' splice site of the exon that\nthe variant intersects\n- 3' flanking splice site: Gives the downstream 3' splice site of the exon \nthat the variant intersects\n- canonical?: YES if the 5' flanking splice site is 'AG' and the 3' flanking \nsplice site is 'GT', NO otherwise.\n- # of failed filters: number of call filters that the loss of function \nvariant failed\n- filters failed: list of the call filters that the variant failed\n\theavily_duplicated: if many segmental duplications exist\n\tlof_anc: the ancestral allele leads to loss of function\n\tnear_start: variant is in the first coding exon of the transcript\n- ancestral allele: Gives the nucleotide at the variant position in\nthe ancestral reference genome\n- Disorder prediction: Gives the percentage of disordered residues in the \ntranslated nucleotide sequence. Also gives the percentage of disordered\nresidues in the translated nucleotide sequence after the truncation caused \nby a premature stop. Or a . 
if variant does not have disorder regions.\n\n** Protein Families \u0026 Post Translational Modifications **\nFor the following features, \n\tthe region_id:count is output if variant intersects feature region\n\tNO_\u003cfeature\u003e is output if variant does not intersect feature regions\n\tNA_\u003cfeature\u003e is output if variant's transcript has no feature regions\n- PF and PFtruncated: Determines whether the variant intersects or truncates \na PFAM protein segment.\n- SSF and SSFtruncated: Determines whether the variant intersects or \ntruncates an SSF protein segment.\n- SM and SMtruncated: Determines whether the variant intersects or truncates \nan SM protein segment.\n- Tmhmm and Tmhmmtruncated: Determines whether the variant intersects or \ntruncates a Tmhmm protein segment.\n- Sigp and Sigptruncated: Determines whether the variant intersects or\ntruncates a Sigp protein segment.\n- ACETYLATION(truncated), METHYLATION(truncated), PHOSPHORYLATION(truncated):\nPost translational modifications determined from PhosphoSite. \nDetermines whether the premature stop variant intersects or truncates \npost translational modification sites. \nSite types include acetylation, (mono/di/tri)methylation, O-GlcNAc, \nphosphorylation, sumoylation, and ubiquitination. 
\n\n3) For splice SNPs only:\n- Donor: Gives the nucleotide sequence of the donor splice site\n- Acceptor: Gives the nucleotide sequence of the acceptor splice site\n- SNP in canonical site?: Determines whether the splice site that the SNP\nintersects is canonical (YES/NO)\n- Other splice site canonical?: Determines whether the other splice site,\nthat the SNP does not intersect, is canonical (YES/NO)\n- SNP location: Determines whether the SNP intersects the donor or acceptor\n- Alt donor: Gives the nucleotide sequence of the donor splice site after\nthe SNP change has been made\n- Alt acceptor: Gives the nucleotide sequence of the acceptor splice site \nafter the SNP change has been made\n- NAGNAG positions: The NAGNAG case. Gives possible canonical alternative \nsplice sites near the SNP location.\n- Intron length: Gives the length of the intron bracketed by the donor and\nacceptor splice sites.\n- # filters failed: number of call filters that the splice site variant\nfailed\n- Filters failed: list of the call filters that the variant failed\n\tref_noncanonical: reference splice site, that the variant intersects, \n\t\tis noncanonical\n\talt_noncanonical: alternate splice site, that the variant intersects, \n\t\tis noncanonical\n\tother_noncanonical: other reference splice site, that the variant does\n\t\tnot intersect, is noncanonical\n\theavily_duplicated: variant flagged if the variant intersects a \n\t\theavily duplicated region\n\tshort_intron: variant flagged if the intron is shorter than 15bp\n\n\nI. Options\nALoFT recognizes the following options for altering input and reference\nfiles (default values given).\nALoFT comes packaged with most of the necessary reference files.\n\n--version\nWill output the ALoFT version number.\n\n--vcf=\"\"\nSpecifies path to VCF input file.  Set to empty string by default.  
If none is \nspecified, ALoFT will try to skip VAT and run directly on the file \ngiven to the --vat option. Either this option or --vat is required for \nproper execution.\n\n--vat=aloft_output/vat_output.vcf\nSpecifies path to VAT output file to run aloft on.\n\n--cache=cache/\nSpecifies path to directory containing cache of GERP score information and \nprotein-protein interaction information.  \nDirectory will be created if it doesn't already exist.\n\n--nmd_threshold=50\nDistance from premature stop to last exon-exon junction; used to predict\nNMD. Default distance is 50bp.\n\n--output=aloft_output/\nSpecifies path to tabbed output files and VCF file from ALoFT.\n\n--data=data/\nSpecifies path to data directory containing a data.txt file and other data dependencies.\ndata.txt contains paths to all data files that ALoFT requires.\nSee data/data.txt bundled with ALoFT for more information on these files.\n\n--verbose\nWill run ALoFT in verbose mode.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgersteinlab%2Faloft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgersteinlab%2Faloft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgersteinlab%2Faloft/lists"}