{"id":13710398,"url":"https://github.com/mortazavilab/TALON","last_synced_at":"2025-05-06T19:31:00.120Z","repository":{"id":41001155,"uuid":"127962663","full_name":"mortazavilab/TALON","owner":"mortazavilab","description":"Technology agnostic long read analysis pipeline for transcriptomes","archived":false,"fork":false,"pushed_at":"2024-01-25T22:12:20.000Z","size":11996,"stargazers_count":136,"open_issues_count":22,"forks_count":31,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-13T20:40:45.638Z","etag":null,"topics":["oxford-nanopore","pacbio","transcript-quantification","transcriptome"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mortazavilab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-04-03T20:07:02.000Z","updated_at":"2024-11-06T00:18:39.000Z","dependencies_parsed_at":"2024-01-25T23:49:19.035Z","dependency_job_id":null,"html_url":"https://github.com/mortazavilab/TALON","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mortazavilab%2FTALON","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mortazavilab%2FTALON/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mortazavilab%2FTALON/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mortazavilab%2FTALON/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mortazavilab","download_url":"https://co
deload.github.com/mortazavilab/TALON/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252753222,"owners_count":21798935,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["oxford-nanopore","pacbio","transcript-quantification","transcriptome"],"created_at":"2024-08-02T23:00:55.580Z","updated_at":"2025-05-06T19:30:55.081Z","avatar_url":"https://github.com/mortazavilab.png","language":"Python","funding_links":[],"categories":["Software packages"],"sub_categories":["Transcript discovery and quantification"],"readme":"# TALON\n\u003cimg align=\"left\" width=\"450\" src=\"figs/TALON.png\"\u003e\n\nTALON is a Python package for identifying and quantifying known and novel genes/isoforms\nin long-read transcriptome data sets. TALON is technology-agnostic in that it\nworks from mapped SAM files, allowing data from different sequencing platforms\n(i.e. PacBio and Oxford Nanopore) to be analyzed side by side.\n\n## Table of contents\n* [Installation](#installation)\n* [Running TALON](#how_to_run)\n  * [Flagging reads for internal priming](#label_reads)\n  * [Initializing a TALON database](#db_init)\n  * [Annotating reads with TALON](#run_talon)\n* [Working with the TALON results](#talon_utils)\n  * [Accessing abundance information](#talon_abundance)\n  * [Filtering transcript models](#talon_filter)\n  * [Creating gene / transcript-level AnnDatas](#talon_adata)\n* [Citing TALON](#talon_cite)\n\nReads must be aligned to the reference genome and oriented in the forward direction (5'-\u003e3') prior to using TALON. 
We recommend the Minimap2 aligner - please see their GitHub page [here](https://github.com/lh3/minimap2) for recommended long-read parameters by technology. Please note that TALON requires the SAM MD tag, so Minimap2 should be run with the --MD flag enabled. In principle, you can use any other long-read alignment software provided that an MD tag is generated.\n\nWe also recommend correcting the aligned reads with [TranscriptClean](https://github.com/mortazavilab/TranscriptClean) to fix artifactual noncanonical splice junctions, though this is not strictly necessary for TALON to run.\n\nTo learn more about how TALON works, please see our preprint on bioRxiv: https://www.biorxiv.org/content/10.1101/672931v1\n\n# \u003ca name=\"installation\"\u003e\u003c/a\u003eInstallation\nNewer versions of TALON (v4.0+) are designed to be run with Python 3.6+.\n\nTo install TALON, simply download the files using GitHub's \"Download ZIP\" button, then unzip them in the directory where you would like to store the program. Alternatively, you can download a specific version of the program from the Releases tab.\n\nGo to the directory and run:\n```\npip install cython\npip install .\n```\nThis will install TALON. You can now run the commands from anywhere.\n\nNOTE: TALON versions 4.2 and lower are not installable. Check the README of those releases to see how you can run the scripts from the install directory, or visit the wiki [here](https://github.com/mortazavilab/TALON/wiki/Archived-TALON-documentation).\n\n\n# \u003ca name=\"how_to_run\"\u003e\u003c/a\u003eHow to run\nFor a small, self-contained example with all necessary files included, see https://github.com/mortazavilab/TALON/tree/master/example\n\n## \u003ca name=\"label_reads\"\u003e\u003c/a\u003eFlagging reads for internal priming\nCurrent long-read platforms that rely on poly-(A) selection are prone to internal priming artifacts. 
These occur when the oligo-dT primer binds off-target to A-rich sequences inside an RNA transcript rather than at the end. Therefore, we recommend running the **`talon_label_reads`** utility on each of your SAM files separately to record the fraction of As in the n-sized window immediately following each read alignment (reference genome sequence). The default n value is 20 bp, but you can adjust this to match the length of the T sequence in your primer if desired. The output of talon_label_reads is a SAM file with the fraction As recorded in the fA:f custom SAM tag. Non-primary alignments are omitted. This SAM file can now be used as your input to the TALON annotator.\n```\nUsage: talon_label_reads [options]\n\nOptions:\n  -h, --help            show this help message and exit\n  --f=SAM_FILE          SAM file of transcripts\n  --g=GENOME_FILE       Reference genome fasta file\n  --t=THREADS           Number of threads to run\n  --ar=FRACA_RANGE_SIZE\n                        Size of post-transcript interval to compute fraction\n                        As on. Default = 20\n  --tmpDir=TMP_DIR      Path to directory for tmp files. Default =\n                        tmp_label_reads\n  --deleteTmp           If this option is set, the temporary directory\n                        generated by the program will be removed at the end of\n                        the run.\n  --o=OUTPREFIX         Prefix for outfiles\n```\n\n## \u003ca name=\"db_init\"\u003e\u003c/a\u003eInitializing a TALON database\nThe first step in using TALON is to initialize a SQLite database from the GTF annotation of your choice (i.e. GENCODE). This step is done using **`talon_initialize_database`**, and only needs to be performed once for your analysis. Keep track of the build and annotation names you choose, as these will be used downstream when running TALON and its utilities.\n\nNOTE: The GTF file you use must contain genes, transcripts, and exons. 
If the file does not contain explicit gene and/or transcript entries, key tables of the database will be empty and you will experience problems in the downstream analysis. Please see our [GTF troubleshooting section](https://github.com/mortazavilab/TALON/wiki/Formatting-a-GTF-annotation-to-work-with-TALON) for help.\n\n```\nUsage: talon_initialize_database [options]\n\nOptions:\n  -h, --help           Show help message and exit\n  --f                  GTF annotation file\n  --g                  The name of the reference genome build that the annotation describes. Use a short and memorable name since you will need to specify the genome build when you run TALON later.\n  --a                  The name of the annotation (for metadata purposes)\n  --l                  Minimum required transcript length (default = 0 bp)\n  --idprefix           Prefix for naming novel discoveries in eventual TALON runs (default = 'TALON')\n  --5p                 Maximum allowable distance (bp) at the 5' end during annotation (default = 500 bp)\n  --3p                 Maximum allowable distance (bp) at the 3' end during annotation (default = 300 bp)\n  --o                  Output prefix for the database\n```\n\n## \u003ca name=\"run_talon\"\u003e\u003c/a\u003eRunning TALON\nNow that you've initialized your database and checked your reads for evidence of internal priming, you're ready to annotate them. The input database is modified in place to track and quantify transcripts in the provided dataset(s). In a talon run, each input SAM read is compared to known and previously observed novel transcript models on the basis of its splice junctions. This allows us to not only assign a novel gene or transcript identity where appropriate, but to track new transcript models and characterize how they differ from known ones. 
The types of novelty assigned are shown in this diagram.\n\u003cimg align=\"left\" width=\"450\" src=\"figs/novelty.png\"\u003e\n\nTo run the **`talon`** annotator, create a comma-delimited configuration file with the following four columns: name, sample description, platform, sam file (full path). There should be one line for each dataset, and dataset names must be unique. If you decide later to add more datasets to an existing analysis, you can do so by creating a new config file for this data and running TALON again on the existing database.\n\nIf you're using the `--cb` option, the dataset names will be pulled from the SAM CB tag, making the first column of the config file unnecessary. Accordingly, TALON expects that when the `--cb` tag is provided, the config file only includes the following: sample description, platform, sam file (full path).\n\nPlease note that TALON versions 4.4+ can be run in multithreaded fashion for a much faster runtime.\n\n```\nusage: talon [-h] [--f CONFIG_FILE] [--cb] [--db FILE,] [--build STRING,]\n             [--threads THREADS] [--cov MIN_COVERAGE]\n             [--identity MIN_IDENTITY] [--nsg] [--o OUTPREFIX]\n\noptional arguments:\n  -h, --help            show this help message and exit  \n  --f CONFIG_FILE       Dataset config file: dataset name, sample description,\n                        platform, sam file (comma-delimited)  \n  --db FILE,            TALON database. Created using\n                        talon_initialize_database\n  --cb                  Use cell barcode tags to determine dataset. Useful for\n                        single-cell data. Requires 3-entry config file.\n  --build STRING,       Genome build (i.e. hg38) to use. Must be in the\n                        database.\n  --threads THREADS, -t THREADS\n                        Number of threads to run program with.\n  --cov MIN_COVERAGE, -c MIN_COVERAGE\n                        Minimum alignment coverage in order to use a SAM\n                        entry. 
Default = 0.9\n  --identity MIN_IDENTITY, -i MIN_IDENTITY\n                        Minimum alignment identity in order to use a SAM\n                        entry. Default = 0.8\n  --nsg, --create_novel_spliced_genes\n                        Make novel genes with the intergenic novelty label for\n                        transcripts that don't share splice junctions with any\n                        other models\n  --tmpDir\n                        Path to directory for tmp files. Default = `talon_tmp/`\n  --o OUTPREFIX         Prefix for output files\n\n```\nTALON generates two output files in the course of a run. The QC log (file with suffix **`'QC.log'`**) is useful for tracking why a particular read was or was not included in the TALON analysis.\n\u003cdetails\u003e\n\u003csummary\u003eQC log format\u003c/summary\u003e  \n\nColumns:  \n1. dataset  \t\n2. read_ID  \t\n3. passed_QC (1/0)  \t\n4. primary_mapped (1/0)  \n5. read_length\n6. fraction_aligned\n7. Identity\n\n\u003c/details\u003e\n\nThe second output file (suffix **`'read_annot.tsv'`**) appears at the very end of the run and contains a line for every read that was successfully annotated.\n\u003cdetails\u003e\n\u003csummary\u003eRead annotation file format\u003c/summary\u003e\n\nColumns:  \n1. Name of individual read  \n2. Name of dataset the read belongs to  \n3. Name of genome build used in TALON run  \n4. Chromosome  \n5. Read start position (1-based). This refers to the 5' end start, so for reads on the - strand, this number will be larger than the read end (col 6).  \n6. Read end position (1-based). This refers to the 3' end stop, so for reads on the - strand, this will be smaller than the read start (col 5).  \n7. Strand (+ or -)  \n8. Number of exons in the transcript  \n9. Read length (soft-clipped bases not included)  \n10. Gene ID (issued by TALON, integer)  \n11. Transcript ID (issued by TALON, integer)  \n12. Annotation gene ID\n13. Annotation transcript ID\n14. 
Annotation gene name (human-readable gene symbol)  \n15. Annotation transcript name (human-readable transcript symbol)  \n16. Gene novelty: one of \"Known\", \"Antisense\", or \"Intergenic\".   \n17. Transcript novelty: one of \"Known\", \"ISM\", \"NIC\", \"NNC\", \"Antisense\", \"Intergenic\", or \"Genomic\".   \n18. ISM subtype. If transcript novelty is not ISM, this field will be 'None'. If the transcript is an ISM, then this field can be 'Prefix', 'Suffix', 'Both', or 'None'.    \n19. fraction_As: From the talon_label_reads step. Records the fraction of As present in the n bases after the read alignment.  \n20. Custom_label: If the user provided a custom SAM flag (lC:Z), the value will be shown here.    \n21. Allelic_label: If the user provided a custom SAM flag (lA:Z) to denote which allele a read came from, the value will be shown here.    \t\n22. Start_support: If the user provided a custom SAM flag (tS:Z) to denote external evidence for the start site of a read, the value will be shown here.   \n23. End_support: If the user provided a custom SAM flag (tE:Z) to denote external evidence for the end site of a read, the value will be shown here.  \n\n\u003c/details\u003e\n\nIt is also possible to obtain this file from a TALON database at any time by running the **`talon_fetch_reads`** utility.\n```\nUsage: talon_fetch_reads [-h] [--db FILE,] [--build STRING,]\n                         [--datasets STRING,] [--o OUTPREFIX]\n\noptional arguments:\n  -h, --help          show this help message and exit\n  --db FILE,          TALON database\n  --build STRING,     Genome build (i.e. hg38) to use. 
Must be in the\n                      database.\n  --datasets STRING,  Optional: Comma-delimited list of datasets to include.\n                      Default behavior is to include all datasets in the\n                      database.\n  --o OUTPREFIX       Prefix for output files\n```\n\n# \u003ca name=\"talon_utils\"\u003e\u003c/a\u003eWorking with the TALON results\n\n## \u003ca name=\"talon_abundance\"\u003e\u003c/a\u003eAccessing abundance information\n\nThe **`talon_abundance`** module can be used to extract a raw or filtered transcript count matrix from your TALON database. Each row of this file represents a transcript detected by TALON in one or more of your datasets. To generate a file suitable for gene expression analysis, skip the --whitelist option (i.e. make an unfiltered abundance file). To generate a file for isoform-level analysis, please see the next section to generate a whitelist file to use.\n```\nUsage: talon_abundance [options]\n\nOptions:\n  -h, --help            show this help message and exit\n  --db=FILE             TALON database\n  -a ANNOT, --annot=ANNOT\n                        Which annotation version to use. Will determine which\n                        annotation transcripts are considered known or novel\n                        relative to. Note: must be in the TALON database.\n  --whitelist=FILE      Whitelist file of transcripts to include in the\n                        output. First column should be TALON gene ID,\n                        second column should be TALON transcript ID\n  -b BUILD, --build=BUILD\n                        Genome build to use. Note: must be in the TALON\n                        database.\n  -d FILE, --datasets=FILE\n                        Optional: A file indicating which datasets should be\n                        included (one dataset name per line). 
Default is to\n                        include all datasets.\n  --o=FILE              Prefix for output file\n```\n\nPlease note that to run this utility, you must provide genome build (-b) and annotation (-a) names that match those given to talon_initialize_database; otherwise, it will not run.\n\n\u003cdetails\u003e\n\u003csummary\u003eAbundance file format\u003c/summary\u003e\n\nThe columns in the abundance file are as follows:\n1. TALON gene ID\n2. TALON transcript ID\n3. Gene ID from your annotation of choice. If the gene is novel relative to that annotation, this will be the TALON-derived name.\n4. Transcript ID from your annotation of choice. If the transcript is novel relative to that annotation, this will be the TALON-derived name.\n5. Gene name from your annotation of choice (makes the file a bit more human-readable!). If the gene is novel relative to that annotation, this will be the TALON-derived name.\n6. Transcript name from your annotation of choice. If the transcript is novel relative to that annotation, this will be the TALON-derived name.\n7. Number of exons in the transcript\n8. Length of transcript model (nucleotides). Note: For known transcripts, this will be the length of the model as defined in the annotation. The actual reads that matched it may not be that length. For actual read lengths, see the read_annot output file.\n9. Gene novelty (Known, Antisense, Intergenic)\n10. Transcript status (Known, ISM, NIC, NNC, Antisense, Intergenic)\n11. 
ISM subtype (Both, Prefix, Suffix, None)  \n**---------------------------- Remaining columns -----------------------------**  \nOne column per dataset, with a count indicating how many times the current transcript was observed in that dataset.\n\n\u003c/details\u003e\n\n## \u003ca name=\"talon_filter\"\u003e\u003c/a\u003eFiltering your transcriptome for isoform-level analysis\n\nBefore quantifying your results on the isoform level, it is important to filter the novel transcript models because long-read platforms are prone to several forms of artifacts. The most effective experimental design for filtering is to use biological replicates. Some limited filtering is possible even for singlet datasets, but keep in mind that this is likely to be far less effective.\n\nThe **`talon_filter_transcripts`** module generates a whitelist of transcripts that are either:  \na) Known  \nb) Observed at least n times in each of k datasets.  \nThe default value for n is 5 and the default for k is the total number of datasets you provide for filtering. In addition, the filter requires that all n reads used to support a novel transcript must not have evidence of internal priming (default: internal priming defined as \u003e 0.5 fraction As). If you wish to disregard internal priming, set --maxFracA to 1 (not generally recommended).\n```\nUsage: talon_filter_transcripts [options]\n\nOptions:\n  -h, --help            show this help message and exit\n  --db=FILE             TALON database\n  -a ANNOT, --annot=ANNOT\n                        Which annotation version to use. Will determine which\n                        annotation transcripts are considered known or novel\n                        relative to. Note: must be in the TALON database.\n  --includeAnnot        Include all transcripts from the annotation, regardless\n                        of if they were observed in the data.\n  --datasets=DATASETS   Datasets to include. 
Can be provided as a comma-\n                        delimited list on the command line, or as a file with\n                        one dataset per line. If this option is omitted, all\n                        datasets will be included.\n  --maxFracA=MAX_FRAC_A\n                        Maximum fraction of As to allow in the window located\n                        immediately after any read assigned to a novel\n                        transcript (helps to filter out internal priming\n                        artifacts). Default = 0.5. Use 1 if you prefer to not\n                        filter out internal priming.\n  --minCount=MIN_COUNT  Number of minimum occurrences required for a novel\n                        transcript PER dataset. Default = 5\n  --minDatasets=MIN_DATASETS\n                        Minimum number of datasets novel transcripts must be\n                        found in. Default = all datasets provided\n  --allowGenomic        If this option is set, transcripts from the Genomic\n                        novelty category will be permitted in the output\n                        (provided they pass the thresholds). Default behavior\n                        is to filter out genomic transcripts since they are\n                        unlikely to be real novel isoforms.\n  --o=FILE              Outfile name\n\n```\nThe columns in the resulting output file are:\n1. TALON gene ID (an integer). This is the same type of ID found in column 1 of TALON abundance files.\n2. TALON transcript ID (an integer). 
This is the same type of ID found in column 2 of TALON abundance files.\n\n## Obtaining a custom GTF transcriptome annotation from a TALON database\n\nYou can use the **`talon_create_GTF`** utility to extract a GTF-formatted annotation from the TALON database.\n```\nUsage: talon_create_GTF [options]\n\nOptions:\n  -h, --help            show this help message and exit\n  --db=FILE             TALON database\n  -b BUILD, --build=BUILD\n                        Genome build to use. Note: must be in the TALON\n                        database.\n  -a ANNOT, --annot=ANNOT\n                        Which annotation version to use. Will determine which\n                        annotation transcripts are considered known or novel\n                        relative to. Note: must be in the TALON database.\n  --whitelist=FILE      Whitelist file of transcripts to include in the\n                        output. First column should be TALON gene ID,\n                        second column should be TALON transcript ID\n  --observed            If this option is set, the GTF file will only\n                        include transcripts that were observed in at least one\n                        dataset (redundant if dataset file provided).\n  -d FILE, --datasets=FILE\n                        Optional: A file indicating which datasets should be\n                        included (one dataset name per line). 
Default is to\n                        include all datasets.\n  --o=FILE              Prefix for output GTF\n```\n\nPlease note that to run this utility, you must provide genome build (-b) and annotation (-a) names that match those given to `talon_initialize_database`; otherwise, it will not run.\n\n## \u003ca name=\"talon_adata\"\u003e\u003c/a\u003eCreating a TALON AnnData object\n\nUsers who have single-cell data or who prefer the [AnnData format](https://anndata.readthedocs.io/en/latest/) for accessing abundance information can run the **`talon_create_adata`** utility. This utility produces an AnnData object with counts information in sparse matrix format for each transcript, so it is also helpful if the abundance files start to get very large.\n\n```\nUsage: talon_create_adata [options]\n\nOptions:\n  -h, --help            show this help message and exit\n  --db=FILE             TALON database\n  -a ANNOT, --annot=ANNOT\n                        Which annotation version to use. Will determine which\n                        annotation transcripts are considered known or novel\n                        relative to. Note: must be in the TALON database.\n  --pass_list=FILE      Pass list file of transcripts to include in the\n                        output. First column should be TALON gene ID,\n                        second column should be TALON transcript ID\n  -b BUILD, --build=BUILD\n                        Genome build to use. Note: must be in the TALON\n                        database.\n  --gene                Output AnnData on the gene level rather than the\n                        transcript\n  -d FILE, --datasets=FILE\n                        Optional: A file indicating which datasets should be\n                        included (one dataset name per line). 
Default is to\n                        include all datasets.\n  --o=FILE              Output .h5ad file name\n```\n\n# \u003ca name=\"talon_cite\"\u003e\u003c/a\u003eCiting TALON\nPlease cite our preprint when using TALON:  \n\n*A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification.*\nDana Wyman, Gabriela Balderrama-Gutierrez, Fairlie Reese, Shan Jiang, Sorena Rahmanian, Weihua Zeng, Brian Williams, Diane Trout, Whitney England, Sophie Chu, Robert C. Spitale, Andrea J. Tenner, Barbara J. Wold, Ali Mortazavi\nbioRxiv 672931; doi: https://doi.org/10.1101/672931\n\n# License\nMIT, see LICENSE\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmortazavilab%2FTALON","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmortazavilab%2FTALON","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmortazavilab%2FTALON/lists"}