{"id":17652311,"url":"https://github.com/zyxue/kleat","last_synced_at":"2026-01-20T12:34:06.504Z","repository":{"id":141945640,"uuid":"138236988","full_name":"zyxue/kleat","owner":"zyxue","description":"Cleavage site prediction via de novo assembly","archived":false,"fork":false,"pushed_at":"2019-02-21T00:02:32.000Z","size":14645,"stargazers_count":0,"open_issues_count":12,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-07T03:48:37.799Z","etag":null,"topics":["alternative-polyadenylation","cleavage-sites","de-novo-assembly","rna-seq","transcriptome-assembly"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zyxue.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-22T00:53:55.000Z","updated_at":"2019-02-20T23:55:55.000Z","dependencies_parsed_at":"2023-03-13T10:27:11.119Z","dependency_job_id":null,"html_url":"https://github.com/zyxue/kleat","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/zyxue/kleat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fkleat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fkleat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fkleat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fkleat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/zyxue","download_url":"https://codeload.github.com/zyxue/kleat/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zyxue%2Fkleat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28603392,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T12:01:53.233Z","status":"ssl_error","status_checked_at":"2026-01-20T12:01:46.545Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alternative-polyadenylation","cleavage-sites","de-novo-assembly","rna-seq","transcriptome-assembly"],"created_at":"2024-10-23T11:46:31.365Z","updated_at":"2026-01-20T12:34:06.488Z","avatar_url":"https://github.com/zyxue.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"======\nKLEAT\n======\n\n\n.. image:: https://img.shields.io/pypi/v/kleat.svg\n        :target: https://pypi.python.org/pypi/kleat\n\n.. image:: https://img.shields.io/travis/zyxue/kleat.svg\n        :target: https://travis-ci.org/zyxue/kleat\n\n.. image:: https://coveralls.io/repos/github/zyxue/kleat/badge.svg\n        :target: https://coveralls.io/github/zyxue/kleat\n\n.. 
image:: https://readthedocs.org/projects/kleat/badge/?version=latest\n        :target: https://kleat.readthedocs.io/en/latest/?badge=latest\n        :alt: Documentation Status\n\nCleavage site prediction via de novo assembly.\n\n**Note**: this is a from-scratch reimplementation of the method described in the following paper (PMID: 25592595_):\n\n.. _25592595: https://www.ncbi.nlm.nih.gov/pubmed/25592595\n\n- Birol I, Raymond A, Chiu R, Nip KM, Jackman SD, Kreitzman M, et al. KLEAT:\n  cleavage site analysis of transcriptomes. Pac Symp Biocomput. World\n  Scientific. 2015;:347–58.\n\nKLEAT works in three steps:\n\n1. Collect polyA evidence (suffix, bridge, link, and blank) and the\n   supporting clvs (cleavage sites) per contig.\n2. Aggregate polyA evidence per clv, identified by (seqname, strand, clv).\n3. Find the closest annotated clv for each predicted clv and calculate the distance between them.\n\nSuffix was named tail in the original paper, but tail is easily confused with the tail of a\nbridge contig, so it is renamed to suffix in the code\nbase.\n\nBlank indicates a contig without any indication of polyA. Such a contig can still\nsupport predicting a cleavage site because, if a transcript is expressed, a\ncleavage site likely exists nearby.\n\n..\n   memo: adding hyperlink to a sentence is really awkward in rst!\n\n..\n   * Documentation: https://kleat.readthedocs.io.\n\nInstall\n--------\n\nKLEAT (\u003e3.0.0) supports only Python 3 (\u003e=3.4). Key dependencies include\npysam_, pandas_, and scikit-learn_.\n\n.. _pysam: https://github.com/pysam-developers/pysam\n.. _pandas: https://github.com/pandas-dev/pandas\n.. _scikit-learn: https://github.com/scikit-learn/scikit-learn\n\nFirst, it's recommended to create a virtual environment, using either\nconda_\n\n.. _conda: https://conda.io/miniconda.html\n\n.. code-block:: bash\n\n   conda create -p venv-kleat python=3\n   source activate venv-kleat/\n\nor pip_ + virtualenv_\n\n.. _pip: https://github.com/pypa/pip\n.. 
_virtualenv: https://github.com/pypa/virtualenv\n\n.. code-block:: bash\n\n   pip install virtualenv # skip this step if virtualenv is available already\n   virtualenv venv-kleat\n   . venv-kleat/bin/activate\n\nThen install kleat with pip\n\n.. code-block:: bash\n\n   pip install git+https://github.com/zyxue/kleat.git#egg=kleat\n\nFeatures\n--------\n\n* Four types of polyA evidence: suffix (previously known as tail), bridge, link, and blank\n* Searches for the PAS hexamer on the contig\n* Hardclipped regions are considered too, and this is well tested.\n* Processes all chromosomes in parallel, and parallelizes other steps as much as\n  possible (e.g. aggregating polyA evidence per clv)\n\nUsage\n-----\n\n* Inputs to `--contigs-to-genome` and `--reads-to-contigs` should both be sorted\n  and indexed with samtools_.\n\n.. _samtools: http://samtools.sourceforge.net/\n\n\nNotes on result interpretation\n------------------------------\n\n`ctg_hex_pos`, the position of the contig PAS hexamer, is not\ninterpretable in terms of the reference genome because there might be insertions and\ndeletions, but `ctg_hex_dist`, the distance between the contig PAS hexamer and the clv, is\ninterpretable.\n  \nWe decided not to remove this column, and to leave it as a sanity check and an indication of\nindels. Indels can also be inferred from the difference between `ctg_hex_dist` and\n`ref_hex_dist` if both exist.\n\n`ctg_hex_pos` may become especially useful when the ref_hex is not found, as it\ncan be used as a rough estimate of the location of the hexamer, which helps to quickly\nlocate the PAS hexamer on the contig in a genome browser (e.g. IGV).\n\nDevelopment\n-----------\n\n.. code-block:: bash\n\n   virtualenv venv\n   . venv/bin/activate\n   pip install -r requirements_dev.txt\n   python setup.py develop\n\n.. 
code-block::\n\n   kleat --help\n\n   usage: kleat [-h] -c CONTIGS_TO_GENOME -r READS_TO_CONTIGS -f REFERENCE_GENOME\n                -a KARBOR_CLV_ANNOTATION [-o OUTPUT] [-m OUTPUT_FORMAT]\n                [-p NUM_CPUS] [--keep-pre-aggregation-tmp-file]\n                [--bridge-skip-check-size BRIDGE_SKIP_CHECK_SIZE]\n                [--cluster-first-then-aggregate]\n                [--cluster-cutoff CLUSTER_CUTOFF]\n\n   KLEAT: cleavage site detection via de novo assembly\n\n   optional arguments:\n     -h, --help            show this help message and exit\n     -c CONTIGS_TO_GENOME, --contigs-to-genome CONTIGS_TO_GENOME\n                           input contig-to-genome alignment BAM file\n     -r READS_TO_CONTIGS, --reads-to-contigs READS_TO_CONTIGS\n                           input read-to-contig alignment BAM file\n     -f REFERENCE_GENOME, --reference-genome REFERENCE_GENOME\n                           reference genome FASTA file, if provided, KLEAT will\n                           search polyadenylation signal (PAS) hexamer in both\n                           contig and reference genome, which is useful for\n                           checking mutations that may affect PAS hexmaer. Note\n                           this fasta file needs to be consistent with the one\n                           used for generating the read-to-contig BAM alignments\n     -a KARBOR_CLV_ANNOTATION, --karbor-clv-annotation KARBOR_CLV_ANNOTATION\n                           the annotated clv pickle formatted for karbor with\n                           (seqname, strand, clv, gene_ids, gene_names) columns\n                           this file is processed from GTF annotation file\n     -o OUTPUT, --output OUTPUT\n                           output tsv file, if not specified, it will use prefix\n                           output, and the extension depends on the value of\n                           --output-format. e.g. 
output.csv, output.pickle, etc.\n     -m OUTPUT_FORMAT, --output-format OUTPUT_FORMAT\n                           also support tsv, pickle (python)\n     -p NUM_CPUS, --num-cpus NUM_CPUS\n                           parallize the step of aggregating polya evidence for\n                           each (seqname, strand, clv)\n     --keep-pre-aggregation-tmp-file\n                           specify this if you would like to keep the tmp file\n                           before aggregating polyA evidence per cleavage site,\n                           mostly for debugging purpose\n     --bridge-skip-check-size BRIDGE_SKIP_CHECK_SIZE\n                           the size beyond which the clv is predicted be on the\n                           next matched region. Otherwise, clv is predicted to be\n                           at the edge of the prevision match. The argument is\n                           added because inconsistent break points are observed\n                           between read (softclip as in r2c alignment) and contig\n                           (boundry between BAM_CMATCH and BAM_CREF_SKIP)\n     --cluster-first-then-aggregate\n                           the default approach is aggregate_polya_evidence -\u003e\n                           filter -\u003e cluster, if this argument is specified, then\n                           the order becomes cluster -\u003e aggregate_polya_evidence\n                           -\u003e filter. The idea is that by clustering first,\n                           neighbouring clvs would result in stronger polyA\n                           evidence signal, but preliminary results show that it\n                           does not make a big difference. 
Also, note that if the\n                           data is noisy, cluster before filter would further\n                           decrease the clv resolution since single-linkage\n                           culstering would combine those noisy clvs in to the\n                           clvs with real signal\n     --cluster-cutoff CLUSTER_CUTOFF\n                           the cutoff for single-linkage clustering\n\nTo uninstall\n\n.. code-block:: bash\n\n   python setup.py develop --uninstall\n\n\nDebug instructions\n------------------\n\nTo debug a particular contig, you could restrict processing to that contig with a patch such as the one below, and insert pdb in place of `pass`\n\n.. code-block::\n\n    @@ -32,6 +32,11 @@ def collect_polya_evidence(c2g_bam, r2c_bam, ref_fa, csvwriter, bridge_skip_chec\n             if contig.is_unmapped:\n                 continue\n\n    +        if contig.query_name == \"\u003cyour contig name\u003e\" and contig.reference_name == \"chrX\":\n    +            pass\n    +        else:\n    +            continue\n    +\n             ascs = []           # already supported clvs\n             rec = process_suffix(\n                 contig, r2c_bam, ref_fa, csvwriter)\n\n\nZero-based index\n----------------\n\nEvery index is 0-based, including in ASCII visualizations such as\n\n.. 
code-block::\n\n   Symbols:\n   --: ref_skip\n   //: hardclip at right\n   \\\\: hardclip at left\n   __: deletion\n   ┬ : insertion\n    └: softclip at left\n    ┘: softclip at right\n\n   Abbreviations:\n    cc: ctg_clv, clv in contig coordinate\n    rc: ref_clv, clv in reference coordinate\n\n   icb: init_clv_beg, initialized beginning index in contig coordinate (for - strand clv)\n   irb: init_ref_beg, initialized beginning index in reference coordinate (for - strand clv)\n\n   ice: init_clv_end, initialized end index in contig coordinate (for + strand clv)\n   ire: init_ref_end, initialized end index in reference coordinate (for + strand clv)\n\n    TTT\n      └AT\n    89012 \u003c- one coord (0-based)\n      1   \u003c- ten coord\n\nwhich differs from the display in IGV, which is 1-based (although its\nunderlying system is still 0-based_).\n\n.. _0-based: https://software.broadinstitute.org/software/igv/IGV.\n\n\nSome key concepts in the code:\n\n- ctg_clv: clv in contig coordinate including clipped regions and indels\n\n- gnm_clv (or ref_clv): clv in genome coordinate\n\n- gnm_offset: ctg_clv converted to genome coordinate with proper handling of skips,\n  clips, and indels, so that gnm_offset can be added to the genome coordinate directly.\n\n\nCredits\n-------\n\nThis package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.\n\n.. _Cookiecutter: https://github.com/audreyr/cookiecutter\n.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzyxue%2Fkleat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzyxue%2Fkleat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzyxue%2Fkleat/lists"}