{"id":13675841,"url":"https://github.com/pachterlab/kb_python","last_synced_at":"2025-05-16T03:05:27.474Z","repository":{"id":35614184,"uuid":"215165758","full_name":"pachterlab/kb_python","owner":"pachterlab","description":"A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing","archived":false,"fork":false,"pushed_at":"2025-05-11T04:07:51.000Z","size":238644,"stargazers_count":161,"open_issues_count":4,"forks_count":24,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-05-11T05:18:25.864Z","etag":null,"topics":["bustools","kallisto","kb-python","rna-velocity-estimation","scrna-seq","single-cell-rna-seq"],"latest_commit_sha":null,"homepage":"https://www.kallistobus.tools/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pachterlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-10-14T23:50:12.000Z","updated_at":"2025-05-11T04:06:03.000Z","dependencies_parsed_at":"2023-10-20T16:02:07.317Z","dependency_job_id":"a514816c-51a7-4a05-a428-475c0646472b","html_url":"https://github.com/pachterlab/kb_python","commit_stats":{"total_commits":196,"total_committers":5,"mean_commits":39.2,"dds":"0.48469387755102045","last_synced_commit":"14a18ef36a943160378ccbdf12a0fd6b6b84e449"},"previous_names":[],"tags_count":38,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachterlab%2Fkb_python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachterlab%2Fkb_pyth
on/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachterlab%2Fkb_python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pachterlab%2Fkb_python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pachterlab","download_url":"https://codeload.github.com/pachterlab/kb_python/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253519694,"owners_count":21921211,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bustools","kallisto","kb-python","rna-velocity-estimation","scrna-seq","single-cell-rna-seq"],"created_at":"2024-08-02T12:01:04.913Z","updated_at":"2025-05-16T03:05:22.464Z","avatar_url":"https://github.com/pachterlab.png","language":"Python","readme":"# kb-python\n![github version](https://img.shields.io/badge/Version-0.29.2-informational)\n[![pypi version](https://img.shields.io/pypi/v/kb-python)](https://pypi.org/project/kb-python/0.29.2/)\n![python versions](https://img.shields.io/pypi/pyversions/kb_python)\n![status](https://github.com/pachterlab/kb_python/workflows/CI/badge.svg)\n[![codecov](https://codecov.io/gh/pachterlab/kb_python/branch/master/graph/badge.svg)](https://codecov.io/gh/pachterlab/kb_python)\n[![pypi 
downloads](https://img.shields.io/pypi/dm/kb-python)](https://pypi.org/project/kb-python/)\n[![docs](https://readthedocs.org/projects/kb-python/badge/?version=latest)](https://kb-python.readthedocs.io/en/latest/?badge=latest)\n[![license](https://img.shields.io/pypi/l/kb-python)](LICENSE)\n\n`kb-python` is a Python package for processing single-cell RNA-sequencing data. It wraps the [`kallisto` | `bustools`](https://www.kallistobus.tools) single-cell RNA-seq command-line tools to unify multiple processing workflows. \n\n`kb-python` was first developed by [Kyung Hoi (Joseph) Min](https://twitter.com/lioscro) and [A. Sina Booeshaghi](https://twitter.com/sinabooeshaghi) while in [Lior Pachter](https://twitter.com/lpachter)'s lab at Caltech. If you use `kb-python` in a publication, please [cite*](#cite):\n```\nMelsted, P., Booeshaghi, A.S., et al. \nModular, efficient and constant-memory single-cell RNA-seq preprocessing. \nNat Biotechnol 39, 813–818 (2021). \nhttps://doi.org/10.1038/s41587-021-00870-2\n```\n\n## Installation\nThe latest release can be installed with\n\n```bash\npip install kb-python\n```\n\nThe development version can be installed with\n```bash\npip install git+https://github.com/pachterlab/kb_python\n```\n\nThere are no prerequisite packages to install. 
The `kallisto` and `bustools` binaries are included with the package.\n\n## Usage\n\n`kb` consists of five subcommands:\n```bash\n$ kb\nusage: kb [-h] [--list] \u003cCMD\u003e ...\npositional arguments:\n  \u003cCMD\u003e\n    info      Display package and citation information\n    compile   Compile `kallisto` and `bustools` binaries from source\n    ref       Build a kallisto index and transcript-to-gene mapping\n    count     Generate count matrices from a set of single-cell FASTQ files\n    extract   Extract reads that were pseudoaligned to specific genes/transcripts (or extract all reads that were / were not pseudoaligned)\n```\n\n### `kb ref`: generate a pseudoalignment index\n\nThe `kb ref` command takes a species annotation file (GTF) and the associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before `kb count`. Internally, `kb ref` extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with `kallisto index`.\n\n```bash\nkb ref -i index.idx -g t2g.txt -f1 transcriptome.fa \u003cGENOME\u003e \u003cGENOME_ANNOTATION\u003e\n```\n- `\u003cGENOME\u003e` refers to a genome file (FASTA).\n\t- For example, the zebrafish genome is hosted by [Ensembl](https://uswest.ensembl.org/Danio_rerio/Info/Index) and can be downloaded [here](http://ftp.ensembl.org/pub/release-107/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz).\n- `\u003cGENOME_ANNOTATION\u003e` refers to a genome annotation file (GTF).\n\t- For example, the zebrafish genome annotation file is hosted by [Ensembl](https://uswest.ensembl.org/Danio_rerio/Info/Index) and can be downloaded [here](http://ftp.ensembl.org/pub/release-107/gtf/danio_rerio/Danio_rerio.GRCz11.107.gtf.gz).\n- **Note:** The latest genome annotation and genome file for every species on Ensembl can be found with the [`gget`](https://github.com/pachterlab/gget) command-line tool.\n\nPrebuilt indices are available at 
https://github.com/pachterlab/kallisto-transcriptome-indices\n\n#### Examples\n```bash\n# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)\n$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz\n# An example for downloading a prebuilt reference for mouse\n$ kb ref -d mouse -i index.idx -g t2g.txt\n```\n---\n### `kb count`: pseudoalign and count reads\n\nThe `kb count` command takes the pseudoalignment index (built with `kb ref`) and the reads produced by a sequencing machine, and generates a count matrix. Internally, `kb count` runs numerous [`kallisto`](https://github.com/pachterlab/kallisto) and [`bustools`](https://github.com/BUStools/bustools/) commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.\n\n```bash\nkb count -i index.idx -g t2g.txt -o out/ -x \u003cTECHNOLOGY\u003e \u003cFASTQ FILE[s]\u003e\n```\n- `\u003cTECHNOLOGY\u003e` refers to the assay that generated the sequencing reads.\n\t- For a list of supported assays, run `kb --list`.\n- `\u003cFASTQ FILE[s]\u003e` refers to the list of FASTQ files generated by the sequencing run.\n\t- Different assays will have a different number of FASTQ files.\n\t- Different assays will place the different features in different FASTQ files.\n\t\t- For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files. \n\t\t- The `R1.fastq.gz` file (colloquially called \"read 1\") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI). 
\n\t\t- The `R2.fastq.gz` file (colloquially called \"read 2\") contains the cDNA associated with the cell barcode-UMI pair in read 1.\n\n#### Examples\n```bash\n# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz\n$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz\n```\n---\n### `kb info`: display package and citation information\n\nThe `kb info` command prints out package information including the version of `kb-python`, `kallisto`, and `bustools` along with their installation location.\n\n```bash\n$ kb info\nkb_python 0.29.2 ...\nkallisto: 0.51.1 ...\nbustools: 0.44.1 ...\n...\n```\n---\n### `kb compile`: compile `kallisto` and `bustools` binaries from source\nThe `kb compile` command grabs the latest `kallisto` and `bustools` source and compiles the binaries. **Note**: this is not required to run `kb-python`.\n\n## Use cases\n`kb-python` facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions. \n```bash\n$ pip install kb-python gget ffq\n\n# Goal: quantify publicly available scRNAseq data\n$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)\n$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\\n' ' ')\n# -\u003e count matrix in out/ folder\n\n# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file\n# containing barcode_sequence\u003ctab\u003ebarcode_name\n$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt\n$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz\n# -\u003e count matrix in out/ folder\n```\nSubmitted by [@sbooeshaghi](https://github.com/sbooeshaghi/).\n\nDo you have a cool use case for `kb-python`? 
Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.\n\n## Tutorials\nFor a list of tutorials that use `kb-python` please see [https://www.kallistobus.tools/](https://www.kallistobus.tools/).\n\n## Documentation\nDeveloper documentation is hosted on [Read the Docs](https://kb-python.readthedocs.io/en/latest/).\n\n## Contributing\nThank you for wanting to improve `kb-python`! If you believe you've found a bug, please submit an issue. \n\nIf you have a new feature you'd like to add to `kb-python`, please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.\n\n# Cite\nIf you use `kb-python` in a publication, please cite the following papers:\n\n`kb-python` \u0026 `kallisto` and/or `bustools`\n```\n@article{sullivan2023kallisto,\n  title={kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq},\n  author={Sullivan, Delaney K and Min, Kyung Hoi and Hj{\\\"o}rleifsson, Kristj{\\'a}n Eldj{\\'a}rn and Luebbert, Laura and Holley, Guillaume and Moses, Lambda and Gustafsson, Johan and Bray, Nicolas L and Pimentel, Harold and Booeshaghi, A Sina and others},\n  journal={bioRxiv},\n  pages={2023--11},\n  year={2023},\n  publisher={Cold Spring Harbor Laboratory}\n}\n```\n\n`bustools` \n```tex\n@article{melsted2021modular,\n  title={\\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},\n  author={Melsted, P{\\'a}ll and Booeshaghi, A. 
Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\\\"o}rleifsson, Kristj{\\'a}n Eldj{\\'a}rn and Gehring, Jase and Pachter, Lior},\n  author+an={1=first;2=first,highlight},\n  journal={Nature biotechnology},\n  year={2021},\n  month={4},\n  day={1},\n  doi={https://doi.org/10.1038/s41587-021-00870-2}\n}\n```\n\n`kallisto` \n```tex\n@article{bray2016near,\n  title={Near-optimal probabilistic RNA-seq quantification},\n  author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\\'a}ll and Pachter, Lior},\n  journal={Nature biotechnology},\n  volume={34},\n  number={5},\n  pages={525--527},\n  year={2016},\n  publisher={Nature Publishing Group}\n}\n```\n\n`kITE`\n```tex\n@article{booeshaghi2024quantifying,\n  title={Quantifying orthogonal barcodes for sequence census assays},\n  author={Booeshaghi, A Sina and Min, Kyung Hoi and Gehring, Jase and Pachter, Lior},\n  journal={Bioinformatics Advances},\n  volume={4},\n  number={1},\n  pages={vbad181},\n  year={2024},\n  publisher={Oxford University Press}\n}\n```\n\n`BUS` format\n```tex\n@article{melsted2019barcode,\n  title={The barcode, UMI, set format and BUStools},\n  author={Melsted, P{\\'a}ll and Ntranos, Vasilis and Pachter, Lior},\n  journal={Bioinformatics},\n  volume={35},\n  number={21},\n  pages={4472--4473},\n  year={2019},\n  publisher={Oxford University Press}\n}\n```\n\n`kb-python` was inspired by Sten Linnarsson’s `loompy fromfq` command (http://linnarssonlab.org/loompy/kallisto/index.html)\n","funding_links":[],"categories":["Software packages"],"sub_categories":["RNA-seq"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpachterlab%2Fkb_python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpachterlab%2Fkb_python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpachterlab%2Fkb_python/lists"}