https://github.com/pachterlab/kb_python
A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
- Host: GitHub
- URL: https://github.com/pachterlab/kb_python
- Owner: pachterlab
- License: bsd-2-clause
- Created: 2019-10-14T23:50:12.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-07-15T16:06:15.000Z (4 months ago)
- Last Synced: 2024-07-23T11:03:49.572Z (3 months ago)
- Topics: bustools, kallisto, kb-python, rna-velocity-estimation, scrna-seq, single-cell-rna-seq
- Language: Python
- Homepage: https://www.kallistobus.tools/
- Size: 178 MB
- Stars: 141
- Watchers: 12
- Forks: 24
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-single-cell - kb-python - [Python] - `kb-python` is a python package for processing single-cell RNA-sequencing. It wraps the [`kallisto` | `bustools`](https://www.kallistobus.tools) single-cell RNA-seq command line tools in order to unify multiple processing workflows. (Software packages / RNA-seq)
README
# kb-python
![github version](https://img.shields.io/badge/Version-0.28.0-informational)
[![pypi version](https://img.shields.io/pypi/v/kb-python)](https://pypi.org/project/kb-python/0.28.0/)
![python versions](https://img.shields.io/pypi/pyversions/kb_python)
![status](https://github.com/pachterlab/kb_python/workflows/CI/badge.svg)
[![codecov](https://codecov.io/gh/pachterlab/kb_python/branch/master/graph/badge.svg)](https://codecov.io/gh/pachterlab/kb_python)
[![pypi downloads](https://img.shields.io/pypi/dm/kb-python)](https://pypi.org/project/kb-python/)
[![docs](https://readthedocs.org/projects/kb-python/badge/?version=latest)](https://kb-python.readthedocs.io/en/latest/?badge=latest)
[![license](https://img.shields.io/pypi/l/kb-python)](LICENSE)

`kb-python` is a Python package for processing single-cell RNA-sequencing data. It wraps the [`kallisto` | `bustools`](https://www.kallistobus.tools) single-cell RNA-seq command line tools in order to unify multiple processing workflows.

`kb-python` was developed by [Kyung Hoi (Joseph) Min](https://twitter.com/lioscro) and [A. Sina Booeshaghi](https://twitter.com/sinabooeshaghi) while in [Lior Pachter](https://twitter.com/lpachter)'s lab at Caltech. If you use `kb-python` in a publication please [cite*](#cite):
```
Melsted, P., Booeshaghi, A.S., et al.
Modular, efficient and constant-memory single-cell RNA-seq preprocessing.
Nat Biotechnol 39, 813–818 (2021).
https://doi.org/10.1038/s41587-021-00870-2
```

## Installation

The latest release can be installed with

```bash
pip install kb-python
```

The development version can be installed with

```bash
pip install git+https://github.com/pachterlab/kb_python
```

There are no prerequisite packages to install. The `kallisto` and `bustools` binaries are included with the package.

## Usage
`kb` consists of four subcommands:

```bash
$ kb
usage: kb [-h] [--list] ...
positional arguments:
info Display package and citation information
compile Compile `kallisto` and `bustools` binaries from source
ref Build a kallisto index and transcript-to-gene mapping
count Generate count matrices from a set of single-cell FASTQ files
```

### `kb ref`: generate a pseudoalignment index

The `kb ref` command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before `kb count`. Internally, `kb ref` extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with `kallisto index`.
```bash
kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
```

- `<GENOME>` refers to a genome file (FASTA).
  - For example, the zebrafish genome is hosted by [Ensembl](https://uswest.ensembl.org/Danio_rerio/Info/Index) and can be downloaded [here](http://ftp.ensembl.org/pub/release-107/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz).
- `<GENOME_ANNOTATION>` refers to a genome annotation file (GTF).
  - For example, the zebrafish genome annotation file is hosted by [Ensembl](https://uswest.ensembl.org/Danio_rerio/Info/Index) and can be downloaded [here](http://ftp.ensembl.org/pub/release-107/gtf/danio_rerio/Danio_rerio.GRCz11.107.gtf.gz).
- **Note:** The latest genome annotation and genome file for every species on Ensembl can be found with the [`gget`](https://github.com/pachterlab/gget) command-line tool.

Prebuilt indices are available at https://github.com/pachterlab/kallisto-transcriptome-indices

#### Examples
```bash
# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
# An example for downloading a prebuilt reference for mouse
$ kb ref -d mouse -i index.idx -g t2g.txt
```
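
When `kb ref` is driven from a Python pipeline, the same invocations can be assembled as argument lists. The following is a minimal sketch and not part of the `kb-python` API: the helper name `build_kb_ref_command` and its parameter names are hypothetical, and it uses only the flags shown in the examples above (`-i`, `-g`, `-f1`, `-d`).

```python
def build_kb_ref_command(index, t2g, cdna_fasta=None, genome=None, gtf=None, download=None):
    """Assemble a `kb ref` argument list (hypothetical helper, not part of kb-python)."""
    cmd = ["kb", "ref", "-i", index, "-g", t2g]
    if download is not None:
        # Prebuilt reference (e.g. download="mouse"): no FASTA or GTF is passed.
        cmd[2:2] = ["-d", download]
        return cmd
    if not (cdna_fasta and genome and gtf):
        raise ValueError("building an index needs cdna_fasta, genome and gtf")
    # Output transcriptome FASTA (-f1), then the genome FASTA and GTF as positional arguments.
    return cmd + ["-f1", cdna_fasta, genome, gtf]


cmd = build_kb_ref_command(
    "index.idx",
    "t2g.txt",
    cdna_fasta="transcriptome.fa",
    genome="genome.fa.gz",
    gtf="annotation.gtf.gz",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `kb` is on the `PATH`. Building a list rather than a shell string avoids quoting problems with file paths.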
---
### `kb count`: pseudoalign and count reads

The `kb count` command takes in the pseudoalignment index (built with `kb ref`) and sequencing reads generated by a sequencing machine to generate a count matrix. Internally, `kb count` runs numerous [`kallisto`](https://github.com/pachterlab/kallisto) and [`bustools`](https://github.com/BUStools/bustools/) commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.

```bash
kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
```

- `<TECHNOLOGY>` refers to the assay that generated the sequencing reads.
  - For a list of supported assays run `kb --list`.
- `<FASTQ FILE[s]>` refers to the list of FASTQ files generated.
  - Different assays will have a different number of FASTQ files.
  - Different assays will place the different features in different FASTQ files.
    - For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files.
    - The `R1.fastq.gz` file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
    - The `R2.fastq.gz` file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.

#### Examples

```bash
# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz
```
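
To make the 10xv3 read layout described above concrete, here is a small illustrative sketch that splits a read 1 sequence into its cell barcode and UMI. It assumes the barcode occupies the first 16 bases and the UMI the following 12. The function name and the example read are made up, and this is not how `kb count` processes reads internally (that work is done by `kallisto` and `bustools`).

```python
CELL_BARCODE_LEN = 16  # 10xv3 cell barcode length, as stated above
UMI_LEN = 12           # 10xv3 UMI length, as stated above


def split_10xv3_read1(sequence):
    """Return (cell_barcode, umi) from a 10xv3 read 1 sequence.

    Assumes the barcode comes first and the UMI immediately after it.
    """
    needed = CELL_BARCODE_LEN + UMI_LEN
    if len(sequence) < needed:
        raise ValueError(f"read 1 has {len(sequence)} bases, expected at least {needed}")
    barcode = sequence[:CELL_BARCODE_LEN]
    umi = sequence[CELL_BARCODE_LEN:needed]
    return barcode, umi


# Made-up 28-base read, for illustration only.
read1 = "AAACCCAAGAAACACT" + "GATTACAGATTA"
barcode, umi = split_10xv3_read1(read1)
print(barcode, umi)
```

Reads that share both a barcode and a UMI are treated as copies of the same captured molecule, which is why read 1 is kept separate from the cDNA in read 2.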
---
### `kb info`: display package and citation information

The `kb info` command prints out package information including the version of `kb-python`, `kallisto`, and `bustools` along with their installation location.

```bash
$ kb info
kb_python 0.28.0 ...
kallisto: 0.50.1 ...
bustools: 0.43.1 ...
...
```
---
### `kb compile`: compile `kallisto` and `bustools` binaries from source
The `kb compile` command grabs the latest `kallisto` and `bustools` source and compiles the binaries. **Note**: this is not required to run `kb-python`.

## Use cases

`kb-python` facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.
```bash
$ pip install kb-python gget ffq

# Goal: quantify publicly available scRNAseq data
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder

# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder
```
Submitted by [@sbooeshaghi](https://github.com/sbooeshaghi/).

Do you have a cool use case for `kb-python`? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.

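
The feature barcode use case above expects a tab-delimited `feature_barcodes.txt` holding a barcode sequence and a barcode name per row. The sketch below shows one way to produce such a file from Python. It assumes one feature per line with no header row and only `A`, `C`, `G`, `T` in the sequences. The helper name and the example barcodes are hypothetical.

```python
import csv
import io


def write_feature_barcodes(features, handle):
    """Write (barcode_sequence, barcode_name) pairs as tab-delimited rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sequence, name in features:
        sequence = sequence.upper()
        # Reject empty sequences and anything outside the assumed ACGT alphabet.
        if not sequence or set(sequence) - set("ACGT"):
            raise ValueError(f"unexpected barcode sequence: {sequence!r}")
        writer.writerow([sequence, name])


# Hypothetical feature barcodes, for illustration only.
features = [("AACAAGACCCTTGAG", "feature_A"), ("TACCCGTAATAGCGT", "feature_B")]
buffer = io.StringIO()
write_feature_barcodes(features, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("feature_barcodes.txt", "w", newline="")`.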
## Tutorials
For a list of tutorials that use `kb-python` please see [https://www.kallistobus.tools/](https://www.kallistobus.tools/).

## Documentation

Developer documentation is hosted on [Read the Docs](https://kb-python.readthedocs.io/en/latest/).

## Contributing

Thank you for wanting to improve `kb-python`! If you believe you've found a bug, please submit an issue.

If you have a new feature you'd like to add to `kb-python`, please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.

# Cite
If you use `kb-python` in a publication, please cite the following papers:

`kb-python` & `kallisto` and/or `bustools`

```tex
@article{sullivan2023kallisto,
title={kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq},
author={Sullivan, Delaney K and Min, Kyung Hoi and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Luebbert, Laura and Holley, Guillaume and Moses, Lambda and Gustafsson, Johan and Bray, Nicolas L and Pimentel, Harold and Booeshaghi, A Sina and others},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
```

`bustools`

```tex
@article{melsted2021modular,
title={\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},
author={Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
author+an={1=first;2=first,highlight},
journal={Nature biotechnology},
year={2021},
month={4},
day={1},
doi={https://doi.org/10.1038/s41587-021-00870-2}
}
```

`kallisto`

```tex
@article{bray2016near,
title={Near-optimal probabilistic RNA-seq quantification},
author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
journal={Nature biotechnology},
volume={34},
number={5},
pages={525--527},
year={2016},
publisher={Nature Publishing Group}
}
```

`kITE`

```tex
@article{booeshaghi2024quantifying,
title={Quantifying orthogonal barcodes for sequence census assays},
author={Booeshaghi, A Sina and Min, Kyung Hoi and Gehring, Jase and Pachter, Lior},
journal={Bioinformatics Advances},
volume={4},
number={1},
pages={vbad181},
year={2024},
publisher={Oxford University Press}
}
```

`BUS` format

```tex
@article{melsted2019barcode,
title={The barcode, UMI, set format and BUStools},
author={Melsted, P{\'a}ll and Ntranos, Vasilis and Pachter, Lior},
journal={Bioinformatics},
volume={35},
number={21},
pages={4472--4473},
year={2019},
publisher={Oxford University Press}
}
```

`kb-python` was inspired by Sten Linnarsson’s `loompy fromfq` command (http://linnarssonlab.org/loompy/kallisto/index.html).