https://github.com/mvinyard/gtfast
Lift annotations from a GTF file to an adata object.
anndata gencode genomics gff gtf python single-cell
- Host: GitHub
- URL: https://github.com/mvinyard/gtfast
- Owner: mvinyard
- License: mit
- Created: 2022-01-04T16:25:34.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-05-12T15:19:13.000Z (about 3 years ago)
- Last Synced: 2025-02-25T16:15:07.541Z (3 months ago)
- Topics: anndata, gencode, genomics, gff, gtf, python, single-cell
- Language: Python
- Homepage:
- Size: 48.8 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# gtfast

[PyPI](https://pypi.python.org/pypi/gtfast/)
[PyPI version](https://badge.fury.io/py/gtfast)
[Code style: black](https://github.com/psf/black)

### Installation
To install via [pip](https://pypi.org/project/gtfast):
```BASH
pip install gtfast
```

To install the development version:
```BASH
git clone https://github.com/mvinyard/gtfast.git
cd gtfast; pip install -e .
```

## Example usage
### Parsing a `.gtf` file
```python
import gtfast

gtf_filepath = "/path/to/ref/hg38/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/genes/genes.gtf"
```
If this is your first time using `gtfast`, run:
```python
gtf = gtfast.parse(path=gtf_filepath, genes=False, force=False, return_gtf=True)
```
Running this function creates two `.csv` files from the given `.gtf` file: one containing all feature types and one containing only genes. Both files are smaller than the `.gtf` and can be loaded into memory much faster using `pandas.read_csv()` (a shortcut is implemented in the next function). The function also leaves a paper trail so that `gtfast` can find the newly created `.csv` files again later, which means you do not need to pass the path to the `.gtf` a second time.

If you have already run the function above, run:
```python
gtf = gtfast.load() # no path necessary!
```

### Interfacing with [AnnData](https://anndata.readthedocs.io/en/stable/) and updating an `adata.var` table
If you're working with single-cell data, you can easily lift annotations from a **[`gtf`](https://en.wikipedia.org/wiki/Gene_transfer_format)** to your **[`adata`](https://anndata.readthedocs.io/en/stable/)** object.
```python
from anndata import read_h5ad
import gtfast

adata = read_h5ad("/path/to/singlecell/data/adata.h5ad")
gtf = gtfast.load(genes=True)
gtfast.add(adata, gtf)
```

Since the installed `gtfast` distribution already knows where the `.csv` / `.gtf` files are, we could annotate `adata` directly without first loading `gtf` as a DataFrame. That would save a step, but I think it's more user-friendly to see what each one looks like first.
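To make "lifting annotations" concrete, here is a minimal sketch of the idea using plain `pandas`. It is not how `gtfast.add` is implemented: the helper `lift_annotations`, the toy gene IDs, and the column names are all assumptions for illustration. It relies only on the fact that `adata.var` is a `pandas.DataFrame` indexed by gene identifier.

```python
import pandas as pd


def lift_annotations(var: pd.DataFrame, genes: pd.DataFrame, key: str = "gene_id") -> pd.DataFrame:
    """Left-join gene-level annotation columns onto a var-like table indexed by `key`.

    Hypothetical helper for illustration; not part of gtfast.
    """
    # Keep one row per gene so the join cannot duplicate rows of `var`.
    lookup = genes.drop_duplicates(subset=key).set_index(key)
    # Only add columns that `var` does not already have.
    new_cols = [c for c in lookup.columns if c not in var.columns]
    # A left join keeps every row (and the row order) of `var`.
    return var.join(lookup[new_cols], how="left")


# Toy stand-in for `adata.var`: an empty table indexed by made-up gene IDs.
var = pd.DataFrame(index=pd.Index(["G1", "G2", "G3"], name="gene_id"))

# Toy stand-in for the genes-only table loaded from the GTF.
genes = pd.DataFrame(
    {
        "gene_id": ["G2", "G1"],
        "gene_name": ["B", "A"],
        "seqname": ["chr2", "chr1"],
        "start": [500, 100],
    }
)

annotated = lift_annotations(var, genes)
print(annotated)
```

Genes that are absent from the annotation table (here `G3`) keep their row and receive missing values, so the shape of the `var`-like table is preserved.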
### Working advantage
Let's take a look at the time difference of loading a `.gtf` into memory as a `pandas.DataFrame`:
```python
import gtfast
import gtfparse
import time

start = time.time()
gtf = gtfparse.read_gtf("/home/mvinyard/ref/hg38/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/genes/genes.gtf")
stop = time.time()
print("baseline loading time: {:.2f}s".format(stop - start), end='\n\n')

start = time.time()
gtf = gtfast.load()
stop = time.time()
print("GTFast loading time: {:.2f}s".format(stop - start))
```
```
baseline loading time: 87.54s

GTFast loading time: 12.46s
```
~7x speed improvement.

* **Note**: This is not meant as a criticism of [`gtfparse`](https://github.com/openvax/gtfparse). In fact, this library relies solely on `gtfparse` to parse a `.gtf` file into memory as a `pandas.DataFrame`, and it's an amazing tool for Python developers!
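The speedup comes from the pattern described earlier: parse the `.gtf` once, cache the result as a `.csv`, and leave a pointer so that later loads can skip the parse. The sketch below illustrates that pattern with the standard library only. It is not `gtfast`'s implementation; `parse_gtf_line`, `cache_as_csv`, `load_cached`, and the `last_parsed.json` pointer file are hypothetical names, and the toy line parser merely stands in for `gtfparse`.

```python
import csv
import json
import tempfile
from pathlib import Path

FIXED_FIELDS = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame"]


def parse_gtf_line(line):
    """Split one GTF record into its 8 fixed fields plus attribute key/value pairs.

    Simplified: assumes attribute values contain no semicolons.
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_FIELDS, fields[:8]))
    for item in fields[8].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            record[key] = value.strip('"')
    return record


def cache_as_csv(gtf_path, cache_dir):
    """Parse the GTF once, write it as CSV, and record where the CSV lives."""
    gtf_path, cache_dir = Path(gtf_path), Path(cache_dir)
    with gtf_path.open() as fh:
        rows = [parse_gtf_line(l) for l in fh if l.strip() and not l.startswith("#")]
    columns = sorted({key for row in rows for key in row})
    csv_path = cache_dir / (gtf_path.stem + ".csv")
    with csv_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    # The "paper trail": a small pointer file so later loads need no path.
    (cache_dir / "last_parsed.json").write_text(json.dumps({"csv": str(csv_path)}))
    return csv_path


def load_cached(cache_dir):
    """Follow the pointer file and read the cached CSV, skipping the GTF parse."""
    pointer = json.loads((Path(cache_dir) / "last_parsed.json").read_text())
    with open(pointer["csv"], newline="") as fh:
        return list(csv.DictReader(fh))


with tempfile.TemporaryDirectory() as tmp:
    toy_gtf = Path(tmp) / "toy.gtf"
    toy_gtf.write_text(
        'chr1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "A";\n'
        'chr1\ttoy\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    )
    cache_as_csv(toy_gtf, tmp)   # first run: parse and cache
    rows = load_cached(tmp)      # later runs: no GTF path needed
```

The expensive step, splitting the attribute column of every record, happens only in `cache_as_csv`; `load_cached` reads an already flat table.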
### Contact
If you have suggestions, questions, or comments, please reach out to Michael Vinyard via [email](mailto:[email protected])